Rainbows! Rack HTTP server user/dev discussion
* Unicorn is killing our rainbows workers
@ 2012-07-18 18:52 Samuel Kadolph
From: Samuel Kadolph @ 2012-07-18 18:52 UTC (permalink / raw)
  To: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw

Hey rainbows-talk,

We have 40 servers that each run Rainbows! with 2 workers of 100
threads each, using ThreadPool. We're having an issue where unicorn is
killing the worker processes. We use ThreadTimeout (set to 70 seconds)
and originally had the unicorn timeout set to 150 seconds, and we're
seeing unicorn eventually kill each worker. So we bumped the
timeout to 300 seconds; it took about 5 minutes, but then we started
seeing unicorn kill workers again. You can see our stderr
log file (timeout at 300s) at
https://gist.github.com/9ec96922e55a59753997. Any insight into why
unicorn is killing our ThreadPool workers would help us greatly. If
you require additional info I would be happy to provide it.
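
For context, here is a sketch of what such a configuration might look like. The option names follow the Rainbows!/unicorn documentation; everything beyond the numbers given above is assumed:

```ruby
# rainbows.conf.rb -- a sketch of the setup described above (layout assumed)
Rainbows! do
  use :ThreadPool        # threaded concurrency model
  worker_connections 100 # 100 threads per worker process
end

worker_processes 2       # 2 workers per server
timeout 300              # unicorn master kills a worker after this many seconds

# ThreadTimeout is Rack middleware, enabled in config.ru, e.g.:
#   use Rainbows::ThreadTimeout, :timeout => 70
```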

Samuel Kadolph
samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org
16134043579
_______________________________________________
Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
http://rubyforge.org/mailman/listinfo/rainbows-talk
Do not quote signatures (like this one) or top post when replying



* Re: Unicorn is killing our rainbows workers
@ 2012-07-18 19:20   ` Jason Lewis
From: Jason Lewis @ 2012-07-18 19:20 UTC (permalink / raw)
  To: Rainbows! list

Sorry to add unproductive chatter, but this is the best subject line I've
ever seen on a tech mailing list.

:-)

Jason

On 2012-07-18 14:52, "Samuel Kadolph" <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org>
wrote:

>Hey rainbows-talk,
>
>We have 40 servers that each run rainbows with 2 workers with 100
>threads using ThreadPool. We're having an issue where unicorn is
>killing the worker process. We use ThreadTimeout (set to 70 seconds)
>and originally had the unicorn timeout set to 150 seconds and we're
>seeing unicorn eventually killing each worker. So we bumped the
>timeout to 300 seconds and it took about 5 minutes but we started
>seeing unicorn starting to kill workers again. You can see our stderr
>log file (timeout at 300s) at
>https://gist.github.com/9ec96922e55a59753997. Any insight into why
>unicorn is killing our ThreadPool workers would help us greatly. If
>you require additional info I would be happy to provide it.
>
>Samuel Kadolph
>samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org
>16134043579




* Re: Unicorn is killing our rainbows workers
@ 2012-07-18 21:52   ` Eric Wong
From: Eric Wong @ 2012-07-18 21:52 UTC (permalink / raw)
  To: Rainbows! list

Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
> Hey rainbows-talk,
> 
> We have 40 servers that each run rainbows with 2 workers with 100
> threads using ThreadPool. We're having an issue where unicorn is
> killing the worker process. We use ThreadTimeout (set to 70 seconds)
> and originally had the unicorn timeout set to 150 seconds and we're
> seeing unicorn eventually killing each worker. So we bumped the
> timeout to 300 seconds and it took about 5 minutes but we started
> seeing unicorn starting to kill workers again. You can see our stderr
> log file (timeout at 300s) at
> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> unicorn is killing our ThreadPool workers would help us greatly. If
> you require additional info I would be happy to provide it.

Which Ruby version/patchlevel are you using?  1.8 and 1.9 have vastly
different thread implementations and workarounds to deal with.

What C extensions are you using?

ThreadTimeout might also be conflicting with some libraries you use and
causing deadlocks.  Also, ThreadTimeout might not be a good idea with
many common libraries which:

1) use the stdlib Timeout internally
2) rely on ensure clauses firing

ThreadTimeout turns out to be difficult to use correctly with existing
code, so it may not be appropriate for you.  Your app should use
localized timeouts as much as possible (using timeout mechanisms built
into libraries you use).
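
As a concrete (illustrative) example of localized timeouts using the mechanisms built into stock libraries, with made-up values:

```ruby
require 'net/http'

# Per-operation timeouts from the library itself, instead of one
# Thread#kill-based timeout wrapped around the whole request:
http = Net::HTTP.new('example.com', 80)
http.open_timeout = 2  # seconds allowed to establish the TCP connection
http.read_timeout = 5  # seconds allowed for each read from the socket

# Database drivers typically offer the same, e.g. (option names from
# the mysql2 docs, not verified against every version):
#   Mysql2::Client.new(:host => 'db', :connect_timeout => 2, :read_timeout => 5)
```

Timeouts like these raise inside the calling thread at well-defined points, so ensure clauses still run normally.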

Also, please don't use a private gist (especially when posting to a
public mailing list); it requires a GitHub account to clone from, and
I'll never require (nor encourage :P) any website account
for contributing to Rainbows!, just an email address.



* Re: Unicorn is killing our rainbows workers
@ 2012-07-18 23:06       ` Samuel Kadolph
From: Samuel Kadolph @ 2012-07-18 23:06 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
>> Hey rainbows-talk,
>>
>> We have 40 servers that each run rainbows with 2 workers with 100
>> threads using ThreadPool. We're having an issue where unicorn is
>> killing the worker process. We use ThreadTimeout (set to 70 seconds)
>> and originally had the unicorn timeout set to 150 seconds and we're
>> seeing unicorn eventually killing each worker. So we bumped the
>> timeout to 300 seconds and it took about 5 minutes but we started
>> seeing unicorn starting to kill workers again. You can see our stderr
>> log file (timeout at 300s) at
>> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> unicorn is killing our ThreadPool workers would help us greatly. If
>> you require additional info I would be happy to provide it.
>
> Which Ruby version/patchlevel are you using?  1.8 and 1.9 have vastly
> different thread implementations and workarounds to deal with.
>
> What C extensions are you using?
>
> ThreadTimeout might also be conflicting with some libraries you use and
> causing deadlocks.  Also, ThreadTimeout might not be a good idea with
> many common libraries which:
>
> 1) use the stdlib Timeout internally
> 2) rely on ensure clauses firing
>
> ThreadTimeout turns out to be difficult to use correctly with existing
> code, so it may not be appropriate for you.  Your app should use
> localized timeouts as much as possible (using timeout mechanisms built
> into libraries you use).
>
> Also, please don't use private gist (especially when posting to public
> mailing list), it requires a github account to clone from and I'll
> never require (nor encourage :P) needing any website account
> for contributing to Rainbows!, just an email address.

We're running ruby 1.9.3-p125 with the performance patches at
https://gist.github.com/1688857. I listed the gems we use and which
ones have C extensions at https://gist.github.com/3139226.

We'll try running without ThreadTimeout. We don't think we're
having deadlock issues because our stress tests do not time out, but
they do return 502s when the rainbows worker gets killed during a request.



* Re: Unicorn is killing our rainbows workers
@ 2012-07-19  0:26           ` Eric Wong
From: Eric Wong @ 2012-07-19  0:26 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
> >> Hey rainbows-talk,
> >>
> >> We have 40 servers that each run rainbows with 2 workers with 100
> >> threads using ThreadPool. We're having an issue where unicorn is
> >> killing the worker process. We use ThreadTimeout (set to 70 seconds)
> >> and originally had the unicorn timeout set to 150 seconds and we're
> >> seeing unicorn eventually killing each worker. So we bumped the
> >> timeout to 300 seconds and it took about 5 minutes but we started
> >> seeing unicorn starting to kill workers again. You can see our stderr
> >> log file (timeout at 300s) at
> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> >> unicorn is killing our ThreadPool workers would help us greatly. If
> >> you require additional info I would be happy to provide it.

Also, are you using "preload_app true" ?

I'm a bit curious how these messages are happening, too:
D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
suspend/hibernation

Can you tell (from the Rails logs) if the to-be-killed workers are still
processing requests/responses in the 300s before the unicorn timeout
hits?  AFAIK, Rails logs the PID of each worker processing the
request.

Also, what in your app takes 150s, or even 70s?  I'm curious why the
timeouts are so high.  I wonder if there are bugs with unicorn/rainbows
with huge timeout values, too...

If anything, I'd lower the unicorn timeout to something low (maybe
5-10s) since that detects hard lockups at the VM level.  Individual
requests in Rainbows! _are_ allowed to take longer than the unicorn
timeout.

Can you reproduce this in a simulation environment or only with real
traffic?  If possible, can you set up an instance with a single worker
process and get an strace ("strace -f") of all the threads when this
happens?

> We're running ruby 1.9.3-p125 with the performance patches at
> https://gist.github.com/1688857.

Can you reproduce this with an unpatched 1.9.3-p194?  I'm not too
familiar with the performance patches, but I'd like to reduce the amount
of less-common/tested code to isolate the issue.

> I listed the gems we use and which
> ones that have c extension at https://gist.github.com/3139226.

Fortunately, I'm familiar with nearly all of these C gems.

Newer versions of mysql2 should avoid potential issues with
ThreadTimeout/Timeout (or anything that hits Thread#kill).  I think
mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
(but possibly related to your issue) bug.

Unrelated to your current issue, I strongly suggest Ruby 1.9.3-p194,
previous versions had a nasty GC memory corruption bug triggered
by Nokogiri (ref: https://github.com/tenderlove/nokogiri/issues/616)

I also have no idea why mongrel is in there :x

> We'll try running without the ThreadTimeout. We don't think we're
> having deadlock issues because our stress tests do not timeout but
> they do 502 when the rainbows worker gets killed during a request.

OK.  I'm starting to believe ThreadTimeout isn't good for the majority
of applications out there, and perhaps the only way is to have support
for this tightly coupled with the VM.  Even then, "ensure" clauses would
still be tricky/ugly to deal with...  So maybe forcing developers to use
app/library-level timeouts for everything they do is the only way.



* Re: Unicorn is killing our rainbows workers
@ 2012-07-19 14:29               ` Samuel Kadolph
From: Samuel Kadolph @ 2012-07-19 14:29 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>> > Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
>> >> Hey rainbows-talk,
>> >>
>> >> We have 40 servers that each run rainbows with 2 workers with 100
>> >> threads using ThreadPool. We're having an issue where unicorn is
>> >> killing the worker process. We use ThreadTimeout (set to 70 seconds)
>> >> and originally had the unicorn timeout set to 150 seconds and we're
>> >> seeing unicorn eventually killing each worker. So we bumped the
>> >> timeout to 300 seconds and it took about 5 minutes but we started
>> >> seeing unicorn starting to kill workers again. You can see our stderr
>> >> log file (timeout at 300s) at
>> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> >> you require additional info I would be happy to provide it.
>
> Also, are you using "preload_app true" ?

Yes we are using preload_app true.

> I'm a bit curious how these messages are happening, too:
> D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> suspend/hibernation

They are strange. My current hunch is that the killings and that message
are symptoms of the same issue, since the message always follows a killing.

> Can you tell (from Rails logs) if the to-be-killed workers are still
> processing requests/responses the 300s before when the unicorn timeout
> hits it?  AFAIK, Rails logs the PID of each worker processing the
> request.

Rails doesn't log the PID, but it would seem that after upgrading to
mysql2 0.2.18 it is no longer killing workers that are busy with
requests.

> Also, what in your app takes 150s, or even 70s?  I'm curious why the
> timeouts are so high.  I wonder if there are bugs with unicorn/rainbows
> with huge timeout values, too...
>
> If anything, I'd lower the unicorn timeout to something low (maybe
> 5-10s) since that detects hard lockups at the VM level.  Individual
> requests in Rainbows! _are_ allowed to take longer than the unicorn
> timeout.

We lowered the unicorn timeout to 5 seconds, but that did not stop
the killings, though they seem to be happening less often. I have
some of our stderr logs after setting the timeout to 5 seconds at
https://gist.github.com/3144250.

> Can you reproduce this in a simulation environment or only with real
> traffic?  If possible, can you setup an instance with a single worker
> process and get an strace ("strace -f") of all the threads when this
> happens?

We haven't been able to reproduce it locally. We have a staging
environment for this app so I will see if I can use it and try to
replicate it.

>> We're running ruby 1.9.3-p125 with the performance patches at
>> https://gist.github.com/1688857.
>
> Can you reproduce this with an unpatched 1.9.3-p194?  I'm not too
> familiar with the performance patches, but I'd like to reduce the amount
> of less-common/tested code to isolate the issue.

We cannot try p194 right now because one of our ops is on a trip but
once he's back I'm sure we'll try that and let you know.

>> I listed the gems we use and which
>> ones that have c extension at https://gist.github.com/3139226.
>
> Fortunately, I'm familiar with nearly all of these C gems.
>
> Newer versions of mysql2 should avoid potential issues with
> ThreadTimeout/Timeout (or anything that hits Thread#kill).  I think
> mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> (but possibly related to your issue) bug,

Upgrading mysql2 seems to have stopped unicorn from killing workers
that are currently busy. We were stress testing it last night, and
after we upgraded to 0.2.18 we had no more 502s from the app, but this
could be a coincidence since the killings still happen.

> Unrelated to your current issue, I strongly suggest Ruby 1.9.3-p194,
> previous versions had a nasty GC memory corruption bug triggered
> by Nokogiri (ref: https://github.com/tenderlove/nokogiri/issues/616)
>
> I also have no idea why mongrel is in there :x

I forgot to only show bundle for production.

>> We'll try running without the ThreadTimeout. We don't think we're
>> having deadlock issues because our stress tests do not timeout but
>> they do 502 when the rainbows worker gets killed during a request.
>
> OK.  I'm starting to believe ThreadTimeout isn't good for the majority
> of applications out there, and perhaps the only way is to have support
> for this tightly coupled with the VM.  Even then, "ensure" clauses would
> still be tricky/ugly to deal with...  So maybe forcing developers to use
> app/library-level timeouts for everything they do is the only way.

Our ops guys say we had this problem before we were using ThreadTimeout.



* Re: Unicorn is killing our rainbows workers
@ 2012-07-19 20:16                   ` Eric Wong
From: Eric Wong @ 2012-07-19 20:16 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> >> > Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
> >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> >> >> you require additional info I would be happy to provide it.
> >
> > Also, are you using "preload_app true" ?
> 
> Yes we are using preload_app true.
> 
> > I'm a bit curious how these messages are happening, too:
> > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > suspend/hibernation
> 
> They are strange. My current hunch is the killing and that message are
> symptoms of the same issue. Since it always follows a killing.

I wonder if there's some background thread one of your gems spawns on
load that causes the master to stall.  I'm not seeing how else unicorn
could think it was in suspend/hibernation.

> > Can you tell (from Rails logs) if the to-be-killed workers are still
> > processing requests/responses the 300s before when the unicorn timeout
> > hits it?  AFAIK, Rails logs the PID of each worker processing the
> > request.
> 
> rails doesn't log the pid but it would seem that after upgrading to
> mysql 0.2.18 it is no longer killing workers that are busy with
> requests.

Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger
is the default Rails logger :)

> > If anything, I'd lower the unicorn timeout to something low (maybe
> > 5-10s) since that detects hard lockups at the VM level.  Individual
> > requests in Rainbows! _are_ allowed to take longer than the unicorn
> > timeout.
> 
> We lowered the unicorn timeout to 5 seconds and but that did not
> change the killings but they seem to be happening less often. I have
> some of our stderr logs after setting the timeout to 5 seconds at
> https://gist.github.com/3144250.

Thanks for trying that!

> > Newer versions of mysql2 should avoid potential issues with
> > ThreadTimeout/Timeout (or anything that hits Thread#kill).  I think
> > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> > (but possibly related to your issue) bug,
> 
> Upgrading mysql2 seems to have stopped unicorn from killing workers
> that are currently busy. We were stress testing it last night and
> after we upgraded to 0.2.18 we had no more 502s from the app but this
> could be a coincidence since the killings are still happen.

Alright, good to know 0.2.18 solved your problems.  Btw, have you
noticed any general connectivity issues to your MySQL server?
There were quite a few bugfixes from 0.2.6..0.2.18.

Anyways, I'm happy your problem seems to be fixed with the mysql2
upgrade :)

> Our ops guys say we had this problem before we were using ThreadTimeout.

OK.  That's somewhat reassuring to know (especially since the culprit
seems to be an old mysql2 gem).  I've had other users (privately) report
issues with recursive locking because of ensure clauses (e.g.
Mutex#synchronize) that I forgot to document.



* Re: Unicorn is killing our rainbows workers
@ 2012-07-19 20:57                       ` Samuel Kadolph
From: Samuel Kadolph @ 2012-07-19 20:57 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>
> Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > >> > Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
> > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> > >> >> you require additional info I would be happy to provide it.
> > >
> > > Also, are you using "preload_app true" ?
> >
> > Yes we are using preload_app true.
> >
> > > I'm a bit curious how these messages are happening, too:
> > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > > suspend/hibernation
> >
> > They are strange. My current hunch is the killing and that message are
> > symptoms of the same issue. Since it always follows a killing.
>
> I wonder if there's some background thread one of your gems spawns on
> load that causes the master to stall.  I'm not seeing how else unicorn
> could think it was in suspend/hibernation.
>
> > > Can you tell (from Rails logs) if the to-be-killed workers are still
> > > processing requests/responses the 300s before when the unicorn timeout
> > > hits it?  AFAIK, Rails logs the PID of each worker processing the
> > > request.
> >
> > rails doesn't log the pid but it would seem that after upgrading to
> > mysql 0.2.18 it is no longer killing workers that are busy with
> > requests.
>
> Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger
> is the default Rails logger :)
>
> > > If anything, I'd lower the unicorn timeout to something low (maybe
> > > 5-10s) since that detects hard lockups at the VM level.  Individual
> > > requests in Rainbows! _are_ allowed to take longer than the unicorn
> > > timeout.
> >
> > We lowered the unicorn timeout to 5 seconds and but that did not
> > change the killings but they seem to be happening less often. I have
> > some of our stderr logs after setting the timeout to 5 seconds at
> > https://gist.github.com/3144250.
>
> Thanks for trying that!
>
> > > Newer versions of mysql2 should avoid potential issues with
> > > ThreadTimeout/Timeout (or anything that hits Thread#kill).  I think
> > > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> > > (but possibly related to your issue) bug,
> >
> > Upgrading mysql2 seems to have stopped unicorn from killing workers
> > that are currently busy. We were stress testing it last night and
> > after we upgraded to 0.2.18 we had no more 502s from the app but this
> > could be a coincidence since the killings are still happen.
>
> Alright, good to know 0.2.18 solved your problems.  Btw, have you
> noticed any general connectivity issues to your MySQL server?
> There were quite a few bugfixes from 0.2.6..0.2.18, though.
>
> Anyways, I'm happy your problem seems to be fixed with the mysql2
> upgrade :)

Unfortunately that didn't fix the problem. We had a large sale today
and had 2 502s. We're going to try p194 next week and I'll let you
know if that fixes it.

> > Our ops guys say we had this problem before we were using ThreadTimeout.
>
> OK.  That's somewhat reassuring to know (especially since the culprit
> seems to be an old mysql2 gem).  I've had other users (privately) report
> issues with recursive locking because of ensure clauses (e.g.
> Mutex#synchronize) that I forgot to document.

We're going to try going without ThreadTimeout again to make sure
that's not the issue.



* Re: Unicorn is killing our rainbows workers
@ 2012-07-19 21:31                           ` Eric Wong
From: Eric Wong @ 2012-07-19 21:31 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > > > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > > >> > Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
> > > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
> > > >> >> you require additional info I would be happy to provide it.
> > > >
> > > > Also, are you using "preload_app true" ?
> > >
> > > Yes we are using preload_app true.
> > >
> > > > I'm a bit curious how these messages are happening, too:
> > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> > > > suspend/hibernation
> > >
> > > They are strange. My current hunch is the killing and that message are
> > > symptoms of the same issue. Since it always follows a killing.
> >
> > I wonder if there's some background thread one of your gems spawns on
> > load that causes the master to stall.  I'm not seeing how else unicorn
> > could think it was in suspend/hibernation.

> > Anyways, I'm happy your problem seems to be fixed with the mysql2
> > upgrade :)
> 
> Unfortunately that didn't fix the problem. We had a large sale today
> and had 2 502s. We're going to try p194 on next week and I'll let you
> know if that fixes it.

Are you seeing the same errors as before in stderr for those?

Can you also try disabling preload_app?

But before disabling preload_app, can you check a few things on
a running master?

* "lsof -p <pid_of_master>"

  To see if there's odd connections the master is making.

* Assuming you're on Linux, can you also check for any other threads
  the master might be running (and possibly stuck on)?

    ls /proc/<pid_of_master>/task/

  The output should be 2 directories:

    <pid_of_master>/
    <tid_of_timer_thread>/

  If you have a 3rd entry, you can confirm something in your app or one of
  your gems is spawning a background thread which could be throwing
  the master off...
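
A sketch of those checks as shell commands (demonstrated here on the current shell's own PID; substitute the unicorn master's PID in practice):

```shell
#!/bin/sh
pid=$$   # stand-in; use the unicorn master's PID here

# Any odd connections the process is making?
if command -v lsof >/dev/null; then
  lsof -p "$pid" | head
fi

# On Linux, each thread of a process appears as a directory under
# /proc/<pid>/task/.  A unicorn master should show exactly two entries
# (the main thread plus the timer thread); a third entry means some
# gem spawned a background thread in the master.
if [ -d "/proc/$pid/task" ]; then
  ls "/proc/$pid/task"
fi
```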

> > > Our ops guys say we had this problem before we were using ThreadTimeout.
> >
> > OK.  That's somewhat reassuring to know (especially since the culprit
> > seems to be an old mysql2 gem).  I've had other users (privately) report
> > issues with recursive locking because of ensure clauses (e.g.
> > Mutex#synchronize) that I forgot to document.
> 
> We're going to try going without ThreadTimeout again to make sure
> that's not the issue.

Alright.

Btw, I also suggest that any Rails/application-level logs include the PID
and timestamp of each request.  This way you can correlate the worker
being killed with when/if the Rails app stopped processing requests.
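
One way to do that with Ruby's stock Logger (an illustration, not the setup from this thread; the format loosely mimics Hodel3000CompliantLogger):

```ruby
require 'logger'
require 'stringio'

buf = StringIO.new          # stand-in for the real log destination
logger = Logger.new(buf)
logger.formatter = lambda do |severity, time, _progname, msg|
  # timestamp + worker PID on every line
  "#{time.utc.strftime('%b %d %H:%M:%S')} [#{Process.pid}] #{severity}: #{msg}\n"
end
logger.info('GET /example processed')
print buf.string
```

Grepping such a log for a worker's PID then shows exactly when that worker stopped processing requests.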



* Re: Unicorn is killing our rainbows workers
       [not found]                             ` <20120719213125.GA17708-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
@ 2012-07-20  0:23                               ` Samuel Kadolph
       [not found]                                 ` <CAFFC5+MKdkmLknbLeRzMNzfTVoyj9JDahFSd1Nb90vsbgS4fuQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Kadolph @ 2012-07-20  0:23 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Thu, Jul 19, 2012 at 5:31 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>> > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>> > > > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>> > > >> > Samuel Kadolph <samuel.kadolph-/3HedJEncLlQ0OI7PeSoCw@public.gmane.org> wrote:
>> > > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> > > >> >> you require additional info I would be happy to provide it.
>> > > >
>> > > > Also, are you using "preload_app true" ?
>> > >
>> > > Yes we are using preload_app true.
>> > >
>> > > > I'm a bit curious how these messages are happening, too:
>> > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
>> > > > suspend/hibernation
>> > >
>> > > They are strange. My current hunch is that the killing and that message
>> > > are symptoms of the same issue, since it always follows a killing.
>> >
>> > I wonder if there's some background thread one of your gems spawns on
>> > load that causes the master to stall.  I'm not seeing how else unicorn
>> > could think it was in suspend/hibernation.
>
>> > Anyways, I'm happy your problem seems to be fixed with the mysql2
>> > upgrade :)
>>
>> Unfortunately that didn't fix the problem. We had a large sale today
>> and had 2 502s. We're going to try p194 next week and I'll let you
>> know if that fixes it.
>
> Are you seeing the same errors as before in stderr for those?

Yeah, we get the same killing, reaping and suspend/hibernation
messages with the 5-second timeout. Upgrading mysql2 seemed to have
prevented any 502s during our stress tests, but that was not the
case.

> Can you also try disabling preload_app?
>
> But before disabling preload_app, could you also check a few things on
> a running master?
>
> * "lsof -p <pid_of_master>"
>
>   To see if there's odd connections the master is making.
>
> * Assuming you're on Linux, can you also check for any other threads
>   the master might be running (and possibly stuck on)?
>
>     ls /proc/<pid_of_master>/task/
>
>   The output should be 2 directories:
>
>     <pid_of_master>/
>     <tid_of_timer_thread>/
>
>   If you have a 3rd entry, that confirms something in your app or one of
>   your gems is spawning a background thread, which could be throwing
>   the master off...

I'll see if we can try this tomorrow but it will probably be on Monday.

>> > > Our ops guys say we had this problem before we were using ThreadTimeout.
>> >
>> > OK.  That's somewhat reassuring to know (especially since the culprit
>> > seems to be an old mysql2 gem).  I've had other users (privately) report
>> > issues with recursive locking because of ensure clauses (e.g.
>> > Mutex#synchronize) that I forgot to document.
>>
>> We're going to try going without ThreadTimeout again to make sure
>> that's not the issue.
>
> Alright.
>
> Btw, I also suggest any Rails/application-level logs include the PID and
> timestamp of the request.  This way you can see and correlate the worker
> killing the request to when/if the Rails app stopped processing
> requests.

We found that one of our servers was actually out of the ELB pool, so
it wasn't getting pinged constantly, and it did not have any killing
messages (other than during deploys, which also produced the
suspend/hibernation messages). We'll have more time free next week to
dig further into this.

* Re: Unicorn is killing our rainbows workers
       [not found]                                 ` <CAFFC5+MKdkmLknbLeRzMNzfTVoyj9JDahFSd1Nb90vsbgS4fuQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-26 23:48                                   ` Eric Wong
       [not found]                                     ` <20120726234845.GA29453-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2012-07-26 23:48 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> We'll have more time free next week to dig further into this.

Hi Samuel, any update on this?

* Re: Unicorn is killing our rainbows workers
       [not found]                                     ` <20120726234845.GA29453-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
@ 2012-07-27  0:00                                       ` Samuel Kadolph
       [not found]                                         ` <CAFFC5+PvKhbRWH9aLKgc3k-z+2tEPpqLrMa5+6mEUnO2K_X+9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Kadolph @ 2012-07-27  0:00 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Thu, Jul 26, 2012 at 7:48 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> We'll have more time free next week to dig further into this.
>
> Hi Samuel, any update on this?

Our ops guys have been busy, so I don't have the output from lsof, but
it didn't look like the master was spawning any extra threads or
opening any unexplainable connections. But I think we should have been
checking the worker processes and not the master, right?

We haven't tried disabling preload_app yet, but we have tried
ruby-1.9.3-p194 and that did not resolve the issue. We've also
upgraded to Rails 3.2 and that did not resolve the issue either.

* Re: Unicorn is killing our rainbows workers
       [not found]                                         ` <CAFFC5+PvKhbRWH9aLKgc3k-z+2tEPpqLrMa5+6mEUnO2K_X+9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-27  0:11                                           ` Eric Wong
       [not found]                                             ` <20120727001125.GA30957-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2012-07-27  0:11 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> On Thu, Jul 26, 2012 at 7:48 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> >> We'll have more time free next week to dig further into this.
> >
> > Hi Samuel, any update on this?
> 
> Our ops guys have been busy so I don't have the output from lsof but
> it didn't look like it was spawning any extra threads or opening any
> unexplainable connections. But I think we should have been checking
> the worker processes and not the master, right?

Definitely check the master, too.  It's the master that seems to
believe it's suspended, so that makes me believe something is wrong
with the master (and this is likely due to preload_app).

> Haven't tried disabling preload_app yet but we have tried

* Re: Unicorn is killing our rainbows workers
       [not found]                                             ` <20120727001125.GA30957-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
@ 2012-07-27 20:01                                               ` Samuel Kadolph
       [not found]                                                 ` <CAFFC5+MqyVEfLJN2rxae7_NPOT=8+X4cBbTz6YYgLzuC8ySXjg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Kadolph @ 2012-07-27 20:01 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Thu, Jul 26, 2012 at 8:11 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> On Thu, Jul 26, 2012 at 7:48 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>> > Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> >> We'll have more time free next week to dig further into this.
>> >
>> > Hi Samuel, any update on this?
>>
>> Our ops guys have been busy so I don't have the output from lsof but
>> it didn't look like it was spawning any extra threads or opening any
>> unexplainable connections. But I think we should have been checking
>> the worker processes and not the master, right?
>
> Definitely check the master, too.  It's the master that seems to
> believe it's suspended, so that makes me believe something is wrong
> with the master (and this is likely due to preload_app).
>
>> Haven't tried disabling preload_app yet but we have tried

I've got the output of lsof and ls at https://gist.github.com/3190171.

* Re: Unicorn is killing our rainbows workers
       [not found]                                                 ` <CAFFC5+MqyVEfLJN2rxae7_NPOT=8+X4cBbTz6YYgLzuC8ySXjg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-27 20:40                                                   ` Eric Wong
       [not found]                                                     ` <20120727204040.GA2192-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Wong @ 2012-07-27 20:40 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> On Thu, Jul 26, 2012 at 8:11 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> >> Our ops guys have been busy so I don't have the output from lsof but
> >> it didn't look like it was spawning any extra threads or opening any
> >> unexplainable connections. But I think we should have been checking
> >> the worker processes and not the master, right?
> >
> > Definitely check the master, too.  It's the master that seems to
> > believe it's suspended, so that makes me believe something is wrong
> > with the master (and this is likely due to preload_app).
> >
> >> Haven't tried disabling preload_app yet but we have tried
> 
> I've got the output of lsof and ls at https://gist.github.com/3190171.

Thanks, that's the output for the master?  I don't see anything
obviously wrong.

I seem to recall the Ruby library responsible for the following log file
also spawns its own background thread, but your "ls" only shows 2 tasks
(instead of 3):

> ruby    26564 root    9w   REG              202,1     51221   529742 APP_PATH/shared/log/newrelic_agent.log

> $ ls /proc/26564/task/
> 26564  27052

(While the Ruby code for the module responsible for that log file is
 technically "open", it's not Free, so I'm not comfortable looking at
 that code).

* Re: Unicorn is killing our rainbows workers
       [not found]                                                     ` <20120727204040.GA2192-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
@ 2012-07-31 14:09                                                       ` Samuel Kadolph
       [not found]                                                         ` <CAFFC5+OYa5+nVqLFnzVkfAyq8WU57QztkvcP5tdSBDWU-2+SaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 17+ messages in thread
From: Samuel Kadolph @ 2012-07-31 14:09 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

On Fri, Jul 27, 2012 at 4:40 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
>> On Thu, Jul 26, 2012 at 8:11 PM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
>> >> Our ops guys have been busy so I don't have the output from lsof but
>> >> it didn't look like it was spawning any extra threads or opening any
>> >> unexplainable connections. But I think we should have been checking
>> >> the worker processes and not the master, right?
>> >
>> > Definitely check the master, too.  It's the master that seems to
>> > believe it's suspended, so that makes me believe something is wrong
>> > with the master (and this is likely due to preload_app).
>> >
>> >> Haven't tried disabling preload_app yet but we have tried
>>
>> I've got the output of lsof and ls at https://gist.github.com/3190171.
>
> Thanks, that's the output for the master?  I don't see anything
> obviously wrong.
>
> I seem to recall the Ruby library responsible for the following log file
> also spawns its own background thread, but your "ls" only shows 2 tasks
> (instead of 3):
>
>> ruby    26564 root    9w   REG              202,1     51221   529742 APP_PATH/shared/log/newrelic_agent.log
>
>> $ ls /proc/26564/task/
>> 26564  27052
>
> (While the Ruby code for the module responsible for that log file is
>  technically "open", it's not Free, so I'm not comfortable looking at
>  that code).

So, 2 updates: yes, that lsof output is from the master process, and
using preload_app false solves the issue. No more killings, and the
suspend/hibernation messages stopped as well. We lost New Relic data,
so we're going to try putting preload_app back to true and removing
the newrelic gem.
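
For reference, the fix sketched as a Rainbows! config snippet (the
worker and thread counts are the ones from earlier in the thread;
everything else is illustrative, not the actual config):

```ruby
# rainbows.conf.rb -- hypothetical config reflecting the change above.
worker_processes 2

# With preload_app true, any background thread a gem starts at load
# time lives in the master and can stall its heartbeat; preload_app
# false defers app loading until after fork, at the cost of slower
# worker startup and no copy-on-write memory sharing.
preload_app false

Rainbows! do
  use :ThreadPool
  worker_connections 100
end
```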

* Re: Unicorn is killing our rainbows workers
       [not found]                                                         ` <CAFFC5+OYa5+nVqLFnzVkfAyq8WU57QztkvcP5tdSBDWU-2+SaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-31 20:28                                                           ` Eric Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Eric Wong @ 2012-07-31 20:28 UTC (permalink / raw)
  To: Rainbows! list; +Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph <samuel.kadolph-BqItboTaHx1BDgjK7y7TUQ@public.gmane.org> wrote:
> So 2 updates: yes that lsof output is from the master process and
> using preload_app false solves the issue. No more killings and the
> suspend/hibernation messages stopped as well. We lost newrelic data so
> we're going to try putting preload_app back to true and removing the
> newrelic gem.

Thank you for the updates and reporting the resolution!  Hopefully
all goes well with other gems.

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-07-18 18:52 Unicorn is killing our rainbows workers Samuel Kadolph
     [not found] ` <CAFFC5+MUdUoXhBXvw8VnnVAZsQpN1idELr0nc_Xm0HYcdtQVhA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-18 19:20   ` Jason Lewis
2012-07-18 21:52   ` Eric Wong
     [not found]     ` <20120718215222.GA11539-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-18 23:06       ` Samuel Kadolph
     [not found]         ` <CAFFC5+N=_bnyM=0WbtLxPAncs0TV4wA9P8TXZ_-T3qOtW-+w3Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-19  0:26           ` Eric Wong
     [not found]             ` <20120719002641.GA17210-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-19 14:29               ` Samuel Kadolph
     [not found]                 ` <CAFFC5+NfChEobr7asqPx+3-U8_mHZqOgCLjRw=w6iCZ=z0-oCg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-19 20:16                   ` Eric Wong
     [not found]                     ` <20120719201633.GA8203-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-19 20:57                       ` Samuel Kadolph
     [not found]                         ` <CAFFC5+NiPhu3oyEZ8woDdmH1zdPDDy9-fK3FhWPqv-6u=yFxgg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-19 21:31                           ` Eric Wong
     [not found]                             ` <20120719213125.GA17708-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-20  0:23                               ` Samuel Kadolph
     [not found]                                 ` <CAFFC5+MKdkmLknbLeRzMNzfTVoyj9JDahFSd1Nb90vsbgS4fuQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-26 23:48                                   ` Eric Wong
     [not found]                                     ` <20120726234845.GA29453-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-27  0:00                                       ` Samuel Kadolph
     [not found]                                         ` <CAFFC5+PvKhbRWH9aLKgc3k-z+2tEPpqLrMa5+6mEUnO2K_X+9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-27  0:11                                           ` Eric Wong
     [not found]                                             ` <20120727001125.GA30957-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-27 20:01                                               ` Samuel Kadolph
     [not found]                                                 ` <CAFFC5+MqyVEfLJN2rxae7_NPOT=8+X4cBbTz6YYgLzuC8ySXjg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-27 20:40                                                   ` Eric Wong
     [not found]                                                     ` <20120727204040.GA2192-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-31 14:09                                                       ` Samuel Kadolph
     [not found]                                                         ` <CAFFC5+OYa5+nVqLFnzVkfAyq8WU57QztkvcP5tdSBDWU-2+SaQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-31 20:28                                                           ` Eric Wong

Rainbows! Rack HTTP server user/dev discussion

Archives are clonable:
	git clone --mirror http://bogomips.org/rainbows-public
	git clone --mirror http://ou63pmih66umazou.onion/rainbows-public

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.lang.ruby.rainbows
	nntp://ou63pmih66umazou.onion/inbox.comp.lang.ruby.rainbows

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git