From: Samuel Kadolph
Newsgroups: gmane.comp.lang.ruby.rainbows.general
Subject: Re: Unicorn is killing our rainbows workers
Date: Thu, 19 Jul 2012 10:29:02 -0400
To: "Rainbows! list"
Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg
References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net>
On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
> Samuel Kadolph wrote:
>> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
>> > Samuel Kadolph wrote:
>> >> Hey rainbows-talk,
>> >>
>> >> We have 40 servers that each run rainbows with 2 workers with 100
>> >> threads using ThreadPool. We're having an issue where unicorn is
>> >> killing the worker process. We use ThreadTimeout (set to 70 seconds)
>> >> and originally had the unicorn timeout set to 150 seconds, and we're
>> >> seeing unicorn eventually kill each worker. So we bumped the
>> >> timeout to 300 seconds, and after about 5 minutes we started
>> >> seeing unicorn kill workers again. You can see our stderr
>> >> log file (timeout at 300s) at
>> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> >> you require additional info I would be happy to provide it.
>
> Also, are you using "preload_app true" ?

Yes, we are using preload_app true.

> I'm a bit curious how these messages are happening, too:
> D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
> suspend/hibernation

They are strange. My current hunch is that the killings and that message
are symptoms of the same issue, since the message always follows a killing.

> Can you tell (from Rails logs) if the to-be-killed workers are still
> processing requests/responses the 300s before when the unicorn timeout
> hits it? AFAIK, Rails logs the PID of each worker processing the
> request.

Rails doesn't log the PID, but it would seem that after upgrading to
mysql2 0.2.18 unicorn is no longer killing workers that are busy with
requests.

> Also, what in your app takes 150s, or even 70s? I'm curious why the
> timeouts are so high. I wonder if there are bugs with unicorn/rainbows
> with huge timeout values, too...
>
> If anything, I'd lower the unicorn timeout to something low (maybe
> 5-10s) since that detects hard lockups at the VM level. Individual
> requests in Rainbows! _are_ allowed to take longer than the unicorn
> timeout.

We lowered the unicorn timeout to 5 seconds; that did not stop the
killings, but they seem to be happening less often now. I have some of
our stderr logs after setting the timeout to 5 seconds at
https://gist.github.com/3144250.
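To make the setup concrete, the relevant parts of our config look roughly
like this (just a sketch using the values mentioned above; the file name
and exact layout are approximate, not a copy of our real config):

    # rainbows.conf -- approximation of the setup described above
    worker_processes 2        # per server
    preload_app true

    # unicorn master timeout: originally 150s, then 300s, now 5s while testing
    timeout 5

    Rainbows! do
      use :ThreadPool         # threaded concurrency model
      worker_connections 100  # 100 threads per worker, if that is the right knob
    end

ThreadTimeout is wired up separately as Rack middleware in config.ru,
roughly "use Rainbows::ThreadTimeout, :timeout => 70", if I'm remembering
its option name correctly.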
> Can you reproduce this in a simulation environment or only with real
> traffic? If possible, can you setup an instance with a single worker
> process and get an strace ("strace -f") of all the threads when this
> happens?

We haven't been able to reproduce it locally. We have a staging
environment for this app, so I will see if I can use it to replicate the
problem.

>> We're running ruby 1.9.3-p125 with the performance patches at
>> https://gist.github.com/1688857.
>
> Can you reproduce this with an unpatched 1.9.3-p194? I'm not too
> familiar with the performance patches, but I'd like to reduce the amount
> of less-common/tested code to isolate the issue.

We cannot try p194 right now because one of our ops is on a trip, but once
he's back I'm sure we'll try that and let you know.

>> I listed the gems we use and which ones have C extensions at
>> https://gist.github.com/3139226.
>
> Fortunately, I'm familiar with nearly all of these C gems.
>
> Newer versions of mysql2 should avoid potential issues with
> ThreadTimeout/Timeout (or anything that hits Thread#kill). I think
> mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare
> (but possibly related to your issue) bug.

Upgrading mysql2 seems to have stopped unicorn from killing workers that
are currently busy. We were stress testing it last night, and after we
upgraded to 0.2.18 we had no more 502s from the app, but this could be a
coincidence since the killings still happen.

> Unrelated to your current issue, I strongly suggest Ruby 1.9.3-p194,
> previous versions had a nasty GC memory corruption bug triggered
> by Nokogiri (ref: https://github.com/tenderlove/nokogiri/issues/616)
>
> I also have no idea why mongrel is in there :x

I forgot to limit the bundle listing to our production gems.

>> We'll try running without the ThreadTimeout. We don't think we're
>> having deadlock issues because our stress tests do not timeout but
>> they do 502 when the rainbows worker gets killed during a request.
>
> OK. I'm starting to believe ThreadTimeout isn't good for the majority
> of applications out there, and perhaps the only way is to have support
> for this tightly coupled with the VM. Even then, "ensure" clauses would
> still be tricky/ugly to deal with... So maybe forcing developers to use
> app/library-level timeouts for everything they do is the only way.

Our ops guys say we had this problem before we were using ThreadTimeout.
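If we do end up replacing ThreadTimeout with app/library-level timeouts,
I assume the database side would look something like this (option names
as I recall them from the mysql2 README; I have not checked which of them
our 0.2.x version actually supports):

    require "mysql2"

    # Hypothetical library-level timeout: let the MySQL client give up on
    # its own instead of relying on Thread#kill from timeout middleware.
    client = Mysql2::Client.new(
      :host            => "db.internal",  # placeholder host
      :username        => "app",          # placeholder credentials
      :connect_timeout => 5,   # seconds to wait while connecting
      :read_timeout    => 10   # seconds to wait for a query result
    )
    client.query("SELECT 1")   # should raise Mysql2::Error instead of hanging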