From: Samuel Kadolph
Newsgroups: gmane.comp.lang.ruby.rainbows.general
Subject: Re: Unicorn is killing our rainbows workers
Date: Thu, 19 Jul 2012 20:23:35 -0400
References: <20120718215222.GA11539@dcvr.yhbt.net>
 <20120719002641.GA17210@dcvr.yhbt.net>
 <20120719201633.GA8203@dcvr.yhbt.net>
 <20120719213125.GA17708@dcvr.yhbt.net>
In-Reply-To: <20120719213125.GA17708-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
To: "Rainbows! list"
Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg
On Thu, Jul 19, 2012 at 5:31 PM, Eric Wong wrote:
> Samuel Kadolph wrote:
>> On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong wrote:
>> > Samuel Kadolph wrote:
>> > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote:
>> > > > Samuel Kadolph wrote:
>> > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
>> > > >> > Samuel Kadolph wrote:
>> > > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
>> > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If
>> > > >> >> you require additional info I would be happy to provide it.
>> > > >
>> > > > Also, are you using "preload_app true" ?
>> > >
>> > > Yes we are using preload_app true.
>> > >
>> > > > I'm a bit curious how these messages are happening, too:
>> > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after
>> > > > suspend/hibernation
>> > >
>> > > They are strange. My current hunch is the killing and that message are
>> > > symptoms of the same issue. Since it always follows a killing.
>> >
>> > I wonder if there's some background thread one of your gems spawns on
>> > load that causes the master to stall. I'm not seeing how else unicorn
>> > could think it was in suspend/hibernation.
>
>> > Anyways, I'm happy your problem seems to be fixed with the mysql2
>> > upgrade :)
>>
>> Unfortunately that didn't fix the problem. We had a large sale today
>> and had 2 502s. We're going to try p194 on next week and I'll let you
>> know if that fixes it.
>
> Are you seeing the same errors as before in stderr for those?

Yeah, we get the same killing, reaping and suspend/hibernation messages
with the 5 second timeout. Upgrading mysql2 seemed to have prevented any
502s during our stress tests but that was not the case.

> Can you also try disabling preload_app?
>
> But before disabling preload_app, you can also check a few things on
> a running master?
>
> * "lsof -p <pid>"
>
>   To see if there's odd connections the master is making.
>
> * Assuming you're on Linux, can you also check for any other threads
>   the master might be running (and possibly stuck on)?
>
>       ls /proc/<pid>/task/
>
>   The output should be 2 directories:
>
>       <pid>/
>       <tid>/
>
>   If you have a 3rd entry, you can confirm something in your app or one
>   of your gems is spawning a background thread which could be throwing
>   the master off...

I'll see if we can try this tomorrow but it will probably be on Monday.

>> > > Our ops guys say we had this problem before we were using ThreadTimeout.
>> >
>> > OK. That's somewhat reassuring to know (especially since the culprit
>> > seems to be an old mysql2 gem). I've had other users (privately) report
>> > issues with recursive locking because of ensure clauses (e.g.
>> > Mutex#synchronize) that I forgot to document.
>>
>> We're going to try going without ThreadTimeout again to make sure
>> that's not the issue.
>
> Alright.
>
> Btw, I also suggest any Rails/application-level logs include the PID and
> timestamp of the request. This way you can see and correlate the worker
> killing the request to when/if the Rails app stopped processing
> requests.

We found that one of our servers was actually out of the ELB pool so it
wasn't getting pinged constantly and it does not have any killing
messages (other than deploys, which also had the suspend/hibernation
messages).
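For the PID/timestamp suggestion, a minimal sketch of what we'd add
(assuming Rails 3.2's config.log_tags; the application class name below
is just a placeholder, and this is untested on our end):

    # config/environments/production.rb -- sketch only
    require "time" # for Time#iso8601

    MyApp::Application.configure do
      # Prefix each request's log lines with the worker PID and a timestamp
      # so they can be lined up against the killing/reaping messages in the
      # unicorn/rainbows stderr log.
      config.log_tags = [
        lambda { |req| "pid=#{Process.pid}" },
        lambda { |req| Time.now.utc.iso8601 }
      ]
    end

That should make it easy to see the last request each worker was handling
before the master reaped it.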
We'll have more time free next week to dig further into this.
_______________________________________________
Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
http://rubyforge.org/mailman/listinfo/rainbows-talk
Do not quote signatures (like this one) or top post when replying