From: Xiao Yu
Date: Wed, 20 Jan 2021 05:21:44 +0000
Subject: Re: Segfaults on http_close?
To: Eric Wong
Cc: Xiao Yu, Arkadi Colson, cmogstored-public@yhbt.net
References: <20210111212621.GA12555@dcvr> <20210117095109.GA28219@dcvr>
In-Reply-To: <20210117095109.GA28219@dcvr>

Thanks for the quick response!

Sorry about the delay, but I ran into a couple of issues (sorry, kinda
learning gdb and compiling binaries in general on the go here) and
have not been able to capture more useful logs yet, as crashes seem to
have slowed / stopped since recompiling and reloading. For the record,
I recompiled cmogstored with the newer RH `devtoolset-9-toolchain`
(9.1) and it has not crashed since.

Also sorry about the lack of useful logs in my initial message, but
neither the kernel logs nor messages contained anything interesting
around the segfaults. Making matters worse, we didn't consistently
reload cmogstored as various versions of the compiled binary were
installed across the cluster, and we didn't save the debugging symbols
from each of the compilations, so I can't reply with a more useful
stack trace from the current core dumps.

For what it's worth, we also run another cluster with 1.7.3 that we've
upgraded over the years, and we have never seen this issue there. Those
nodes are on a different distro (Debian), if that makes any difference.

On Sun, Jan 17, 2021 at 9:51 AM Eric Wong wrote:
>
> Eric Wong wrote:
> > Xiao Yu wrote:
> > > Howdy, we are running a 96 node cmogstored cluster and have noticed
> > > that when the cluster is busy with lots of writes we occasionally get
> > > segfaults in cmogstored. This has happened 7 times in the past week
> > > each time on a random and different cmogstored node. Looking at the
> > > abrt backtrace of the core dump shows something similar to the
> > > following in each instance:
> >
> > Thanks for the bug report, sorry this caused you trouble
> > and I wonder if this is the same issue Arkadi was hitting
> > last year...
>
> Hi Xiao and Arkadi: Can either of you try the 1-line patch
> below to disable pthread_attr_setstacksize?

I'm going to let the current version run for a bit more and hope we
get a core dump. I think I finally have things cleaned up so that all
96 of our servers are running the same binary, and I have saved the
corresponding debugging symbols file so we can check the stack trace
later. If we see another crash I'll recompile with the patch below and
try again. :)

> I took another look at the code and couldn't find any other
> culprits... (though I admit I'm not mentally as sharp given
> pandemic-induced stress and anxiety :<).

Oof, sorry to hear that, take care of yourself man!

> Given the mysterious nature of this problem and my inability to
> reproduce it; I wonder if there's stack corruption with certain
> compilers/glibc happening and blowing past the 4K guard page...
>
> @Arkadi: Xiao recently brought up this (or similar) issue again:
> https://yhbt.net/cmogstored-public/20210111212621.GA12555@dcvr/T/
>
> diff --git a/thrpool.c b/thrpool.c
> index bc67ea0..bd71f95 100644
> --- a/thrpool.c
> +++ b/thrpool.c
> @@ -141,7 +141,7 @@ thrpool_add(struct mog_thrpool *tp, unsigned size, unsigned long *nr_eagain)
>
>  	CHECK(int, 0, pthread_attr_init(&attr));
>
> -	if (stacksize > 0)
> +	if (0 && stacksize > 0)
>  		CHECK(int, 0, pthread_attr_setstacksize(&attr, stacksize));
>
>  	thr = &tp->threads[tp->n_threads].thr;
>
>
> In retrospect, running a small stack is unnecessary on 64-bit
> systems due to practically unlimited virtual address space and
> lazy allocation. It may still make sense for 32-bit (some
> embedded systems), though they can set RLIMIT_STACK before
> launching.
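
For reference, here is a minimal standalone sketch of what the quoted
patch changes. It is not cmogstored code: the helper names
init_thread_attr() and attr_stack_size() are made up for illustration,
and only the pthread_attr_* calls are real POSIX API. Passing 0 as the
stack size mimics the patched "if (0 && ...)" branch, so the attr keeps
the platform default instead of a small explicit stack.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from cmogstored): initialize *attr and apply
 * an explicit stack size only when stacksize > 0.  With stacksize == 0
 * the attr keeps the platform default, which is the behavior the quoted
 * patch forces unconditionally.
 * Returns 0 on success or the error number from the pthread call.
 */
static int init_thread_attr(pthread_attr_t *attr, size_t stacksize)
{
	int rc = pthread_attr_init(attr);

	if (rc != 0)
		return rc;
	if (stacksize > 0) {
		rc = pthread_attr_setstacksize(attr, stacksize);
		if (rc != 0)
			pthread_attr_destroy(attr);
	}
	return rc;
}

/*
 * Hypothetical helper: report the stack size that a thread created
 * with the attr from init_thread_attr() would request.
 */
static size_t attr_stack_size(size_t stacksize)
{
	pthread_attr_t attr;
	size_t actual = 0;
	int rc = init_thread_attr(&attr, stacksize);

	assert(rc == 0);
	rc = pthread_attr_getstacksize(&attr, &actual);
	assert(rc == 0);
	pthread_attr_destroy(&attr);
	return actual;
}
```

Build with something like "cc -pthread". Comparing
attr_stack_size(0) against attr_stack_size(256 * 1024) on an affected
node should show how much headroom the default stack gives over a small
explicit one. The exact default depends on the libc and, at least on
glibc, normally on the stack rlimit in effect at startup.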