Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle?

Rainbows! Rack HTTP server user/dev discussion
 help / color / mirror / code / Atom feed

* Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle?
@ 2012-06-19  9:06 Ilya Maykov
       [not found] ` <CAAsVqN8=HxMbbQkfonojsoCHgJAPsDhC7hqqcBK9GnYsO2ww7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Ilya Maykov @ 2012-06-19  9:06 UTC (permalink / raw)
  To: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw

Hi all,

We're using Rainbows + EventMachine + Sinatra::Synchrony to run a
fleet of RESTful web servers backed by a Cassandra cluster. We are
using the EventMachineTransport to talk to Cassandra with an
EM::Synchrony::ConnectionPool in each rainbows worker. We have a Storm
cluster pushing a large stream of real-time data into the Rainbows
fleet using HTTP PUT requests. We're running into some very strange
performance issues and need help figuring out what's going on.

Basically, when load is low, everything looks good. When we crank up
the load, all of a sudden the CPU gets pegged, request latencies go
waaaay up, and requests start timing out. Once this state is reached,
the high CPU usage (4 rainbows worker processes at ~50% each on a
2-core machine = nearly full load) remains even if we completely shut
off all incoming traffic. Taking a look with strace -p, it looks like
the rainbows processes are writing ascii NUL characters to file
descriptor 7 (which is a FIFO) as fast as the kernel will let them. My
guess is that the worker is trying to communicate with the rainbows
master process via the FIFO. Not sure what is triggering this
behavior, but would like to know if anyone else has ever seen
something like this. This thread sounded like it could've been a
similar issue, but died out without any conclusion:
http://rubyforge.org/pipermail/rainbows-talk/2012-April/000345.html

Some details about the setup:

6-node cassandra cluster
3 nodes running rainbows web servers
4 rainbows workers per node
max of 50 cassandra connections per rainbows worker
rainbows.conf has:

Rainbows! do
  use :EventMachine
  worker_connections 50
  keepalive_requests 1000
  keepalive_timeout  10
end

So, each rainbows node can handle 4 * 50 = 200 simultaneous connections

12 Storm worker processes writing to the rainbows web servers
each Storm worker has max of 10 connections open to each of the 3 rainbows nodes

So, each rainbows node has 12 * 10 = 120 incoming connections from Storm.

Have been playing around with the numbers, the bug (assuming it is a
bug) seems to be easier to trigger when I increase the number of
incoming connections (from Storm workers), even if they are a lot less
than the rainbows servers can take (60-70% of the max connections is
usually enough). The bug is also easier to trigger when we increase
the volume of data we're pushing through Storm - hundreds or thousands
of requests per minute, no bug - hundreds of thousands of requests per
minute, yes bug. Cassandra is not the issue, it can easily take the
write load we're generating and is basically idle.

Any help in figuring this out would be greatly appreciated. Thanks,

-- Ilya
_______________________________________________
Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
http://rubyforge.org/mailman/listinfo/rainbows-talk
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 4+ messages in thread

[parent not found: <CAAsVqN8=HxMbbQkfonojsoCHgJAPsDhC7hqqcBK9GnYsO2ww7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle?
       [not found] ` <CAAsVqN8=HxMbbQkfonojsoCHgJAPsDhC7hqqcBK9GnYsO2ww7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-06-19 17:54   ` Eric Wong
       [not found]     ` <20120619175409.GA27303-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Eric Wong @ 2012-06-19 17:54 UTC (permalink / raw)
  To: Rainbows! list

Ilya Maykov <ivmaykov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi all,
> 
> We're using Rainbows + EventMachine + Sinatra::Synchrony to run a
> fleet of RESTful web servers backed by a Cassandra cluster. We are
> using the EventMachineTransport to talk to Cassandra with an
> EM::Synchrony::ConnectionPool in each rainbows worker. We have a Storm
> cluster pushing a large stream of real-time data into the Rainbows
> fleet using HTTP PUT requests. We're running into some very strange
> performance issues and need help figuring out what's going on.

I'm not at all familiar with Storm nor Cassandra.  How big are the HTTP
PUT requests Rainbows! is getting?  Is Storm pipelining HTTP requests by
any chance?  That may not do well with the EM portion of Rainbows!

> Basically, when load is low, everything looks good. When we crank up
> the load, all of a sudden the CPU gets pegged, request latencies go
> waaaay up, and requests start timing out. Once this state is reached,
> the high CPU usage (4 rainbows worker processes at ~50% each on a
> 2-core machine = nearly full load) remains even if we completely shut
> off all incoming traffic.

> Taking a look with strace -p, it looks like
> the rainbows processes are writing ascii NUL characters to file
> descriptor 7 (which is a FIFO) as fast as the kernel will let them. My
> guess is that the worker is trying to communicate with the rainbows
> master process via the FIFO.

No, rainbows doesn't have code to send "\0" to the master.  I
don't think EM does, either, maybe some other library you're
using...

Which version of Ruby is this?  Try adding "-f" to follow threads
for a worker.

> Not sure what is triggering this
> behavior, but would like to know if anyone else has ever seen
> something like this. This thread sounded like it could've been a
> similar issue, but died out without any conclusion:
> http://rubyforge.org/pipermail/rainbows-talk/2012-April/000345.html
> 
> Some details about the setup:
> 
> 6-node cassandra cluster
> 3 nodes running rainbows web servers
> 4 rainbows workers per node
> max of 50 cassandra connections per rainbows worker
> rainbows.conf has:
> 
> Rainbows! do
>   use :EventMachine
>   worker_connections 50
>   keepalive_requests 1000
>   keepalive_timeout  10
> end

Do you have preload_app set to true anywhere?  (Try leaving it as false
(the default))

Can you also try "keepalive_timeout 0" to disable keepalive?  (EM
handles it internally, but I'm not sure how well)

> So, each rainbows node can handle 4 * 50 = 200 simultaneous connections
> 
> 12 Storm worker processes writing to the rainbows web servers
> each Storm worker has max of 10 connections open to each of the 3 rainbows nodes
> 
> So, each rainbows node has 12 * 10 = 120 incoming connections from Storm.
> 
> Have been playing around with the numbers, the bug (assuming it is a
> bug) seems to be easier to trigger when I increase the number of
> incoming connections (from Storm workers), even if they are a lot less
> than the rainbows servers can take (60-70% of the max connections is
> usually enough). The bug is also easier to trigger when we increase
> the volume of data we're pushing through Storm - hundreds or thousands
> of requests per minute, no bug - hundreds of thousands of requests per
> minute, yes bug. Cassandra is not the issue, it can easily take the
> write load we're generating and is basically idle.
> 
> Any help in figuring this out would be greatly appreciated. Thanks,

Try my suggestions above.

I would also search your libs/gems for what's writing "\0" since I don't
think it's Rainbows!...
_______________________________________________
Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
http://rubyforge.org/mailman/listinfo/rainbows-talk
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 4+ messages in thread

[parent not found: <20120619175409.GA27303-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>]

* Re: Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle?
       [not found]     ` <20120619175409.GA27303-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
@ 2012-07-23 23:34       ` Ilya Maykov
       [not found]         ` <CAAsVqN91_KiexPc4u9byuVstyknj7TgT4V7evs1x65W+6izzkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Ilya Maykov @ 2012-07-23 23:34 UTC (permalink / raw)
  To: Rainbows! list; +Cc: normalperson-rMlxZR9MS24

Hi Eric,

Sorry for the delayed response. I've added inline answers to your
questions. We've since resolved this issue by disabling keepalive in
our rainbows config. So, this probably had to do with the keepalive
implementation either in Rainbows itself or in the base Unicorn code.
Answers to your other questions are inlined below.

-- Ilya


On Tue, Jun 19, 2012 at 10:54 AM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> Ilya Maykov <ivmaykov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Hi all,
>>
>> We're using Rainbows + EventMachine + Sinatra::Synchrony to run a
>> fleet of RESTful web servers backed by a Cassandra cluster. We are
>> using the EventMachineTransport to talk to Cassandra with an
>> EM::Synchrony::ConnectionPool in each rainbows worker. We have a Storm
>> cluster pushing a large stream of real-time data into the Rainbows
>> fleet using HTTP PUT requests. We're running into some very strange
>> performance issues and need help figuring out what's going on.
>
> I'm not at all familiar with Storm nor Cassandra.  How big are the HTTP
> PUT requests Rainbows! is getting?  Is Storm pipelining HTTP requests by
> any chance?  That may not do well with the EM portion of Rainbows!

The PUT requests are small - probably about 100 bytes of JSON payload
per request (not counting HTTP overhead, headers, etc). Pipelining is
disabled on the clients, though keep-alive is not.

>
>> Basically, when load is low, everything looks good. When we crank up
>> the load, all of a sudden the CPU gets pegged, request latencies go
>> waaaay up, and requests start timing out. Once this state is reached,
>> the high CPU usage (4 rainbows worker processes at ~50% each on a
>> 2-core machine = nearly full load) remains even if we completely shut
>> off all incoming traffic.
>
>> Taking a look with strace -p, it looks like
>> the rainbows processes are writing ascii NUL characters to file
>> descriptor 7 (which is a FIFO) as fast as the kernel will let them. My
>> guess is that the worker is trying to communicate with the rainbows
>> master process via the FIFO.
>
> No, rainbows doesn't have code to send "\0" to the master.  I
> don't think EM does, either, maybe some other library you're
> using...
>

No idea ... I think EventMachine does start up a background thread,
not sure if that would affect anything in a bad way.

> Which version of Ruby is this?  Try adding "-f" to follow threads
> for a worker.
>

This is with Ruby 1.9.2-p290, installed by using rbenv-installer.

>> Not sure what is triggering this
>> behavior, but would like to know if anyone else has ever seen
>> something like this. This thread sounded like it could've been a
>> similar issue, but died out without any conclusion:
>> http://rubyforge.org/pipermail/rainbows-talk/2012-April/000345.html
>>
>> Some details about the setup:
>>
>> 6-node cassandra cluster
>> 3 nodes running rainbows web servers
>> 4 rainbows workers per node
>> max of 50 cassandra connections per rainbows worker
>> rainbows.conf has:
>>
>> Rainbows! do
>>   use :EventMachine
>>   worker_connections 50
>>   keepalive_requests 1000
>>   keepalive_timeout  10
>> end
>
> Do you have preload_app set to true anywhere?  (Try leaving it as false
> (the default))
>

No, we are not using preload_app.

> Can you also try "keepalive_timeout 0" to disable keepalive?  (EM
> handles it internally, but I'm not sure how well)
>

This turned out to be the problem. Disabling keepalive got rid of the
CPU pegging. Surprisingly, it also made our average latency drop from
about 50 ms to about 20 ms per request, even though every request now
has to negotiate a connection handshake. So, we're just going to keep
it disabled for now. The bug may be inside the keepalive code in
Rainbows or Unicorn (not familiar with the codebase so not sure where
that code lives).

>> So, each rainbows node can handle 4 * 50 = 200 simultaneous connections
>>
>> 12 Storm worker processes writing to the rainbows web servers
>> each Storm worker has max of 10 connections open to each of the 3 rainbows nodes
>>
>> So, each rainbows node has 12 * 10 = 120 incoming connections from Storm.
>>
>> Have been playing around with the numbers, the bug (assuming it is a
>> bug) seems to be easier to trigger when I increase the number of
>> incoming connections (from Storm workers), even if they are a lot less
>> than the rainbows servers can take (60-70% of the max connections is
>> usually enough). The bug is also easier to trigger when we increase
>> the volume of data we're pushing through Storm - hundreds or thousands
>> of requests per minute, no bug - hundreds of thousands of requests per
>> minute, yes bug. Cassandra is not the issue, it can easily take the
>> write load we're generating and is basically idle.
>>
>> Any help in figuring this out would be greatly appreciated. Thanks,
>
> Try my suggestions above.
>
> I would also search your libs/gems for what's writing "\0" since I don't
> think it's Rainbows!...
> _______________________________________________
> Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
> http://rubyforge.org/mailman/listinfo/rainbows-talk
> Do not quote signatures (like this one) or top post when replying
_______________________________________________
Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
http://rubyforge.org/mailman/listinfo/rainbows-talk
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 4+ messages in thread

[parent not found: <CAAsVqN91_KiexPc4u9byuVstyknj7TgT4V7evs1x65W+6izzkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle?
       [not found]         ` <CAAsVqN91_KiexPc4u9byuVstyknj7TgT4V7evs1x65W+6izzkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-07-24  0:21           ` Eric Wong
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Wong @ 2012-07-24  0:21 UTC (permalink / raw)
  To: Rainbows! list

Ilya Maykov <ivmaykov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> Hi Eric,
> 
> Sorry for the delayed response. I've added inline answers to your
> questions. We've since resolved this issue by disabling keepalive in
> our rainbows config. So, this probably had to do with the keepalive
> implementation either in Rainbows itself or in the base Unicorn code.
> Answers to your other questions are inlined below.

Thank you very much for the follow-up (I wish more folks would do this
:)

If Rainbows! is using EventMachine, it'll use the
EM.set_comm_inactivity_timeout method in EventMachine.

> On Tue, Jun 19, 2012 at 10:54 AM, Eric Wong <normalperson-rMlxZR9MS24@public.gmane.org> wrote:
> > Ilya Maykov <ivmaykov-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > Can you also try "keepalive_timeout 0" to disable keepalive?  (EM
> > handles it internally, but I'm not sure how well)
> >
> 
> This turned out to be the problem. Disabling keepalive got rid of the
> CPU pegging. Surprisingly, it also made our average latency drop from
> about 50 ms to about 20 ms per request, even though every request now
> has to negotiate a connection handshake. So, we're just going to keep
> it disabled for now. The bug may be inside the keepalive code in
> Rainbows or Unicorn (not familiar with the codebase so not sure where
> that code lives).

Are you setting Content-Length or "Transfer-Encoding: chunked" in
responses?  Rack::ContentLength or Rack::Chunked middleware might need
to be loaded if your framework doesn't already include it.

Lack of these headers may confuse clients, even...  Where did you
measure the 50 -> 20ms latency drop, from the client?

About the latency drop:

  Was :tcp_cork enabled in your listen directive?  I wonder if there's
  some bad interaction with :tcp_cork + EM which might explain the
  latency (but not the CPU usage).  Disabling keepalive would force data
  out immediately and avoid any bad effects of :tcp_cork.j

  Can you also try listen() with :tcp_defer_accept => 0?  That might
  help if you're accept()-ing many connections at once.

Anyways, the TCP connection handshake is very fast on LANs/localhost,
but hurts on high-latency connections (I seem to remember mainstream web
browsers double simultaneous requests to compensate for lack of
keepalive).

Which version of EM are you using?  The keepalive implementation for
EM+Rainbows! is entirely handled by EM.
_______________________________________________
Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org
http://rubyforge.org/mailman/listinfo/rainbows-talk
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2012-07-24  0:21 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-19  9:06 Rainbows! + EventMachine + Sinatra::Synchrony == pegged CPU when idle? Ilya Maykov
     [not found] ` <CAAsVqN8=HxMbbQkfonojsoCHgJAPsDhC7hqqcBK9GnYsO2ww7g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-06-19 17:54   ` Eric Wong
     [not found]     ` <20120619175409.GA27303-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org>
2012-07-23 23:34       ` Ilya Maykov
     [not found]         ` <CAAsVqN91_KiexPc4u9byuVstyknj7TgT4V7evs1x65W+6izzkQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2012-07-24  0:21           ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/rainbows.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).