Adam Klein
2015-04-20 20:26:03 UTC
I'm using Celery in what is most likely an unconventional way: as the basis
for a compute grid. Each worker has a pre-fork pool equal in size to its
number of cores. Each task pulled off the queue causes a pool process to
spawn a new Python process (via subprocess.Popen, with a new thread
consuming the subprocess's textual output) and then wait for that
subprocess to exit via a thread join. These are long-running,
computationally intensive processes (think numpy/scipy stuff).
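For concreteness, each task looks roughly like the sketch below (simplified;
the command, app name, and broker details are placeholders rather than my
actual code):

import subprocess
import threading

from celery import Celery

app = Celery('grid', broker='amqp://')  # placeholder app/broker

@app.task
def run_job(cmd):
    # Spawn the compute process; a separate thread drains its stdout
    # so the pipe never fills up and blocks the child.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)

    def consume():
        for line in proc.stdout:
            pass  # in reality this output gets logged/collected

    reader = threading.Thread(target=consume)
    reader.start()

    # The join only returns once the subprocess closes its output,
    # i.e. once the long numpy/scipy computation has finished.
    reader.join()
    return proc.wait()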
The problem is that when a worker's pre-fork pool is fully utilized, the
worker no longer responds to polling via Flower or via inspect() from
celery.task.control. In this worker's log you will also see info-level
messages such as "missed heartbeat from ***@hostname", where hostname is
one of the other worker machines, as well as "substantial drift from
***@hostname may mean clocks are out of sync".
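(By polling I mean essentially the broadcast-based inspection below; the
saturated worker simply stops showing up in the replies. The snippet is just
to illustrate what I'm calling.)

from celery.task.control import inspect

i = inspect()
print(i.ping())    # the blocked worker is missing from the pong replies
print(i.active())  # and from the per-worker listing of running tasks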
However, this is *not* a problem when I only use up to (n-1) of the n
pre-forked processes. It seems that the worker's nth pool process, which
isn't blocked, can maintain communication with the rest of the grid and
keep the worker from going "dark".
Are there any thoughts on how to work around this situation?
As a best-effort measure right now I'm using RabbitMQ with
CELERYD_PREFETCH_MULTIPLIER = 1 and also locally controlling how many tasks
I send onto the work queue, but essentially this requires cooperative grid
usage among users.
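For reference, the relevant configuration is essentially the following (the
broker URL and the concurrency number are illustrative, not my real values):

BROKER_URL = 'amqp://guest:guest@rabbit-host:5672//'  # RabbitMQ (placeholder host)
CELERYD_PREFETCH_MULTIPLIER = 1  # each pool process reserves only one task at a time
CELERYD_CONCURRENCY = 8          # pre-fork pool sized to the machine's core count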
Thanks!