Use systemd to autorestart Celery workers
Closed, Resolved · Public · 2 Estimated Story Points · BUG REPORT

Description

Occasionally the Celery workers that power WikiWho go down. Because we start them using init.d, they must be manually restarted when they die. We should move them to systemd instead so they are auto-restarted.

See example implementation and configuration.

Original report

Steps to replicate the issue (include links if applicable):

What happens?:

Chrome thinks for a while (with blue banner showing), then replaces it with a red banner saying, "Error: Refresh or try again later."

What should have happened instead?:

Chrome thinks for a while (with blue banner showing), then banner changes to say that Who Wrote That is in effect.

Other information (browser name/version, screenshots, etc.):

I'm using Chrome. I could reproduce this on both Windows and Chrome OS.

Event Timeline

Restricted Application added a subscriber: Aklapper.

I am seeing this repeatedly over a few hours. Most recent attempt was at en:Michel de Certeau.

Yesterday, I saw it at en:Ali_La_Pointe in rev. 1093913670, which had an unterminated template (since fixed) that I assumed may have been breaking WWT; however, it's not working in the live version either.

Tamzin triaged this task as Unbreak Now! priority. Oct 2 2022, 12:24 AM
Tamzin subscribed.

Persisting here for a few days now (Chrome on ChromeOS). Going to boldly set this as "Unbreak Now!" since the extension is currently unusable for many/all people.

  • Any errors in the console?
  • Does this occur on every page?

API call for given example article (http://wikiwho.wmflabs.org/en/api/v1.0.0-beta/latest_rev_content/Spaatz%20Island/?o_rev_id=true&editor=true&token_id=true&out=true&in=true) returns an expected response

{"article_title":"Spaatz_Island","page_id":2326721,"success":true,"message":null,"revisions":[...]

Does this occur on every page?

I clicked "Random article" ten times, then WWT, yielding three failures in these articles:

TheresNoTime lowered the priority of this task from Unbreak Now! to High. Oct 2 2022, 12:43 AM

Thanks @Mathglot, looking like it's articles which haven't yet been indexed by wikiwho (only a guess though...!)

Dropping to High given the impact is less severe (though we'll take a look at this promptly), and to save generating group alert emails on every comment.

In a retest of those ten articles, the three failing articles failed again, whereas the seven working articles worked again this time.

Meanwhile over here:

Error looks like

{
    "info": "Requested data is not currently available in WikiWho database. It will be available soon.",
    "success": false,
    "rev_id": 1081695012,
    "page_title": "Thespidium"
}

repeated a total of five times before the error message pops up in-window.

Retried the five failing examples from Tamzin's ten above; four failed but one worked:

Working off of Wikipedia:Contributor copyright investigations/Ajdebre. Firefox, Windows OS.

Failure rate has dropped from before; yesterday, no page would load WWT for me. Error message is the same as Tamzin described.

I thought I should add that sometimes yellow highlighting fails, even when a given test is otherwise working. I wonder if others have noticed this during your testing? I consider this lower priority, but still worth mentioning; maybe it deserves its own ticket if it still occurs after the main problem of WWT sometimes not loading at all is resolved.

To clarify: when I've labeled a test above as "working", that means that clicking page text will show the pop-up dialog box with who/when and the underlying diff link which when clicked goes to the correct diff page; however within that set of "working" examples, I don't always see the yellow highlighting that I expect to see; it is intermittently present among otherwise working examples.

MusikAnimal added subscribers: Ragesoss, MusikAnimal.

The WikiWho celery workers apparently went down again. I restarted them and I see the server is busy catching up on all the revisions it missed. If this has really been down since Sept 27, it may take a day or so for everything to be back to normal.

Apologies for the interruption! We are going to get the service set up with systemd to prevent this from happening in the future.

For the past month or so, I've also noticed that when Mathglot's issue pops up (no highlight, but everything is working) that my entire tab gets zoomed out. This happens across Firefox and Chrome browsers. Once that happens to a page, WWT will not load normally and I have to try again another time.

@Mathglot I'd suggest creating a dedicated task for that bug report, as it looks like @Sennecaster has had similar issues in the past as well :)

See T320364.

MusikAnimal renamed this task from "Who Wrote That" fails with "Error: Refresh or try again later." to Using systemd to autorestart Celery workers. Mar 16 2023, 5:07 PM
MusikAnimal updated the task description.
MusikAnimal renamed this task from Using systemd to autorestart Celery workers to Use systemd to autorestart Celery workers. May 3 2023, 3:40 AM
MusikAnimal updated the task description.

I spent many an hour trying to get this to work but eventually gave up. I'm forgoing this for the T334891 effort.

For whoever wishes to take a stab at this, my attempt in PR #12 is most of the way there. Celery starts but immediately fails saying it doesn't know what the "wikiwho_api" app is. I think some of it has to do with bash and string interpolation. At any rate, the official example from Celery does not work (removing Type=forking will save you a few hours of debugging).

So in layman's terms: the bug that was reported here (where the Celery service died) is prone to happening again. We'll try to revisit this some time in the future.

Some changes made to the PR:

  1. The EnvironmentFile (/etc/conf.d/celery) did not exist — created via cp /etc/default/celeryd /etc/conf.d/celery
  2. The service Type does need to be forking — added Type=forking
  3. You get a module error if not in the correct WorkingDirectory — set WorkingDirectory=/home/wikiwho/wikiwho_api
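Taken together, the three changes above point to a unit file roughly like the one in this sketch. It is not the contents of PR #12: the EnvironmentFile, Type and WorkingDirectory values come from the list above and the ExecStart variables are the ones visible in the service's status output, but the After=, Restart=always and WantedBy= lines, and the unescaped ${...} form, are assumptions modelled on the upstream Celery daemonization example. The script writes to a scratch directory rather than /etc/systemd/system so it can be run without root.

```shell
#!/bin/sh
# Sketch only: writes a candidate unit file to a scratch directory so it can be
# reviewed before anything is copied to /etc/systemd/system.
unit_dir=$(mktemp -d)
unit_file="$unit_dir/ww_celery.service"

# The quoted 'EOF' delimiter stops this shell from expanding ${...}; the
# variables are meant to be filled in from the EnvironmentFile at start time.
cat > "$unit_file" <<'EOF'
[Unit]
Description=Celery Service
# Assumption: wait for the network, as in the upstream Celery example.
After=network.target

[Service]
# "celery multi start" launches the workers in the background and then exits.
Type=forking
EnvironmentFile=/etc/conf.d/celery
WorkingDirectory=/home/wikiwho/wikiwho_api
ExecStart=/bin/bash -c '${CELERY_BIN} -A ${CELERY_APP} multi start ${CELERYD_NODES} \
    --pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE} \
    --loglevel=${CELERYD_LOG_LEVEL} ${CELERYD_OPTS}'
# Assumption: ask systemd to restart the service whenever it exits.
Restart=always

[Install]
WantedBy=multi-user.target
EOF

echo "Wrote $unit_file"
```

Reviewing the generated file before installing it keeps the live init.d setup untouched until the unit is known to be sane.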

I then did /etc/init.d/celeryd stop and systemctl enable/start ww_celery.service and the service appears to be running successfully:

root@wikiwho01:~# systemctl status ww_celery.service
● ww_celery.service - Celery Service
     Loaded: loaded (/etc/systemd/system/ww_celery.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2023-08-23 10:54:23 UTC; 8min ago
    Process: 2173203 ExecStart=/bin/bash -c \${CELERY_BIN} -A \${CELERY_APP} multi start \${CELERYD_NODES}          --pidfile=\${CELERYD_PID_FILE} --logfile=\${CELERYD_LOG_FILE}          --loglevel=\${CELERYD_LOG_>
      Tasks: 19 (limit: 147308)
     Memory: 14.7G
        CPU: 49min 16.376s
     CGroup: /system.slice/ww_celery.service
             ├─2173207 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173210 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q user -c 4 --logfile=/var/log/celery/worker_user%I.log --pidfile=/var/run/celery/worker_user.pi>
             ├─2173213 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q long -c 4 --logfile=/var/log/celery/worker_long%I.log --pidfile=/var/run/celery/worker_long.pi>
             ├─2173238 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173239 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173263 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173264 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173265 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173290 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173291 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q user -c 4 --logfile=/var/log/celery/worker_user%I.log --pidfile=/var/run/celery/worker_user.pi>
             ├─2173292 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173294 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q user -c 4 --logfile=/var/log/celery/worker_user%I.log --pidfile=/var/run/celery/worker_user.pi>
             ├─2173295 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q default -c 8 --logfile=/var/log/celery/worker_default%I.log --pidfile=/var/run/celery/worker_d>
             ├─2173296 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q long -c 4 --logfile=/var/log/celery/worker_long%I.log --pidfile=/var/run/celery/worker_long.pi>
             ├─2173297 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q user -c 4 --logfile=/var/log/celery/worker_user%I.log --pidfile=/var/run/celery/worker_user.pi>
             ├─2173298 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q long -c 4 --logfile=/var/log/celery/worker_long%I.log --pidfile=/var/run/celery/worker_long.pi>
             ├─2173299 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q user -c 4 --logfile=/var/log/celery/worker_user%I.log --pidfile=/var/run/celery/worker_user.pi>
             ├─2173300 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q long -c 4 --logfile=/var/log/celery/worker_long%I.log --pidfile=/var/run/celery/worker_long.pi>
             └─2173301 /home/wikiwho/wikiwho_api/env/bin/python3 -m celery worker --loglevel=WARNING -A wikiwho_api -Q long -c 4 --logfile=/var/log/celery/worker_long%I.log --pidfile=/var/run/celery/worker_long.pi>

Aug 23 10:54:22 wikiwho01 systemd[1]: Starting Celery Service...
Aug 23 10:54:23 wikiwho01 bash[2173203]: celery multi v4.3.0 (rhubarb)
Aug 23 10:54:23 wikiwho01 bash[2173203]: > Starting nodes...
Aug 23 10:54:23 wikiwho01 bash[2173203]:         > worker_default@ww_host: OK
Aug 23 10:54:23 wikiwho01 bash[2173203]:         > worker_user@ww_host: OK
Aug 23 10:54:23 wikiwho01 bash[2173203]:         > worker_long@ww_host: OK
Aug 23 10:54:23 wikiwho01 systemd[1]: Started Celery Service.

I noted:

Aug 23 10:53:11 wikiwho01 systemd[1]: /etc/systemd/system/ww_celery.service:14: Ignoring unknown escape sequences: "\${CELERY_BIN} -A \${CELERY_APP} multi start \${CELERYD_NODES}          --pidfile=\${CELERYD_PID_FILE} --logfile=\${CELERYD_LOG_FILE}          --loglevel=\${CELERYD_LOG_LEVEL} \${CELERYD_OPTS}"
[...]

in journalctl -xef -u ww_celery — I saw in a semi-related (?) SO post that perhaps --pidfile=\${CELERYD_PID_FILE} etc should be --pidfile=\\${CELERYD_PID_FILE}, but I'm not sure (and haven't tried).
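One place where backslashes like these matter is when a unit file is written out by a shell script. It is not confirmed that this is how they ended up in ww_celery.service, so treat the following as an illustration of the mechanism only. In an unquoted heredoc the shell consumes the backslash and leaves a plain ${...} in the file, while in a quoted heredoc the backslash is copied through verbatim, which is the kind of text systemd then reports as an unknown escape sequence.

```shell
#!/bin/sh
# Illustration only: the same source line written through two kinds of heredoc.

# Unquoted delimiter: the shell treats \$ as an escaped dollar sign and
# removes the backslash, so the output contains a plain ${...}.
unquoted=$(cat <<EOF
--pidfile=\${CELERYD_PID_FILE}
EOF
)

# Quoted delimiter: the text is copied literally, backslash included.
quoted=$(cat <<'EOF'
--pidfile=\${CELERYD_PID_FILE}
EOF
)

printf 'unquoted heredoc: %s\n' "$unquoted"
printf 'quoted heredoc:   %s\n' "$quoted"
```

Separately, the systemd.service documentation describes `$$` as the way to pass a literal dollar sign in ExecStart; a plain ${VAR} is expanded by systemd itself from the unit's environment.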

Set https://github.com/wikimedia/wikiwho_api/pull/12 as no longer a draft — changes have been made live on wikiwho01.wikiwho.eqiad1.wikimedia.cloud

MusikAnimal moved this task from Backlog to Maintenance / tech debt on the WikiWho board.

Resolving! Thanks again, Sammy :)