Turn $wgRunJobsAsync off by default
Closed, Resolved | Public

Description

The changes to in-process job execution in MediaWiki 1.22 and 1.23, which basically made job execution asynchronous, are problematic in a considerable number of MediaWiki installations out there, as documented on the job queue manual page.

We have basically bugs like

In MediaWiki 1.27 this seems to be worse, or at least I've noticed a large increase in the number of users reporting issues with this (on installations upgraded from previous versions), where the job queue is not working and there is no error message or any other warning about the complete failure of the jobs. The only visible effect, which looks totally unrelated to anyone not familiar with the job queue, is that categories aren't being populated.

In all cases, the solution is to set $wgRunJobsAsync to false. This should be the default value for new installations, and people who want to increase performance can test whether $wgRunJobsAsync works for them and enable it, or better yet, use other options like a cron job or redis. Default settings should work in all environments, and that's currently not the case. Since the current asynchronous execution will never work in all scenarios, and nobody is working on a fix, can we just turn that setting off so it just works?
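For reference, a minimal sketch of the workaround described above, as it might appear in a wiki's LocalSettings.php. The file placement is the usual convention and an assumption here, not something stated in this task.

```php
<?php
// LocalSettings.php (fragment)

// Run queued jobs in the current web request instead of handing them off
// to a separate asynchronous request, which fails on some server setups.
$wgRunJobsAsync = false;
```

With this set, no jobs are switched off; only the way they are triggered changes.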

Event Timeline

Change 306667 had a related patch set uploaded (by Martineznovo):
Turn $wgRunJobsAsync off by default

https://gerrit.wikimedia.org/r/306667

I'd like to see other opinions, so that this doesn't become yet another change made by a single developer with only one review, where the design decision may be overlooked and not properly discussed, letting bad decisions last for too long.

For completeness, posting here my comment from gerrit change 306667:

Your changes (rMW6a9e507dc515c6116e691afc2148642bc5421b34) may help in those *documented* scenarios where the job fails, but I'm pretty sure there are other failing scenarios as well. When the first failures were reported I tried, without success, to gather more information about what was failing, and since nobody cared to fix the situation I started to simply suggest setting $wgRunJobsAsync to false. Four MediaWiki versions have gone by without a fix, so it's time to fix it once and for all. People must already be applying $wgRunJobsAsync = false silently, since those support desk answers appear high in Google search results anyway, so let's not prolong this agony.

This is definitely an area that is more heavily used by so-called "third party" MediaWiki deployments (i.e. not the Wikimedia Foundation). I know that @aaron has historically done a lot of work to make processing jobs without a dedicated jobrunner more performant, and he's been actively working on it again recently (probably in part due to this bug report). I would certainly trust his judgment over my own in this part of the codebase.

Getting defaults right is tricky. We want the out of the box experience of running MediaWiki to be a good one. If performance is crappy then people will choose to install something else. Likewise if stability is crappy they will move on. When there is a tradeoff between the two, choosing stability over performance is often the best decision. I'm not excited to choose a side in this particular discussion right away however. It would be great if we had some sort of empirical data to look at rather than a collection of anecdotes.

If what we are looking for is some sort of broad consensus by experienced MediaWiki installers/maintainers I think that some call for input should be put out on wikitech-l.

In MediaWiki 1.27 this seems to be worse

I think DeferredUpdates should only "enqueue" to jobs at Wikimedia. Most of the benefits only appear in a multi-data center setup (or at least a wiki with more than one db server), so it makes a lot less sense for third parties.

I think that support for async job queue should be feature tested during the installer.

Categorization via the job queue enhances the performance?

And the recommended solution for "pages not appearing in cats" is to set $wgRunJobsAsync = false?

Please explain to me how turning off all async jobs enhances the performance.

Please explain to me how turning off all async jobs enhances the performance.

Why would anyone explain a statement that wasn't made in this whole task? This is about making the "in-process" job queue work in all cases where it's failing, as a default setting, not about performance.

Why was categorization outsourced to the job queue then if not for performance?

Why was categorization outsourced to the job queue then if not for performance?

Again, that's out of the scope of this task. This task is not about categorization not working, it's about the job queue not working in some MediaWiki instances when using the default configuration.

Categorization via the job queue enhances the performance?

[This is mildly off-topic, but might as well answer the question]
In a multi-cluster setup where slave lag is critical (AKA for WMF), yes it does, because it allows us to finely control the rate of db writes. In your typical MediaWiki install, no it does not. If you want to understand the details, read https://www.mediawiki.org/wiki/Requests_for_comment/Master_%26_slave_datacenter_strategy_for_MediaWiki

And the recommended solution for "pages not appearing in cats" is to set $wgRunJobsAsync = false?

Please explain to me how turning off all async jobs enhances the performance.

It does not improve performance. (Also, it doesn't turn off asynchronous jobs; it makes jobs take place in the current request instead of after it. No jobs are being turned off, they are just being executed differently.) Nobody has claimed it improves performance (other than in the sense that a system that works is more performant than a system that does not work). It does make your job queue work in the case where your webserver configuration does not support $wgRunJobsAsync.

[OFFTOPIC]

I meant it in this way:

Delayed categorization was added for performance reasons.

Admins who like performance use $wgJobRunRate = 0;

Delayed categorization fixes no existing problem for the typical wiki installation (as you mentioned) and confuses users (on mediawiki.org pages take up to 20 seconds to appear in categories).

If you want categorization changes to appear immediately as before (especially since action=purge wouldn't even help advanced users here), you are forced to use $wgJobRunRate >= 2, which reduces performance for ALL job tasks(!).

Thus, the delayed categorization implementation reduces performance when you used $wgJobRunRate = 0 and want to see category changes immediately. There is no way to fix it except by making job categorization optional.
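To make the settings discussed in this comment concrete, here is a hedged sketch of a LocalSettings.php fragment that disables job execution during web requests and leaves the work to an external runner. The crontab line, its path and its schedule are illustrative assumptions, not taken from this task.

```php
<?php
// LocalSettings.php (fragment)

// 0 = never run jobs as part of a web request.
$wgJobRunRate = 0;

// Jobs must then be run some other way, for example from cron.
// Hypothetical crontab entry (path and interval are assumptions):
//   */5 * * * * php /path/to/mediawiki/maintenance/runJobs.php --maxtime=240
```

In such a setup, category changes would only appear after the next runner pass, which is the delay this comment objects to.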

Read again.

$wgJobRunRate combined with $wgRunJobsAsync should not add latency.

Job pickup time is probably hurt without https://gerrit.wikimedia.org/r/#/c/309826/ though.

MediaWiki 1.28.0, and 1660523 pending jobs: Topic:The8wby7f7wmq1q1.

I'm sorry @aaron, but it looks like your patches haven't fixed the problem yet. Would you consider merging gerrit change 306667? More and more installations are setting $wgRunJobsAsync = false, so I don't see the point of keeping a problematic setting as the default when it doesn't seem to work in a variety of scenarios.

Change 306667 merged by jenkins-bot:
Turn $wgRunJobsAsync off by default

https://gerrit.wikimedia.org/r/306667

So the problem still exists in MW 1.28?

Change 328294 had a related patch set uploaded (by Martineznovo):
Turn $wgRunJobsAsync off by default

https://gerrit.wikimedia.org/r/328294

Change 328296 had a related patch set uploaded (by Martineznovo):
Turn $wgRunJobsAsync off by default

https://gerrit.wikimedia.org/r/328296

Change 328296 merged by jenkins-bot:
Turn $wgRunJobsAsync off by default

https://gerrit.wikimedia.org/r/328296

Change 328294 merged by jenkins-bot:
Turn $wgRunJobsAsync off by default

https://gerrit.wikimedia.org/r/328294