Move job traffic from rpc/RunSingleJob to REST endpoint
Open, Low, Public

Description

Once T244770 has been deployed and enabled, we need to switch change-prop from calling /rpc/RunSingleJob to calling the REST endpoint.

The config change should be pretty straightforward:

  • Change the jobrunner_uri
  • Add the 'Host' header equal to {{message.meta.domain}}
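To illustrate the two changes above, here is a minimal sketch of how a request for a single job event could be assembled. This is not the actual change-prop config or code: `build_job_request`, the example URI, and the shape of the returned dict are all assumptions made for illustration. Only the `Host` header rule (equal to `message.meta.domain`) comes from this task.

```python
# Hypothetical placeholder; the real REST URI lives in the deployment config.
REST_URI = "http://jobrunner.example.internal/rest-endpoint-placeholder"


def build_job_request(message, jobrunner_uri):
    """Describe the HTTP request for one job event (illustrative only)."""
    # The Host header carries the wiki domain taken from the event metadata,
    # i.e. the equivalent of {{message.meta.domain}} in the config template.
    domain = message["meta"]["domain"]
    return {
        "method": "post",
        "uri": jobrunner_uri,
        "headers": {"Host": domain},
        "body": message,
    }


example = build_job_request(
    {"meta": {"domain": "en.wikipedia.org"}, "type": "updateBetaFeaturesUsersCount"},
    REST_URI,
)
print(example["headers"])
```

The point of the sketch is that the target URI stays the same for every wiki, while the `Host` header varies per event.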

I propose doing it in 3 stages:

  • Switch labs in its entirety.
  • Switch the 'updateBetaFeaturesUsersCount' job - this one is our favorite to experiment on, since it's so easy to trigger and check and the consequences of failing are so low. Currently it's run as a part of the low_traffic_jobs rule, but I think we should actually pull it out and give it its own rule, to use as a guinea pig in all the deployments.
  • Switch everything.
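The three stages can be read as a simple routing decision per job. The sketch below is only a model of that decision under stated assumptions: the `Stage` enum, the `"beta"`/`"production"` environment labels, and the `use_rest_endpoint` helper are hypothetical names, not part of change-prop. In practice the staging would more likely be expressed through per-rule config than through code like this.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical rollout stages mirroring the list above."""
    NONE = 0        # nothing switched yet
    BETA_ONLY = 1   # labs (beta) switched in its entirety
    CANARY_JOB = 2  # plus the low-risk job in production
    EVERYTHING = 3  # all jobs everywhere


# The job proposed as the test subject for deployments.
CANARY_JOB_TYPE = "updateBetaFeaturesUsersCount"


def use_rest_endpoint(stage, environment, job_type):
    """Return True if this job should go to the REST endpoint at this stage."""
    if environment == "beta":
        return stage >= Stage.BETA_ONLY
    if job_type == CANARY_JOB_TYPE:
        return stage >= Stage.CANARY_JOB
    return stage >= Stage.EVERYTHING


# At stage 2, only the canary job is switched in production.
print(use_rest_endpoint(Stage.CANARY_JOB, "production", CANARY_JOB_TYPE))
print(use_rest_endpoint(Stage.CANARY_JOB, "production", "refreshLinks"))
```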

In the process we would probably need a new value, jobrunner_rest_uri, in values.yaml, and then clean it all up after the switch.

Event Timeline

Change 576131 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Move beta job traffic from rpc/RunSingleJob to EventBus REST API

https://gerrit.wikimedia.org/r/576131

Change 577677 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[operations/puppet@production] cpjobqueue: Add jobrunner_host & videoscaler_host to deployment vars

https://gerrit.wikimedia.org/r/577677

Change 578534 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/change-propagation/jobqueue-deploy@master] Move beta job traffic from rpc/RunSingleJob to EventBus REST API

https://gerrit.wikimedia.org/r/578534

Change 577677 merged by Alexandros Kosiaris:
[operations/puppet@production] cpjobqueue: Add jobrunner_host & videoscaler_host to deployment vars

https://gerrit.wikimedia.org/r/577677

Change 576131 merged by Ppchelko:
[mediawiki/services/change-propagation/jobqueue-deploy@master] Update config to use jobrunner_host and videoscaler_host vars

https://gerrit.wikimedia.org/r/576131

Merged the config patch, needs to be deployed now.

@Clarakosi after deployment we can clean up puppet and remove jobrunner_uri and videoscaler_uri deployment vars. (Patch needed)

See also T228911, which might turn out to be related to the /rpc/RunSingleJob.php endpoint. I guess we'll find out once traffic has switched over. What's the timeline for that?

This is very close to the top of our Platform Engineering Roadmap, so hopefully it will get resourced relatively soon. cc @WDoranWMF

@akosiaris This task is needed for k8s we think but we're not sure where it should be tagged for SRE, would you advise us?

I fail to see the connection to k8s to be honest. Care to elaborate?

As ServiceOps, I don't think we have an actionable item, but we would like to be aware when it happens, so I've tagged it serviceops-radar.

This is blocked on (and blocking) unifying the Apache config between job runners and app servers. I was assuming that would be a part of the k8s project.

Ah, thanks for that clarification. I think it shouldn't be coupled to the k8s project. It could happen either before or after the k8s migration. Given the current staffing and priorities I'd bet on the after part, but I'll talk with the rest of the team and have a more concrete answer.

To clarify things a bit:

  • To switch to the REST endpoint, we need to unify the Apache configurations between the appservers and the jobrunners, probably just leaving the current jobrunner vhosts around for the duration of the switch.
  • Not needing to implement a separate set of Apache configurations for jobrunners on kubernetes would be desirable, so yes, I would like this task to be completed before we switch the jobrunners to kubernetes. It's not a hard requirement though: given we want to do this, it would be better to do it before we migrate to k8s for jobs.
  • We should not wait for the move to kubernetes in order to do this switch though. It should be considered a precondition.

Aklapper added a subscriber: Clarakosi.

Removing inactive task assignee