HTTP 500 errors from Beta Cluster Wikifunctions health-check API endpoint
Closed, ResolvedPublicBUG REPORT

Description

We're getting alerts that the Beta Cluster health endpoint is unhealthy.

Curling the health-check API endpoint every five seconds, only about 1 in 10 requests suceeds:

$ while true ; do curl -s 'https://wikifunctions.beta.wmflabs.org/w/api.php?action=wikilambda_health_check&format=json' ; echo ; sleep 5 ; done | TZ=UTC ts
Aug 11 15:40:33 {"query":{"wikilambda_health_check":{"success":"false","total_tests":3,"tested":2,"passed":1,"error":"Server error: `POST http://deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254/1/v1/evaluate/` resulted in a `500 Internal Server Error` response:\n{\"status\":500,\"type\":\"internal_error\",\"title\":\"FetchError\",\"detail\":\"invalid json response body at https://wikifunctions (truncated...)\n"}}}
Aug 11 15:40:40 {"query":{"wikilambda_health_check":{"success":"false","total_tests":3,"tested":2,"passed":1,"error":"Server error: `POST http://deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254/1/v1/evaluate/` resulted in a `500 Internal Server Error` response:\n{\"status\":500,\"type\":\"internal_error\",\"title\":\"FetchError\",\"detail\":\"invalid json response body at https://wikifunctions (truncated...)\n"}}}
Aug 11 15:40:47 {"query":{"wikilambda_health_check":{"success":"false","total_tests":3,"tested":2,"passed":1,"error":"Server error: `POST http://deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254/1/v1/evaluate/` resulted in a `500 Internal Server Error` response:\n{\"status\":500,\"type\":\"internal_error\",\"title\":\"FetchError\",\"detail\":\"invalid json response body at https://wikifunctions (truncated...)\n"}}}
Aug 11 15:40:53 {"query":{"wikilambda_health_check":{"success":"false","total_tests":3,"tested":2,"passed":1,"error":"Server error: `POST http://deployment-docker-wikifunctions01.deployment-prep.eqiad1.wikimedia.cloud:6254/1/v1/evaluate/` resulted in a `500 Internal Server Error` 
[...]

Event Timeline

I see this error occur every few seconds in the function-orchestrator log on deployment-docker-wikifunctions01:

$ sudo journalctl --pager-end --output=short --unit=mediawiki-services-function-orchestrator
[...]
{
  "name": "function-orchestrator",
  "hostname": "77979a715121",
  "pid": 94,
  "level": "ERROR",
  "message": "500: internal_error",
  "stack": "FetchError: invalid json response body at https://wikifunctions.beta.wmflabs.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z7%7CZ9%7CZ8%7CZ17%7CZ6%7CZ12%7CZ14 reason: Unexpected token < in JSON at position 0\n    at /srv/service/node_modules/node-fetch/lib/index.js:273:32\n    at tryCatcher (/srv/service/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/service/node_modules/bluebird/js/release/promise.js:547:31)\n    at Promise._settlePromise (/srv/service/node_modules/bluebird/js/release/promise.js:604:18)\n    at Promise._settlePromise0 (/srv/service/node_modules/bluebird/js/release/promise.js:649:10)\n    at Promise._settlePromises (/srv/service/node_modules/bluebird/js/release/promise.js:729:18)\n    at _drainQueueStep (/srv/service/node_modules/bluebird/js/release/async.js:93:12)\n    at _drainQueue (/srv/service/node_modules/bluebird/js/release/async.js:86:9)\n    at Async._drainQueues (/srv/service/node_modules/bluebird/js/release/async.js:102:5)\n    at Immediate.Async.drainQueues [as _onImmediate] (/srv/service/node_modules/bluebird/js/release/async.js:15:14)\n    at processImmediate (node:internal/timers:466:21)",
  "status": 500,
  "type": "internal_error",
  "detail": "invalid json response body at https://wikifunctions.beta.wmflabs.org/w/api.php?action=wikilambda_fetch&format=json&zids=Z7%7CZ9%7CZ8%7CZ17%7CZ6%7CZ12%7CZ14 reason: Unexpected token < in JSON at position 0",
  "request_id": "5da215a0-198d-11ed-815c-b365900d75e5",
  "request": {
    "url": "/1/v1/evaluate/",
    "headers": {
      "content-type": "application/json",
      "user-agent": "wikifunctions-request/1.39.0-alpha",
      "content-length": "1032",
      "x-request-id": "5da215a0-198d-11ed-815c-b365900d75e5"
    },
    "method": "POST",
    "params": {
      "0": "/1/v1/evaluate/"
    },
    "query": {},
    "remoteAddress": "REDACTED",
    "remotePort": 36436
  },
  "levelPath": "error/500",
  "msg": "500: internal_error",
  "time": "2022-08-11T15:50:47.054Z",
  "v": 0
}
[...]

I have not correlated the timestamps with the monitoring probe's.

Change 822632 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] tests: Temporarily disable ApiFunctionCallTest due to Beta Cluster flapping

https://gerrit.wikimedia.org/r/822632

Change 822632 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] tests: Temporarily disable ApiFunctionCallTest due to Beta Cluster flapping

https://gerrit.wikimedia.org/r/822632

Change 824429 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/WikiLambda@master] tests: Re-enable ApiFunctionCallTest calls to Beta Cluster

https://gerrit.wikimedia.org/r/824429

Change 824429 merged by jenkins-bot:

[mediawiki/extensions/WikiLambda@master] tests: Re-enable ApiFunctionCallTest calls to Beta Cluster

https://gerrit.wikimedia.org/r/824429

We're seeing errors again.

There are no outstanding issues that are specific to the Beta Cluster environment, AFAIK.