Page html output on testwiki returns 404
Closed, ResolvedPublic

Description

Steps to reproduce:

  • Hover over a link on a testwiki article that points to another article
  • Page preview doesn't show up

After a bit of debugging, it looks like the problem is on the RESTBase side:

Page previews use the /page/summary endpoint to fetch the data.

Event Timeline

Jgiannelos updated the task description.

From debugging:

  • When cache-control is set to no-cache (which means it's a pregeneration request), the response is 200 with the actual parsed content:
curl -v restbase2016.codfw.wmnet:7233/test.wikipedia.org/v1/page/html/Dog -H "cache-control: no-cache"
...
< HTTP/1.1 200
...
  • For the same request without cache-control (regular traffic), I get a 404:
...
< HTTP/1.1 404
...
Jgiannelos renamed this task from "Page previews on testwiki are not working" to "Page html output on testwiki returns 404". Nov 1 2023, 11:29 AM
Jgiannelos added a subscriber: hnowlan.

This issue may also be affecting other public-facing services like https://en.wikipedia.org/w/rest.php/en.wikipedia.org/v3/transform/wikitext/to/html

Parsoid endpoints are not expected to work for external requests. So this is "working" as expected.

The issue seems to be that the parsoid endpoints send a full URL in redirect responses, e.g. when redirecting from a page URL like /page/html/Dog to a revision URL like /page/html/Dog/12345. Since the target URL uses an external domain name, it will not be routed to the parsoid cluster, where the Parsoid extension is enabled on MediaWiki. The request will instead be routed to the public API cluster, where the relevant endpoints are unavailable, resulting in a 404 response.
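
To make the routing difference concrete, here is a minimal sketch (not WMF code, just standard-library URL resolution) of what a client does with an absolute vs. a relative Location header. The hostnames match the curl examples further down in this task.

from urllib.parse import urljoin

request_url = "https://parsoid-php.discovery.wmnet/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog"

# Absolute Location (behaviour before the fix): the follow-up request goes to
# the public hostname, gets routed to the public API cluster, and 404s there.
absolute = "https://test.wikipedia.org/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog/575419"
print(urljoin(request_url, absolute))
# https://test.wikipedia.org/...  (leaves the parsoid cluster)

# Relative Location (behaviour after the fix): resolving it against the original
# URL keeps the internal discovery hostname, so the follow-up request stays on
# the parsoid cluster where the endpoint exists.
relative = "/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog/575419"
print(urljoin(request_url, relative))
# https://parsoid-php.discovery.wmnet/...  (stays on the parsoid cluster)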

Change 970770 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] ParsoidHandler: emit relative URLs in redirects

https://gerrit.wikimedia.org/r/970770

For context, the reason this was only happening on testwiki is that we have storage disabled there.
In that case we don't resolve the exact revision URL on the outgoing requests to Parsoid, so the request gets redirected, but to the public-facing endpoint, which is not accessible.

For example:

curl -v -o /dev/null -k https://parsoid-php.discovery.wmnet/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog -H "Host: test.wikipedia.org"

< HTTP/1.1 302 Found
< Location: https://test.wikipedia.org/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog/575419

If we follow the redirect:

curl -L -v -o /dev/null -k https://parsoid-php.discovery.wmnet/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog -H "Host: test.wikipedia.org"
...
< HTTP/2 404
{"messageTranslations":{"en":"The requested relative path (/test.wikipedia.org/v3/page/pagebundle/Dog/575419) did not match any known handler"},"httpCode":404,"httpReason":"Not Found"}

The issue seems to be that the parsoid endpoints send a full URL in redirect responses, e.g. when redirecting from a page URL like /page/html/Dog to a revision URL like /page/html/Dog/12345. Since the target URL uses an external domain name, it will not be routed to the parsoid cluster, where the Parsoid extension is enabled on MediaWiki. The request will instead be routed to the public API cluster, where the relevant endpoints are unavailable, resulting in a 404 response.

Until very recently, this was enabled everywhere. It was re-disabled as part of https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/965608

Change 970770 merged by jenkins-bot:

[mediawiki/core@master] ParsoidHandler: emit relative URLs in redirects

https://gerrit.wikimedia.org/r/970770

Change 970764 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@wmf/1.42.0-wmf.3] ParsoidHandler: emit relative URLs in redirects

https://gerrit.wikimedia.org/r/970764

Change 970764 merged by jenkins-bot:

[mediawiki/core@wmf/1.42.0-wmf.3] ParsoidHandler: emit relative URLs in redirects

https://gerrit.wikimedia.org/r/970764

Mentioned in SAL (#wikimedia-operations) [2023-11-02T12:21:51Z] <daniel@deploy2002> Started scap: Backport for [[gerrit:970764|ParsoidHandler: emit relative URLs in redirects (T350219 T349001)]]

Mentioned in SAL (#wikimedia-operations) [2023-11-02T12:36:06Z] <daniel@deploy2002> daniel: Backport for [[gerrit:970764|ParsoidHandler: emit relative URLs in redirects (T350219 T349001)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2023-11-02T12:43:28Z] <daniel@deploy2002> Finished scap: Backport for [[gerrit:970764|ParsoidHandler: emit relative URLs in redirects (T350219 T349001)]] (duration: 21m 37s)

Change 971242 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/services/parsoid@master] Emit relative redirects

https://gerrit.wikimedia.org/r/971242

Change 971242 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Emit relative redirects.

https://gerrit.wikimedia.org/r/971242

[…]

For example:

curl -v -o /dev/null -k https://parsoid-php.discovery.wmnet/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog -H "Host: test.wikipedia.org"

< HTTP/1.1 302 Found
< Location: https://test.wikipedia.org/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog/575419

If we follow the redirect:

curl -L -v -o /dev/null -k https://parsoid-php.discovery.wmnet/w/rest.php/test.wikipedia.org/v3/page/pagebundle/Dog -H "Host: test.wikipedia.org"
...
< HTTP/2 404
{"messageTranslations":{"en":"The requested relative path (/test.wikipedia.org/v3/page/pagebundle/Dog/575419) did not match any known handler"},"httpCode":404,"httpReason":"Not Found"}

@ssastry Which software component is following the redirect? Is it Envoy, something in Parsoid's Rest handler, something in MediaWiki core's Rest framework, or something low-level like MWHttp/Guzzle/Curl?

I imagine that for most, if not all, of these a redirect follow would generally not retain a Host: header from the first request. I'm guessing that for Parsoid's routes, which repeat the domain name, this might work, although if it's anything like other MW app servers, it fails when addressed through the discovery hostname. If the above patches actually work, what is making it work? The Host header being automatically re-applied, or the parsoid servers allowing requests without a MW-accepted Host header?

For MediaWiki, the way we usually address this kind of use case is through $wgInternalServer and/or by defining wiki hostnames in the loopback map. For example, there is a local Envoy proxy configured on MW pods/hosts that proxies nested MW HTTP requests back to the same server or cluster.

See also $wgLocalHTTPProxy, $wgLocalVirtualHosts, and its use in MWHttpRequest.
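
For illustration, a minimal sketch of that loopback pattern, assuming a hypothetical local proxy address (the real Envoy listener address and port differ): the nested request is sent to the local proxy, while the Host header still names the target wiki so virtual-host routing works.

import requests

# Hypothetical local proxy address; the actual WMF Envoy listener differs.
LOCAL_PROXY = "http://localhost:6500"

resp = requests.get(
    f"{LOCAL_PROXY}/w/rest.php/v1/page/Earth",
    # Host names the wiki being addressed, not the proxy host.
    headers={"Host": "en.wikipedia.org"},
    allow_redirects=False,
)
print(resp.status_code, resp.headers.get("location"))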

Change 971318 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] Rest: Expression reason for T350219 hack in ParsoidHandlerTest

https://gerrit.wikimedia.org/r/971318

I imagine that for most, if not all, of these a redirect follow would generally not retain a Host: header from the first request.

At least curl will re-use the Host header from the original request if the redirect is relative:

curl -k -L -v https://text-lb.esams.wikimedia.org/w/rest.php/v1/page/earth -o /dev/null -H "Host: en.wikipedia.org" 2>&1 | egrep -i '^[<>] (get|http|host|location)'
> GET /w/rest.php/v1/page/earth HTTP/2
> Host: en.wikipedia.org
< HTTP/2 301
< location: /w/rest.php/v1/page/Earth
> GET /w/rest.php/v1/page/Earth HTTP/2
> Host: en.wikipedia.org
< HTTP/2 200

This seems like the right thing to do, but the HTTP spec isn't explicit on this. It says that the Host header should be updated "as appropriate".
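
For reference, a small Python sketch of that behaviour (a single-hop redirect follow that reuses the original Host header only when the Location is relative). This mirrors what curl does above; it is not code from any of the patches.

from urllib.parse import urljoin, urlparse
import requests

def follow_one_redirect(url: str, host: str) -> requests.Response:
    # First request with an explicit Host header, without auto-following.
    # verify=False mirrors curl -k in the example above.
    first = requests.get(url, headers={"Host": host}, allow_redirects=False, verify=False)
    if first.status_code not in (301, 302, 307, 308):
        return first
    location = first.headers["Location"]
    headers = {}
    if not urlparse(location).netloc:
        # Relative redirect: we stay on the same server, so reuse the Host header.
        headers["Host"] = host
    return requests.get(urljoin(url, location), headers=headers, allow_redirects=False, verify=False)

# Equivalent of the curl invocation above:
# follow_one_redirect("https://text-lb.esams.wikimedia.org/w/rest.php/v1/page/earth", host="en.wikipedia.org")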

Change 972001 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a5

https://gerrit.wikimedia.org/r/972001

Change 972001 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a5

https://gerrit.wikimedia.org/r/972001

I'm seeing fewer 404s now (yay), but I'm still getting one when POSTing to https://test.wikipedia.org/api/rest_v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795.

I'm seeing fewer 404s now (yay), but I'm still getting one when POSTing to https://test.wikipedia.org/api/rest_v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795.

That is likely an unrelated problem. The change that caused the issue on the page/html endpoints should have no effect on the transform endpoint.

Unless perhaps you are relying on stashing, or on loading the HTML from the backend... Can you give me the cURL command for reproducing the issue?

That is likely an unrelated problem. The change that caused the issue on the page/html endpoints should have no effect on the transform endpoint.
Unless perhaps you are relying on stashing, or on loading the HTML from the backend...

My understanding is that transformation relies on the current content for selser to work (@ssastry can correct me if I'm wrong!)

Can you give me the cURL command for reproducing the issue?

$ curl -X POST -F "html=test" -H 'if-match: W/"554795/5d809250-7f5c-11ee-9e62-994af058fd24"' 'https://test.wikipedia.org/api/rest_v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795'
{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/not_found","title":"Not found.","method":"post","uri":"/test.wikipedia.org/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795"}

(the if-match value is the etag from https://test.wikipedia.org/api/rest_v1/page/html/Mwbot-rs%2FDISPLAYTITLE)

My understanding is that transformation relies on the current content for selser to work (@ssastry can correct me if I'm wrong!)

It can; it depends on what you post. And it only works if you had originally requested the HTML with the "stash" flag set.

I was not aware this mechanism was used by anything other than VE, and VE stopped using RESTbase half a year ago. So we are in the process of removing it from RESTbase.

$ curl -X POST -F "html=test" -H 'if-match: W/"554795/5d809250-7f5c-11ee-9e62-994af058fd24"' 'https://test.wikipedia.org/api/rest_v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795'
{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/not_found","title":"Not found.","method":"post","uri":"/test.wikipedia.org/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795"}

So your HTML content to convert is literally "test"?

(the if-match value is the etag from https://test.wikipedia.org/api/rest_v1/page/html/Mwbot-rs%2FDISPLAYTITLE)

Is that the exact request, or does it have stash=true in the query string?

When I fetch content from that URL, I see x-restbase-cache: disabled in the response, as expected. This also means stashing is disabled.

Because of this, I would expect the if-match to fail with status 412. Getting 404 is odd. I'll have a look.

In any case, can you try using the core endpoints instead? Core's transform endpoint isn't public under v1 yet, but you can already test it on testwiki. And it will soon be public, see T350661.

The new endpoints are:

My understanding is that transformation relies on the current content for selser to work (@ssastry can correct me if I'm wrong!)

It can; it depends on what you post. And it only works if you had originally requested the HTML with the "stash" flag set.

That's not my experience; I've never set the stash parameter/flag, but selser still applies.

I was not aware this mechanism was used by anything other than VE, and VE stopped using RESTbase half a year ago. So we are in the process of removing it from restbase.

transformation of content from HTML to wikitext? Yes it's used by a number of my bots and is a public API endpoint so...

$ curl -X POST -F "html=test" -H 'if-match: W/"554795/5d809250-7f5c-11ee-9e62-994af058fd24"' 'https://test.wikipedia.org/api/rest_v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795'
{"type":"https://mediawiki.org/wiki/HyperSwitch/errors/not_found","title":"Not found.","method":"post","uri":"/test.wikipedia.org/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795"}

So your HTML content to convert is literally "test"?

The actual Rust test case that was failing was posting the page's HTML back at it to see how it roundtrips, but I couldn't figure out how to escape the HTML for the command line, so I just tried with a plain string, and it reproduced the 404 error, so...

(the if-match value is the etag from https://test.wikipedia.org/api/rest_v1/page/html/Mwbot-rs%2FDISPLAYTITLE)

Is that the exact request, or does it have stash=true in the query string?

That's the exact request.

In any case, can you try using the core endpoints instead? Core's transform endpoint isn't public under v1 yet, but you can already test it on testwiki. And it will soon be public, see T350661.

The new endpoints are:

I can try that later. Do these API endpoints support the same level of concurrency as the existing APIs?

No, I'm still getting 404s with this:

[parsoid/src/api.rs:320] &req = Request {
    method: POST,
    url: Url {
        scheme: "https",
        cannot_be_a_base: false,
        username: "",
        password: None,
        host: Some(
            Domain(
                "test.wikipedia.org",
            ),
        ),
        port: None,
        path: "/w/rest.php/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE",
        query: None,
        fragment: None,
    },
    headers: {
        "content-type": "application/x-www-form-urlencoded",
        "accept": "text/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/2.8.0\"",
    },
}
[parsoid/src/api.rs:323] &resp = Response {
    url: Url {
        scheme: "https",
        cannot_be_a_base: false,
        username: "",
        password: None,
        host: Some(
            Domain(
                "test.wikipedia.org",
            ),
        ),
        port: None,
        path: "/w/rest.php/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE",
        query: None,
        fragment: None,
    },
    status: 404,
    headers: {
        "date": "Fri, 10 Nov 2023 19:36:19 GMT",
        "server": "mw2296.codfw.wmnet",
        "x-content-type-options": "nosniff",
        "cache-control": "no-cache",
        "access-control-allow-origin": "*",
        "vary": "Accept-Encoding",
        "content-length": "189",
        "content-type": "application/json",
        "age": "0",
        "x-cache": "cp1089 miss, cp1087 pass",
        "x-cache-status": "pass",
        "server-timing": "cache;desc=\"pass\", host;desc=\"cp1087\"",
        "strict-transport-security": "max-age=106384710; includeSubDomains; preload",
        "report-to": "{ \"group\": \"wm_nel\", \"max_age\": 604800, \"endpoints\": [{ \"url\": \"https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0\" }] }",
        "nel": "{ \"report_to\": \"wm_nel\", \"max_age\": 604800, \"failure_fraction\": 0.05, \"success_fraction\": 0.0}",
        "set-cookie": "WMF-Last-Access=10-Nov-2023;Path=/;HttpOnly;secure;Expires=Tue, 12 Dec 2023 12:00:00 GMT",
        "set-cookie": "WMF-Last-Access-Global=10-Nov-2023;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Tue, 12 Dec 2023 12:00:00 GMT",
        "set-cookie": "GeoIP=CA:ON:Toronto:43.68:-79.35:v4; Path=/; secure; Domain=.wikipedia.org",
        "set-cookie": "NetworkProbeLimit=0.001;Path=/;Secure;Max-Age=3600",
        "x-client-ip": "205.189.187.4",
    },
}

That's not my experience; I've never set the stash parameter/flag, but selser still applies.

There is still a best-effort mode, which will usually work. But IIRC, if-match will only work with an etag returned when using the stash flag.

transformation of content from HTML to wikitext? Yes it's used by a number of my bots and is a public API endpoint so...

I was referring to stashing.

To what degree the parsoid endpoints have ever been considered stable for 3rd party use is a bit unclear. But I agree that we should have a public endpoint for performing these transformations.

The actual Rust test case that was failing was posting the page's HTML back at it to see how it roundtrips, but I couldn't figure out how to escape the HTML for the command line, so I just tried with a plain string, and it reproduced the 404 error, so...

Ok. I'll try to figure out what's going on with that.

I can try that later. Do these API endpoints support the same level of concurrency as the existing APIs?

In principle, yes. In practice, it's handled by a different cluster of servers, so who knows.

No, I'm still getting 404s with this:

Ah right, sorry, I gave you the wrong endpoint URLs. These *will* be correct when this week's train rolls out. Until then, it's https://test.wikipedia.org/w/rest.php/coredev/v0/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795. With the if-match header, I get a status 412, as expected. With the header omitted, the request goes through. The response is empty, though; I'm not quite sure what's going on. I don't remember if this endpoint is designed to handle form data. Perhaps you'd have to send JSON.

The actual Rust test case that was failing was posting the page's HTML back at it to see how it roundtrips, but I couldn't figure out how to escape the HTML for the command line, so I just tried with a plain string, and it reproduced the 404 error, so...

Ok. I'll try to figure out what's going on with that.

Looks like this is an actual bug. https://github.com/wikimedia/restbase/pull/1337 should fix it.

@Legoktm This works now:

curl -X POST -F "html=test" 'https://en.wikipedia.org/w/rest.php/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE'

Could you make your bot use this endpoint instead of the rest_v1 one? And could you also use the new endpoint for fetching HTML, e.g.

curl 'https://en.wikipedia.org/w/rest.php/v1/page/Main_Page/with_html'

The plan is to remove the parsoid specific endpoint along with RESTbase soon-ish: T334238: Create deprecation plan for public parsoid endpoints
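
For what it's worth, a rough sketch of that migration using the two endpoints from the curl commands above. The "html" field in the with_html JSON response and the plain-text wikitext body of the transform response are assumptions about the response formats, not something confirmed in this task.

import requests

BASE = "https://en.wikipedia.org/w/rest.php/v1"

def fetch_html(title: str) -> str:
    # GET /page/{title}/with_html (see the curl example above).
    r = requests.get(f"{BASE}/page/{title}/with_html")
    r.raise_for_status()
    return r.json()["html"]  # assumed field name

def html_to_wikitext(title: str, html: str) -> str:
    # POST /transform/html/to/wikitext/{title}, multipart form data like curl -F.
    r = requests.post(
        f"{BASE}/transform/html/to/wikitext/{title}",
        files={"html": (None, html)},
    )
    r.raise_for_status()
    return r.text  # assumed to be the wikitext as plain text

if __name__ == "__main__":
    html = fetch_html("Main_Page")
    print(html_to_wikitext("Main_Page", html)[:200])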

Looks like this is an actual bug. https://github.com/wikimedia/restbase/pull/1337 should fix it.

This is currently stuck because, apparently, RESTbase CI is entirely b0rked on GitHub.

@Legoktm This works now:

curl -X POST -F "html=test" 'https://en.wikipedia.org/w/rest.php/v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE'

Note that this page exists on testwiki, but yes, running that curl command against test.wikipedia.org seems to work. Yay!

Could you make your bot use this endpoint instead of the rest_v1 one? And could you also use the new endpoint for fetching HTML, e.g.

curl 'https://en.wikipedia.org/w/rest.php/v1/page/Main_Page/with_html'

The plan is to remove the parsoid specific endpoint along with RESTbase soon-ish: T334238: Create deprecation plan for public parsoid endpoints

These endpoints don't expose the same functionality that RESTBase did (based on the documentation). For example, there's no way to get the HTML of an old revision. How are redirects handled? Happy to file tickets for missing functionality or comment elsewhere, but right now it's incomplete for switching.

I can run tests against the transform endpoint in the next few days to make sure it works fine.

These endpoints don't expose the same functionality that RESTBase did (based on the documentation). For example, there's no way to get the HTML of an old revision. How are redirects handled? Happy to file tickets for missing functionality or comment elsewhere, but right now it's incomplete for switching.

The core endpoints are structured a bit differently, but should cover all functionality of the RESTBase endpoints. Old revisions can be accessed using e.g. https://en.wikipedia.org/w/rest.php/v1/revision/764138197/with_html.

All page endpoints will handle title normalization as HTTP redirects. The /html and /with_html endpoints will also follow wiki redirects (as well as magic variant redirects) unless redirect=no is specified; try https://en.wikipedia.org/w/rest.php/v1/page/USA/with_html. The documentation was a bit incomplete in that regard.
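
As a quick illustration of the behaviours just described (the URLs are the ones from this comment; only the redirect=no query parameter is added, and no particular status codes are assumed):

import requests

# HTML of an old revision:
old_rev = requests.get("https://en.wikipedia.org/w/rest.php/v1/revision/764138197/with_html")

# Wiki redirect followed by default:
followed = requests.get("https://en.wikipedia.org/w/rest.php/v1/page/USA/with_html")

# Redirect following suppressed:
not_followed = requests.get(
    "https://en.wikipedia.org/w/rest.php/v1/page/USA/with_html",
    params={"redirect": "no"},
)

print(old_rev.status_code, followed.status_code, not_followed.status_code)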

Please file tickets for any missing functionality, rather than discussing it here.

I can run tests against the transform endpoint in the next few days to make sure it works fine.

Thank you!

I can run tests against the transform endpoint in the next few days to make sure it works fine.

@Legoktm: Any news on this?

Just curious, this isn't blocking anything for me.

daniel claimed this task.

@Jgiannelos deployed a new version of RESTbase today. This works for me now:

curl -X POST -F "html=test" -H 'if-match: W/"554795/5d809250-7f5c-11ee-9e62-994af058fd24"' 'https://test.wikipedia.org/api/rest_v1/transform/html/to/wikitext/Mwbot-rs%2FDISPLAYTITLE/554795'

That resolves the last remaining issue. Closing.

Change 971318 abandoned by Krinkle:

[mediawiki/core@master] Rest: Document reason for T350219 hack in ParsoidHandlerTest

Reason:

https://gerrit.wikimedia.org/r/971318

For future reference: The code in question was removed in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/971308 per T350359: Parsoid extension page endpoints should not redirect to latest revision.