Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests"
Closed, DeclinedPublic

Description

See this thread on enwiki VPT for background.

When I load either of:

https://en.wikipedia.org/wiki/File:Young_Folks%27_History_of_Rome_illus090.png
https://commons.wikimedia.org/wiki/File:Young_Folks%27_History_of_Rome_illus090.png

The thumbnail shows with a broken image icon. In my browser console, I get:

Request URL: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png
Request Method: GET
Status Code: 429 
Remote Address: 208.80.153.240:443
Referrer Policy: origin-when-cross-origin
content-length: 1821
content-type: text/html; charset=utf-8
date: Wed, 30 Jun 2021 15:11:20 GMT
nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
permissions-policy: interest-cohort=()
report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
server: Varnish
server-timing: cache;desc="int-front", host;desc="cp2040"
strict-transport-security: max-age=106384710; includeSubDomains; preload
x-cache: cp2040 int
x-cache-status: int-front
x-client-ip: 69.115.77.204

Event Timeline

Xaosflux renamed this task from Image load failing with 409 from varnish to Image load failing with 429 from varnish.Jun 30 2021, 3:22 PM

From skimming c:User:Helix84/gallery, where a bunch of images fail to load in this way, it looks like most or all of the affected images have names of the form Young Folks' History of Rome illusNNN.png, where NNN is a three-digit number. Also, I get the occasional 500 for these images as well, but very rarely.

Legoktm triaged this task as Unbreak Now! priority.Jul 1 2021, 12:35 AM
Legoktm subscribed.

In general we've made some changes recently to rate limiting after repeated abuse/DDoS attacks. Could you please clarify what software (e.g. Firefox, Chrome, some other tool) you're using to access pages/images that is returning 429s?

I can reproduce https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png not working with a 429, so I'm looking into that.

Could you please clarify what software (e.g. Firefox, Chrome, some other tool) you're using to access pages/images that is returning 429s?

I'm not sure who that was intended for, but I get similar results on all of:

Chrome Version 91.0.4472.114 (Official Build) (x86_64)
Safari Version 14.1.1 (16611.2.7.1.4)
Firefox 88.0.1 (64-bit)
or curl in a terminal window
All running on macOS Big Sur (11.4)

With curl, I get:

Roys-MacBook-Pro:~ roy$ curl -v https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png
*   Trying 208.80.153.240...
* TCP_NODELAY set
* Connected to upload.wikimedia.org (208.80.153.240) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-ECDSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=*.wikipedia.org
*  start date: May 16 08:01:46 2021 GMT
*  expire date: Aug 14 08:01:46 2021 GMT
*  subjectAltName: host "upload.wikimedia.org" matched cert's "*.wikimedia.org"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fad2d00d600)
> GET /wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png HTTP/2
> Host: upload.wikimedia.org
> User-Agent: curl/7.64.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 429 
< date: Thu, 01 Jul 2021 01:00:44 GMT
< server: Varnish
< x-cache: cp2040 int
< x-cache-status: int-front
< server-timing: cache;desc="int-front", host;desc="cp2040"
< strict-transport-security: max-age=106384710; includeSubDomains; preload
< report-to: { "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }
< nel: { "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, "success_fraction": 0.0}
< permissions-policy: interest-cohort=()
< set-cookie: WMF-Last-Access=01-Jul-2021;Path=/;HttpOnly;secure;Expires=Mon, 02 Aug 2021 00:00:00 GMT
< x-client-ip: 69.115.77.204
< content-type: text/html; charset=utf-8
< content-length: 1821
< 
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { overflow: hidden; overflow-wrap: break-word; word-wrap: break-word; -webkit-hyphens: auto; -moz-hyphens: auto; -ms-hyphens: auto; hyphens: auto; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf-logo.png" srcset="https://www.wikimedia.org/static/images/wmf-logo-2x.png 2x" alt="Wikimedia" width="135" height="101">
</a>
<h1>Error</h1>
<div class="content-text">
<p>Our servers are currently under maintenance or experiencing a technical problem.

Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>

<p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
</div>
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 69.115.77.204 via cp2040 cp2040, Varnish XID 579114638<br>Upstream caches: cp2040 int<br>Error: 429, Too Many Requests at Thu, 01 Jul 2021 01:00:44 GMT</code></p>
</div>
</html>
* Connection #0 to host upload.wikimedia.org left intact
* Closing connection 0

Could you please clarify what software (e.g. Firefox, Chrome, some other tool) you're using to access pages/images that is returning 429s?

I'm not sure who that was intended for, but I get similar results on all of:

Anyone who was able to reproduce the issue. But I think I can safely say it's client independent.

One other question, has anyone seen this on full size images? Or just thumbnails?

If you could generate the URLs for the other size images, I'd be happy to give them a try from here.

Legoktm lowered the priority of this task from Unbreak Now! to High.Jul 1 2021, 2:37 AM
Legoktm edited projects, added Thumbor; removed Traffic.
Legoktm added subscribers: colewhite, Gilles.

@colewhite and I dug into this, it appears to be an issue with Thumbor:

1legoktm@cumin1001:~$ curl http://thumbor.svc.codfw.wmnet:8800/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png -I
2HTTP/1.1 429 Too Many Requests
3Thumbor-Memcache-Get-Time: 1
4Thumbor-Request-Date: Thu, 01 Jul 2021 02:22:10 GMT
5Content-Length: 0
6Server: TornadoServer/4.4.3
7Cache-Control: no-cache
8Date: Thu, 01 Jul 2021 02:22:10 GMT
9Proxy-Request-Date: 01/Jul/2021:02:22:10 +0000
10Content-Type: text/html; charset=UTF-8
11Proxy-Response-Date: 01/Jul/2021:02:22:10 +0000
12X-Upstream: 127.0.0.1:8804
13Thumbor-Request-Id: 0A402019:AC40_0A020118:2260_60DD26D2_289F78C:0B26
14
15legoktm@cumin1001:~$ curl http://thumbor.svc.eqiad.wmnet:8800/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png -I
16HTTP/1.1 500 Internal Server Error
17Thumbor-Memcache-Get-Time: 1
18Thumbor-Request-Date: Thu, 01 Jul 2021 02:23:00 GMT
19Thumbor-Swift-Original-Success-Time: 27
20Thumbor-Convert-Time: 60
21Thumbor-Convert-Time: 46
22Thumbor-Memcache-Set-Time: 1
23Content-Length: 0
24Server: TornadoServer/4.4.3
25Thumbor-Swift-Thumbnail-Miss-Time: 62
26Thumbor-Poolcounter-Time: 7
27Thumbor-Exiftool-Time: 199
28Thumbor-Exiftool-Time: 216
29Date: Thu, 01 Jul 2021 02:23:00 GMT
30Cache-Control: no-cache
31Proxy-Request-Date: 01/Jul/2021:02:23:00 +0000
32Content-Type: text/html; charset=UTF-8
33Thumbor-Engine: wikimedia_thumbor.engine.imagemagick
34Proxy-Response-Date: 01/Jul/2021:02:23:00 +0000
35X-Upstream: 127.0.0.1:8802
36Thumbor-Request-Id: 0A402019:C9C2_0A020218:2260_60DD2704_2B3FA9B:0771

Looking at https://logstash.wikimedia.org/goto/b74fa1ac65d1c96d08666f798a7f1fad we found that this image has failed to render for ~30 days now, so we're downgrading it from UBN. There's definitely a discrepancy about why the codfw thumbor (now serving user traffic post switchover) returns a 429, incorrectly I think, while eqiad returns a 500.

The specific error message in eqiad is

1Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: 2021-07-01 02:23:01,314 8802 thumbor:ERROR [ImagesHandler] Exception during _load_results
2Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: Traceback (most recent call last):
3Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/handler/images/images.py", line 581, in _load_results
4Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: results, content_type = BaseHandler._load_results(self, context)
5Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 334, in _load_results
6Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: results = context.request.engine.read(image_extension, quality)
7Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/engine/proxy/proxy.py", line 133, in read
8Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: ret = self.__getattr__('read')(extension, quality)
9Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/engine/imagemagick/imagemagick.py", line 332, in read
10Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: raise ImageMagickException('Failed to convert image %s' % stderr) # pragma: no cover
11Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: ImageMagickException: Failed to convert image convert: IDAT: invalid distance too far back `/srv/thumbor/tmp/thumbor@8802/tmpszMzGO' @ error/png.c/MagickPNGErrorHandler/1628.
12Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: convert: no images defined `png:-' @ error/convert.c/ConvertImageCommand/3258.
13Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: 2021-07-01 02:23:01,319 8802 tornado.access:ERROR 500 HEAD /wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png (10.64.32.25) 702.34ms
14Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: 2021-07-01 02:23:01,463 8802 tornado.application:ERROR Future exception was never retrieved: Traceback (most recent call last):
15Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/tornado/gen.py", line 270, in wrapper
16Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: result = func(*args, **kwargs)
17Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 407, in finish_request
18Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: callback=inner,
19Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/thumbor/context.py", line 287, in queue
20Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: self._execute_in_foreground(operation, callback)
21Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/thumbor/context.py", line 275, in _execute_in_foreground
22Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: callback(result)
23Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 398, in inner
24Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: self._write_results_to_client(context, results, content_type)
25Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/wikimedia_thumbor/handler/images/images.py", line 600, in _write_results_to_client
26Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: BaseHandler._write_results_to_client(self, context, results, content_type)
27Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/thumbor/handlers/__init__.py", line 424, in _write_results_to_client
28Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: self.set_header('Content-Type', content_type)
29Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 334, in set_header
30Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: self._headers[name] = self._convert_header_value(value)
31Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: File "/usr/lib/python2.7/dist-packages/tornado/web.py", line 379, in _convert_header_value
32Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: raise TypeError("Unsupported header value %r" % value)
33Jul 1 02:23:01 thumbor1003 thumbor@8802[11966]: TypeError: Unsupported header value None

That suggests something is wrong with the image or in imagemagick.

T226318#5282215 suggests that the 429 vs 500 may be a red herring in that thumbor will refuse to re-render a file it failed to render previously given that it's not going to make a difference.

Legoktm renamed this task from Image load failing with 429 from varnish to Thumbor fails to render PNG with "Failed to convert image convert: IDAT: invalid distance too far back", returns 429 "Too Many Requests".Jul 1 2021, 2:45 AM

@colewhite and I dug into this, it appears to be an issue with Thumbor:

Indeed. What made the investigation harder is that we add a custom error page in Varnish if the origin server returned an error (4xx or 5xx does not matter) with no response body. Here is the response from Thumbor in codfw:

legoktm@cumin1001:~$ curl http://thumbor.svc.codfw.wmnet:8800/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png -I
HTTP/1.1 429 Too Many Requests
Thumbor-Memcache-Get-Time: 1
Thumbor-Request-Date: Thu, 01 Jul 2021 02:22:10 GMT
Content-Length: 0
Server: TornadoServer/4.4.3

The status code is 4xx, Content-Length is 0, making Varnish take the VCL branch above if (resp.http.Content-Length == "0" && resp.status >= 400) and generate a "synthetic" error page.

$ curl -v https://upload.wikimedia.org/wikipedia/commons/thumb/a/a1/Young_Folks%27_History_of_Rome_illus090.png/120px-Young_Folks%27_History_of_Rome_illus090.png
[...]
< HTTP/2 429 
< date: Thu, 01 Jul 2021 08:45:02 GMT
< server: Varnish
< x-cache: cp4025 int
< x-cache-status: int-front

Obviously by looking at this response one may think that the 429 error came from Varnish, while in fact it was from Thumbor. Ideally, we should preserve the value of the Server response header in this sort of situation if possible, or add another header (eg: X-Original-Sever) otherwise.

This is a libpng error (via image magick). Likely these images were always problematic, but the problem only started showing in more recent png versions after libpng 1.6. https://github.com/glennrp/libpng/issues/125

Probably resaving them with an image program (which isn’t as safe as libpng) and reuploading is the only fix ?

AntiCompositeNumber lowered the priority of this task from High to Medium.Jul 2 2021, 2:17 AM
AntiCompositeNumber moved this task from Backlog to Broken or missing thumbnails on the Thumbor board.

T226318#5282215 suggests that the 429 vs 500 may be a red herring in that thumbor will refuse to re-render a file it failed to render previously given that it's not going to make a difference.

Reader-reported 429s for thumbnails or originals are almost always https://wikitech.wikimedia.org/wiki/Thumbor#Throttling. 429s on a single file with an occasional 500 thrown in is almost certainly the Memcached-based per-original failure throttling. If it's multiple files with all 429s on any sort of gallery page, that would point to the Poolcounter throttles (usually per-IP).

The Thumbor throttling configuration is at rOPUP modules/thumbor/templates/server.conf.erb:149.

The memcached throttle is datacenter-specific, so it is expected that one cluster may throttle an original while the other may not. Swift only repeats the request to the inactive thumborhost if the active thumborhost is successful in generating the thumbnail.


Thanks for the logs. This does appear to be caused by a change in behavior in libpng since the thumbnails for this file were last generated. The old thumbnails were purged from Swift for some reason, and because of the change in behavior, they can't be regenerated. The image displayed on the file description page is still visible because it is the un-transformed original.

The libpng manual suggests using the included pngfix tool to fix files with this error. I've done so for this file using the following command:

$ pngfix --out=Young_Folks\'_History_of_Rome_illus090-fixed.png Young_Folks\'_History_of_Rome_illus090.png
IDAT TFB maximum 9 15 24542 59691 Young_Folks'_History_of_Rome_illus090.png

Unfortunately that's not something that can be done automatically at the Thumbor level, so any other files with this error will have to be fixed manually. I'm able to open the fixed file locally with ImageMagick/libpng, but we'll have to wait for the Memcached throttle to clear (1 hour) before Thumbor will try to render the new thumbnail.

thanks @AntiCompositeNumber for fixing the image and the additional detail. I agree with declining this task.

Would it be useful if I dug through the logs for other such files that don't thumb correctly?

Sure, if you get me a list of originals failing with IDAT: invalid distance too far back I can work on fixing them.