ferm sometimes fails to restart on Kubernetes workers due to the xtables lock being held by kube-proxy
Closed, Resolved, Public

Description

On Kubernetes workers, ferm sometimes fails to restart (such restarts are triggered by Puppet, for example, when a central Ferm macro is updated). One example:

Jan 03 19:30:08 mw1465 systemd[1]: Stopped ferm firewall configuration.
Jan 03 19:30:08 mw1465 systemd[1]: Starting ferm firewall configuration...
Jan 03 19:30:08 mw1465 ferm[1868235]: Starting Firewall: ferm
Jan 03 19:30:08 mw1465 ferm[1868274]: Another app is currently holding the xtables lock. Perhaps you want to use the -w option?
Jan 03 19:30:08 mw1465 ferm[1868238]: Failed to run /usr/sbin/iptables-legacy-restore
Jan 03 19:30:08 mw1465 ferm[1868238]: Firewall rules rolled back.
Jan 03 19:30:08 mw1465 ferm[1868281]:  failed!
Jan 03 19:30:08 mw1465 systemd[1]: ferm.service: Main process exited, code=exited, status=1/FAILURE
Jan 03 19:30:08 mw1465 systemd[1]: ferm.service: Failed with result 'exit-code'.
Jan 03 19:30:08 mw1465 systemd[1]: Failed to start ferm firewall configuration.
Jan 08 15:18:08 mw1465 systemd[1]: Starting ferm firewall configuration...

These do not recover automatically with the subsequent Puppet run, apparently because this error condition does not get detected by ferm-status.

We could explore whether there's a way to pass -w to iptables-save/iptables-restore via Ferm. From a quick look, no such option exists, but this needs a closer look at the sources.
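
For illustration only, here is a minimal Python sketch of what waiting for the xtables lock could look like outside of ferm. It assumes the installed iptables-legacy-restore accepts `-w <seconds>`, as the error message above suggests. The function names, retry count, and delays are hypothetical and do not reflect how ferm itself invokes the restore.

```python
import subprocess
import time

# Substring of the error printed when another process holds the lock.
LOCK_MESSAGE = "holding the xtables lock"


def build_restore_command(wait_seconds=5):
    """Build an iptables-legacy-restore call that waits for the xtables lock."""
    return ["/usr/sbin/iptables-legacy-restore", "-w", str(wait_seconds)]


def restore_with_retry(rules, run=subprocess.run, attempts=3, delay=1.0,
                       wait_seconds=5, sleep=time.sleep):
    """Feed `rules` to the restore command, retrying only on lock contention.

    Returns True on success, False if all attempts fail or if the failure
    is unrelated to the xtables lock.
    """
    for attempt in range(1, attempts + 1):
        result = run(
            build_restore_command(wait_seconds),
            input=rules,
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return True
        if LOCK_MESSAGE not in result.stderr:
            # Not a lock problem (e.g. a syntax error): retrying will not help.
            return False
        if attempt < attempts:
            sleep(delay)
    return False
```

The `run` and `sleep` parameters exist only so the logic can be exercised without root privileges or a real firewall.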

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2024-01-23T12:17:11Z] <claime> Restarting ferm.service on k8s node mw1495.eqiad.wmnet - T354855

Mentioned in SAL (#wikimedia-operations) [2024-01-25T11:26:43Z] <claime> Restarting ferm.service on k8s node kubernetes2036.codfw.wmnet - T354855

Mentioned in SAL (#wikimedia-operations) [2024-01-29T13:26:33Z] <claime> Restarting ferm.service on k8s node kubernetes2055 - T354855

Mentioned in SAL (#wikimedia-operations) [2024-02-02T12:16:00Z] <claime> Restarting ferm.service on k8s node mw1424 - T354855

Mentioned in SAL (#wikimedia-operations) [2024-02-21T14:08:25Z] <claime> restarted ferm.service on kubernetes2055.codfw.wmnet mw2440.codfw.wmnet mw2297.codfw.wmnet kubernetes2016.codfw.wmnet - T354855

Change 1005978 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] ferm: Check ferm.service status in ferm_status.py

https://gerrit.wikimedia.org/r/1005978

Mentioned in SAL (#wikimedia-operations) [2024-03-04T11:47:54Z] <claime> Disabling puppet on C:profile::firewall::log::ferm to deploy 1005978 - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T11:48:56Z] <claime> Disregard previous puppet disable message, waiting a bit T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:22:51Z] <claime> Disabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855

Change 1005978 merged by Clément Goubert:

[operations/puppet@production] ferm: Check ferm.service status in ferm_status.py

https://gerrit.wikimedia.org/r/1005978

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:28:32Z] <claime> Enabling puppet on kubernetes2019 to test new ferm_status.py - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:30:53Z] <claime> Enabling puppet on mw2322 to test new ferm_status.py - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:33:11Z] <claime> Enabling puppet on puppetboard2003 to test new ferm_status.py - T354855

Mentioned in SAL (#wikimedia-operations) [2024-03-04T12:38:06Z] <claime> Re-enabling puppet on C:profile::firewall::log::ferm to deploy new ferm_status.py - T354855

Clement_Goubert claimed this task.
Clement_Goubert subscribed.

Deployed, puppet now restarts ferm.service if the systemd unit's status is failed.
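
As a rough sketch of the idea (not the actual contents of ferm_status.py), the check can be combined with the systemd unit state like this. `systemctl is-failed` exits with 0 when the unit is in the failed state. The function names and the `rules_in_sync` flag are hypothetical.

```python
import subprocess


def unit_is_failed(unit="ferm.service", run=subprocess.run):
    """Return True if systemd reports `unit` as failed.

    `systemctl is-failed` exits 0 when the unit is in the failed state
    and non-zero otherwise.
    """
    result = run(["systemctl", "is-failed", "--quiet", unit], check=False)
    return result.returncode == 0


def needs_restart(rules_in_sync, unit_failed):
    """Request a restart if the rules have drifted or the unit has failed."""
    return (not rules_in_sync) or unit_failed
```

With this shape, a unit that failed on the xtables lock is flagged even when the rolled-back rules still match what the check expects.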

Change #1031440 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] kubernetes: Space out ferm icinga check

https://gerrit.wikimedia.org/r/1031440

Change #1031440 merged by Clément Goubert:

[operations/puppet@production] kubernetes: Space out ferm icinga check

https://gerrit.wikimedia.org/r/1031440

Clement_Goubert raised the priority of this task from Medium to High.

This issue is biting us again: the gap between a Puppet run that fails to start ferm.service and the subsequent run is long enough to cause issues with eventgate, leading to flurries of "Can't enqueue job" errors in MediaWiki. It is triggered in particular when we add new Kubernetes nodes, because that causes all other Kubernetes nodes in the cluster to update their ferm rules.

Would adding Restart=on-failure to ferm.service for Kubernetes workers be a possible short-term solution, @MoritzMuehlenhoff?
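
A sketch of what such an override could look like as a systemd drop-in. The drop-in file name and the RestartSec value are assumptions for illustration, not taken from any actual patch.

```ini
# Hypothetical drop-in: /etc/systemd/system/ferm.service.d/restart.conf
[Service]
# Retry automatically when the start job exits non-zero,
# e.g. because the xtables lock was held at that moment.
Restart=on-failure
# Assumed delay between attempts, to give the lock holder time to finish.
RestartSec=5
```

After adding a drop-in like this, `systemctl daemon-reload` is needed for systemd to pick it up.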

Possibly, but we'd have to find out whether that works reliably. ferm is a sysvinit script after all, and those aren't the greatest at passing through error states.

The alternative is to untangle the check from puppet runs and instead have it run as a systemd timer more often.

Change #1051378 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] P:kubernetes:node: Autorestart ferm.service

https://gerrit.wikimedia.org/r/1051378

Change #1051378 merged by Clément Goubert:

[operations/puppet@production] P:kubernetes:node: Autorestart ferm.service

https://gerrit.wikimedia.org/r/1051378

Change #1052283 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] check_ferm: Add -w 5 to iptables check

https://gerrit.wikimedia.org/r/1052283

Change #1052283 merged by Clément Goubert:

[operations/puppet@production] check_ferm: Add -w 5 to iptables check

https://gerrit.wikimedia.org/r/1052283

We've only had one spike of job enqueuing errors since merging Restart=on-failure, so I'll call this resolved for now and reopen if we see problems again.