Migrate mw_rc_irc servers to Bullseye
Closed, Resolved, Public

Description

These are still on Buster. To upgrade the existing stack at least two things need to be resolved:

  • Build the patched ratbox for bullseye-wikimedia
  • Migrate the Python parts of the broadcasting stack to Python 3
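
A typical pitfall in this kind of Python 3 port is that UDP sockets return bytes rather than str, so payloads must be decoded before any string handling. The following is a minimal sketch, not the actual udpmxircecho code: the function name `decode_datagram` and the tab-separated `channel<TAB>message` payload layout are assumptions made for illustration.

```python
from typing import Tuple


def decode_datagram(payload: bytes) -> Tuple[str, str]:
    """Split a UDP payload into (channel, message) as text.

    Assumes (hypothetically) that the payload looks like b"#channel\\tmessage\\n".
    Under Python 2 the payload was already a str; under Python 3 it is bytes
    and has to be decoded explicitly.
    """
    # errors="replace" keeps the echo running if a payload is not valid UTF-8.
    text = payload.decode("utf-8", errors="replace")
    channel, _, message = text.partition("\t")
    return channel, message.rstrip("\r\n")


if __name__ == "__main__":
    print(decode_datagram(b"#en.wikipedia\tExample edit\n"))
```

Decoding once at the socket boundary keeps the rest of the code working on str only, which tends to be the least invasive way to port such a script.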

The alternative is to move forward with https://phabricator.wikimedia.org/T234234

Event Timeline

Change 902077 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Move udpmixircecho to Python 3 for Bullseye hosts

https://gerrit.wikimedia.org/r/902077

Change 902077 merged by Muehlenhoff:

[operations/puppet@production] Move udpmixircecho to Python 3 for Bullseye hosts

https://gerrit.wikimedia.org/r/902077

Change 902309 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add irc2002

https://gerrit.wikimedia.org/r/902309

Change 902309 merged by Muehlenhoff:

[operations/puppet@production] Add irc2002

https://gerrit.wikimedia.org/r/902309

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host irc2002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host irc2002.wikimedia.org with OS bullseye completed:

  • irc2002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231044_jmm_947400_irc2002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Change 902325 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Assign mw_rc_irc role to irc2002

https://gerrit.wikimedia.org/r/902325

Change 902325 merged by Muehlenhoff:

[operations/puppet@production] Assign mw_rc_irc role to irc2002

https://gerrit.wikimedia.org/r/902325

Change 902349 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Fix typo in systemctl command

https://gerrit.wikimedia.org/r/902349

Change 902349 merged by Muehlenhoff:

[operations/puppet@production] Fix typo in systemctl command

https://gerrit.wikimedia.org/r/902349

Change 902362 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Fix NRPE check when running under Python 3

https://gerrit.wikimedia.org/r/902362

Change 902362 merged by Muehlenhoff:

[operations/puppet@production] Fix NRPE check when running under Python 3

https://gerrit.wikimedia.org/r/902362

Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host irc1002.wikimedia.org with OS bullseye

Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host irc1002.wikimedia.org with OS bullseye completed:

  • irc1002 (PASS)
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Set boot to disk
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/ganeti/reimage/202303231503_jmm_1164999_irc1002.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed

Change 905626 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/mediawiki-config@master] Also broadcast RCFeed/IRC events to irc1002/irc2002

https://gerrit.wikimedia.org/r/905626

Change 905626 merged by jenkins-bot:

[operations/mediawiki-config@master] Also broadcast RCFeed/IRC events to irc1002/irc2002

https://gerrit.wikimedia.org/r/905626

Mentioned in SAL (#wikimedia-operations) [2023-04-17T09:04:26Z] <ladsgroup@deploy2002> Started scap: Backport for [[gerrit:905626|Also broadcast RCFeed/IRC events to irc1002/irc2002 (T331702)]]

Mentioned in SAL (#wikimedia-operations) [2023-04-17T09:12:54Z] <ladsgroup@deploy2002> jmm and ladsgroup: Backport for [[gerrit:905626|Also broadcast RCFeed/IRC events to irc1002/irc2002 (T331702)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2023-04-17T09:48:45Z] <ladsgroup@deploy2002> Finished scap: Backport for [[gerrit:905626|Also broadcast RCFeed/IRC events to irc1002/irc2002 (T331702)]] (duration: 44m 21s)

Icinga downtime and Alertmanager silence (ID=a3bb7d5d-06b6-47e0-986b-d299e4bb9639) set by jmm@cumin2002 for 2 days, 0:00:00 on 1 host(s) and their services with reason: Non-functional, WIP for Bullseye update

irc2002.wikimedia.org

Change #1043038 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] udpmxircecho: One more Python 2 -> Python 3 fix

https://gerrit.wikimedia.org/r/1043038

Change #1043038 merged by Muehlenhoff:

[operations/puppet@production] udpmxircecho: One more Python 2 -> Python 3 fix

https://gerrit.wikimedia.org/r/1043038

Change #1046659 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] irc.w.o: Add support for Bullseye

https://gerrit.wikimedia.org/r/1046659

Mentioned in SAL (#wikimedia-operations) [2024-06-18T07:56:51Z] <moritzm> uploaded python-irc 8.5.3+dfsg-4+wmf1 to apt.wikimedia.org T331702

Change #1046659 merged by Muehlenhoff:

[operations/puppet@production] irc.w.o: Add support for Bullseye

https://gerrit.wikimedia.org/r/1046659

Change #1047021 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] mw-irc: Fix installation of Prometheus Python client package

https://gerrit.wikimedia.org/r/1047021

Change #1047021 merged by Muehlenhoff:

[operations/puppet@production] mw-irc: Fix installation of Prometheus Python client package

https://gerrit.wikimedia.org/r/1047021

The Bullseye-based servers are up and running; one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as to irc1001/irc2001.

However, when joining the #en.wikipedia channel on irc1001 (current) and irc1002 (new) and comparing the events, the latter shows fewer events. So there seems to be some issue with the broadcast MediaWiki events (or with the way these are ingested on the new installations).

Change #1047430 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/deployment-charts@master] Add new IRC servers also to the k8s hosts

https://gerrit.wikimedia.org/r/1047430

The reason for the reduced output on the new servers turned out to be unrelated to the ircecho changes or the new OS: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/905626 added irc1002/irc2002 to mediawiki-config back in April 2023, but that change never made it into deployment-charts/wikikube. Consequently, the new servers only received the remaining mw-on-baremetal traffic, until that dried up fully yesterday evening.
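
A mismatch like this, where one configuration source lists broadcast targets that another lacks, can be spotted with a simple comparison. The sketch below is illustrative only: the function `missing_targets` and the two host lists are hypothetical stand-ins, and it does not parse the real mediawiki-config or deployment-charts files.

```python
from typing import Iterable, List


def missing_targets(reference: Iterable[str], candidate: Iterable[str]) -> List[str]:
    """Return hosts listed in `reference` but absent from `candidate`.

    The order of `reference` is preserved so the output is stable.
    """
    present = set(candidate)
    return [host for host in reference if host not in present]


if __name__ == "__main__":
    # Hypothetical snapshots of the two target lists before the fix.
    mediawiki_config = ["irc1001", "irc2001", "irc1002", "irc2002"]
    deployment_charts = ["irc1001", "irc2001"]
    print(missing_targets(mediawiki_config, deployment_charts))
```

With these sample lists, the hosts reported as missing are irc1002 and irc2002.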

Change #1047430 merged by Muehlenhoff:

[operations/deployment-charts@master] Add new IRC servers also to the k8s hosts

https://gerrit.wikimedia.org/r/1047430

Change #1047851 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/dns@master] Switch irc.wikimedia to one of the new Bullseye hosts

https://gerrit.wikimedia.org/r/1047851

Change #1047851 merged by Muehlenhoff:

[operations/dns@master] Switch irc.wikimedia to one of the new Bullseye hosts

https://gerrit.wikimedia.org/r/1047851

Mentioned in SAL (#wikimedia-operations) [2024-06-20T07:04:00Z] <moritzm> failover irc.wikimedia.org to the new Bullseye servers T331702

Mentioned in SAL (#wikimedia-operations) [2024-06-20T08:08:14Z] <moritzm> reboot of irc1001 to nudge clients to re-connect to the new bullseye host T331702

I've pointed the irc.w.o CNAME to irc1002 and rebooted irc1001 to force a failover to irc1002. All bots which were connected to #de.wikipedia reconnected to the new node.

I'll decommission the old servers on Monday.

Change #1049137 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/deployment-charts@master] irc.wikimedia.org: Stop sending broadcast events to the old buster nodes

https://gerrit.wikimedia.org/r/1049137

Change #1049137 merged by Muehlenhoff:

[operations/deployment-charts@master] irc.wikimedia.org: Stop sending broadcast events to the old buster nodes

https://gerrit.wikimedia.org/r/1049137

Change #1049503 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Switch old irc hosts to insetup::buster

https://gerrit.wikimedia.org/r/1049503

Change #1049503 merged by Muehlenhoff:

[operations/puppet@production] Switch old irc hosts to insetup::buster

https://gerrit.wikimedia.org/r/1049503

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: irc2001.wikimedia.org

  • irc2001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1050260 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove irc1001/irc2001 from site.pp

https://gerrit.wikimedia.org/r/1050260

Change #1050261 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/mediawiki-config@master] Remove irc1001/irc2001 from mediawiki-config

https://gerrit.wikimedia.org/r/1050261

Change #1050260 merged by Muehlenhoff:

[operations/puppet@production] Remove irc1001/irc2001 from site.pp

https://gerrit.wikimedia.org/r/1050260

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: irc1001.wikimedia.org

  • irc1001.wikimedia.org (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

MoritzMuehlenhoff claimed this task.

The old nodes have been decommissioned, all done.