Move the ping* servers to Bookworm
Closed, Resolved, Public

Description

They currently have fairly small disks, which has led to issues in the past. Instead of reimaging them in place, I'll create new VMs with more disk space alongside the existing ones.

Event Timeline

Change #1039199 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Add new ping servers to site.pp

https://gerrit.wikimedia.org/r/1039199

Change #1039199 merged by Muehlenhoff:

[operations/puppet@production] Add new ping servers to site.pp

https://gerrit.wikimedia.org/r/1039199

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping2004.codfw.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping2004.codfw.wmnet with OS bookworm completed:

  • ping2004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406051236_jmm_2441486_ping2004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm executed with errors:

  • ping1004 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console ping1004.eqiad.wmnet" to get a root shell, but depending on the failure this may not work.
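The cookbook summaries in this task share a simple layout: one `•` line per host with PASS or FAIL, followed by indented step lines. When comparing runs such as the failed and the later successful ping1004 reimage, a small parser can extract the host, the status and the last step reached. This is a hypothetical helper written against the summary text as it appears here, not part of spicerack or the cookbooks; the indentation rule (at most two spaces before the host line, three or more before step lines) is an assumption based on these logs.

```python
import re
from dataclasses import dataclass, field

# Hypothetical helper: parses the plain-text summary printed at the end of a
# reimage/decommission cookbook run, as it appears in this task.
# Assumption: the host line is indented by at most two spaces and each step
# line by three or more.
HEADER_RE = re.compile(r"^\s{0,2}•\s+(?P<host>\S+)\s+\((?P<status>PASS|FAIL)\)\s*$")
STEP_RE = re.compile(r"^\s{3,}•\s+(?P<step>.+?)\s*$")


@dataclass
class CookbookResult:
    host: str
    status: str  # "PASS" or "FAIL", copied from the host line
    steps: list[str] = field(default_factory=list)

    @property
    def last_step(self) -> str:
        """The last step the cookbook reported for this host (empty if none)."""
        return self.steps[-1] if self.steps else ""


def parse_summary(text: str) -> list[CookbookResult]:
    """Split a cookbook summary into one CookbookResult per host line."""
    results: list[CookbookResult] = []
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            results.append(CookbookResult(header["host"], header["status"]))
            continue
        step = STEP_RE.match(line)
        if step and results:  # ignore step lines that precede any host line
            results[-1].steps.append(step["step"])
    return results


if __name__ == "__main__":
    failed_run = (
        "  • ping1004 (FAIL)\n"
        "    • Forced PXE for next reboot\n"
        "    • Host rebooted via gnt-instance\n"
    )
    for result in parse_summary(failed_run):
        # prints: ping1004 FAIL Host rebooted via gnt-instance
        print(result.host, result.status, result.last_step)
```

For the failed run, the last step shows the host was rebooted into PXE but never reported "Host up (Debian installer)", which narrows down where to look in the cookbook logs.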

Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm

Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm completed:

  • ping1004 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via gnt-instance
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Set boot media to disk
    • Host up (new fresh bookworm OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406100821_jmm_788982_ping1004.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1041030 merged by Muehlenhoff:

[operations/homer/public@master] Change ping host in codfw to ping2004

https://gerrit.wikimedia.org/r/1041030

The routers in codfw have been reconfigured to use ping2004 instead of ping2003 (confirmed with tcpdump).
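A tcpdump confirmation like this can also be summarised by a script. The sketch below assumes that the traffic the routers send to a ping host is ICMP echo requests (as the host names suggest) and that it was captured on the new host with something like `tcpdump -n -c 20 icmp`, where `-n` disables name resolution and `-c` stops after that many packets. The function name and the documentation-range addresses in the comments are illustrative, not taken from the task.

```python
import re
from collections import Counter

# Matches ICMP echo requests in `tcpdump -n` text output, for example
#   IP 192.0.2.1 > 198.51.100.7: ICMP echo request, id 1, seq 1, length 64
#   IP6 2001:db8::1 > 2001:db8::2: ICMP6, echo request, id 7, seq 1, length 64
# Echo replies and other ICMP types are deliberately not matched.
ECHO_REQUEST_RE = re.compile(
    r"\bIP6?\s+(?P<src>[0-9A-Fa-f.:]+)\s+>\s+(?P<dst>[0-9A-Fa-f.:]+):\s+ICMP6?,?\s+echo request"
)


def count_echo_requests(tcpdump_output: str) -> Counter:
    """Count ICMP echo requests per (source, destination) address pair."""
    counts: Counter = Counter()
    for line in tcpdump_output.splitlines():
        match = ECHO_REQUEST_RE.search(line)
        if match:
            counts[(match["src"], match["dst"])] += 1
    return counts
```

Feeding it the text of a capture taken on the new host and seeing non-zero counts indicates that the redirected requests now reach that host; an empty result on the new host after the router change would suggest the switch has not taken effect.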

Change #1041687 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/homer/public@master] Change ping host in codfw to ping1004

https://gerrit.wikimedia.org/r/1041687

Change #1041687 merged by Muehlenhoff:

[operations/homer/public@master] Change ping host in codfw to ping1004

https://gerrit.wikimedia.org/r/1041687

The routers in eqiad have been reconfigured to use ping1004 instead of ping1003 (confirmed with tcpdump). I'll decommission the old nodes on Friday.
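The datacenter of each ping host can be read off its name: in this task, eqiad hosts are numbered 1xxx (ping1003, ping1004) and codfw hosts 2xxx (ping2003, ping2004). Before pointing a site's routers at a new host, a small consistency check can confirm that the old and new hosts belong to the same site. This is a hypothetical sketch covering only the two sites in this task; it is not part of Homer or the cookbooks.

```python
import re

# Assumed naming convention: the first digit of the host number encodes the
# datacenter. Only the two sites that appear in this task are listed.
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}

HOST_RE = re.compile(r"^(?P<prefix>[a-z-]+)(?P<num>\d{4})(?:\.(?P<site>[a-z]+)\.wmnet)?$")


def site_for_host(hostname: str) -> str:
    """Return the datacenter implied by the host number, e.g. ping1004 -> eqiad."""
    m = HOST_RE.match(hostname)
    if not m:
        raise ValueError(f"unrecognised host name: {hostname!r}")
    site = SITE_BY_DIGIT.get(m["num"][0])
    if site is None:
        raise ValueError(f"no site mapping for {hostname!r}")
    # If a full name was given, its site label must agree with the number.
    if m["site"] and m["site"] != site:
        raise ValueError(f"{hostname!r}: FQDN says {m['site']}, number says {site}")
    return site


def check_replacement(old: str, new: str) -> None:
    """Raise ValueError if the old and new hosts are not in the same datacenter."""
    if site_for_host(old) != site_for_host(new):
        raise ValueError(f"{old} and {new} are in different datacenters")
```

For example, `check_replacement("ping1003.eqiad.wmnet", "ping1004.eqiad.wmnet")` returns without error, while pairing ping2003 with ping1004 raises `ValueError`.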

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping2003.codfw.wmnet

  • ping2003.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

Change #1043597 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] Remove old ping hosts from site.pp

https://gerrit.wikimedia.org/r/1043597

Change #1043597 merged by Muehlenhoff:

[operations/puppet@production] Remove old ping hosts from site.pp

https://gerrit.wikimedia.org/r/1043597

cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping1003.eqiad.wmnet

  • ping1003.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

The old ping servers have been decommissioned, closing.