The current ping hosts (ping1003, ping2003) have fairly small disks, which led to issues in the past. Instead of reimaging them in place, I'll create new VMs with more disk space alongside them.
Event Timeline
Change #1039199 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Add new ping servers to site.pp
Change #1039199 merged by Muehlenhoff:
[operations/puppet@production] Add new ping servers to site.pp
Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping2004.codfw.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping2004.codfw.wmnet with OS bookworm completed:
- ping2004 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406051236_jmm_2441486_ping2004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm executed with errors:
- ping1004 (FAIL)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" ping1004.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
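The first ping1004 attempt stops right after "Host rebooted via gnt-instance", while the retry lists the full sequence. The cookbook appears to run its steps in order and stop at the first one that does not succeed. Below is a minimal Python sketch of that fail-fast pattern. It is not the Spicerack code: the step labels are copied from the log, while `run_steps`, `ok` and `never_up` are hypothetical stand-ins, and which step fails is simulated (presumably the installer step, since that is the next one logged in the successful run).

```python
# Hypothetical sketch of an ordered, fail-fast step runner.
# NOT the Spicerack reimage cookbook: step labels are copied from the log,
# and ok()/never_up() are made-up stand-ins that simulate success or failure.
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_steps(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failure and report it."""
    log: List[str] = []
    for label, action in steps:
        if not action():
            log.append(f"Failed at: {label}")
            return False, log  # later steps are never attempted
        log.append(label)
    return True, log


def ok() -> bool:
    return True


def never_up() -> bool:
    # Simulated failure; the real cause would be in the cookbook logs.
    return False


if __name__ == "__main__":
    steps: List[Step] = [
        ("Removed from Debmonitor if present", ok),
        ("Forced PXE for next reboot", ok),
        ("Host rebooted via gnt-instance", ok),
        ("Host up (Debian installer)", never_up),
        ("Set boot media to disk", ok),
    ]
    passed, log = run_steps(steps)
    print(f"ping1004 ({'PASS' if passed else 'FAIL'})")
    for line in log:
        print(f"- {line}")
```

Stopping at the first failure keeps the host in a known state: nothing after the failed step (boot media, Puppet, Icinga) is touched, which is why a plain retry of the whole cookbook works.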
Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm
Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm completed:
- ping1004 (PASS)
- Removed from Puppet and PuppetDB if present and deleted any certificates
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via gnt-instance
- Host up (Debian installer)
- Add puppet_version metadata to Debian installer
- Set boot media to disk
- Host up (new fresh bookworm OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga/Alertmanager
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202406100821_jmm_788982_ping1004.out
- configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Change #1041030 merged by Muehlenhoff:
[operations/homer/public@master] Change ping host in codfw to ping2004
The routers in codfw have been reconfigured to use ping2004 (confirmed with tcpdump) instead of ping2003.
Change #1041687 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/homer/public@master] Change ping host in codfw to ping1004
Change #1041687 merged by Muehlenhoff:
[operations/homer/public@master] Change ping host in codfw to ping1004
The routers in eqiad have been reconfigured to use ping1004 (confirmed with tcpdump) instead of ping1003. I'll decom the old nodes on Friday.
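Confirming the switch with tcpdump amounts to capturing ICMP on the new host and checking that echo requests arrive there. A minimal Python sketch of summarising such a capture follows, assuming the usual IPv4 text output of `tcpdump -n icmp`. The helper name `count_echo_requests` is hypothetical and the sample addresses are documentation ranges, not real Wikimedia hosts.

```python
# Hypothetical helper: count ICMP echo requests in `tcpdump -n icmp` text output.
# Assumes the common IPv4 line format; sample addresses are documentation ranges.
import re
from collections import Counter
from typing import Iterable

# Matches e.g. "12:34:56.789012 IP 198.51.100.7 > 192.0.2.10: ICMP echo request, id 1, seq 1, length 64"
ECHO_REQUEST = re.compile(
    r"\sIP\s(?P<src>\d{1,3}(?:\.\d{1,3}){3})\s>\s"
    r"(?P<dst>\d{1,3}(?:\.\d{1,3}){3}):\sICMP echo request"
)


def count_echo_requests(lines: Iterable[str]) -> Counter:
    """Count ICMP echo requests per (source, destination) pair; replies are ignored."""
    counts: Counter = Counter()
    for line in lines:
        match = ECHO_REQUEST.search(line)
        if match:
            counts[(match.group("src"), match.group("dst"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "12:34:56.789012 IP 198.51.100.7 > 192.0.2.10: ICMP echo request, id 1, seq 1, length 64",
        "12:34:56.789100 IP 192.0.2.10 > 198.51.100.7: ICMP echo reply, id 1, seq 1, length 64",
        "12:34:57.789012 IP 198.51.100.7 > 192.0.2.10: ICMP echo request, id 1, seq 2, length 64",
    ]
    for (src, dst), n in count_echo_requests(sample).items():
        print(f"{src} -> {dst}: {n} echo request(s)")
```

Seeing a steady stream of requests on the new host, and none on the old one, is the signal that the router change took effect.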
cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping2003.codfw.wmnet
- ping2003.codfw.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster codfw to Netbox
Change #1043597 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Remove old ping hosts from site.pp
Change #1043597 merged by Muehlenhoff:
[operations/puppet@production] Remove old ping hosts from site.pp
cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: ping1003.eqiad.wmnet
- ping1003.eqiad.wmnet (PASS)
- Downtimed host on Icinga/Alertmanager
- Found Ganeti VM
- VM shutdown
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox
- Removed from DebMonitor
- Removed from Puppet master and PuppetDB
- VM removed
- Started forced sync of VMs in Ganeti cluster eqiad to Netbox