This tag is a catch-all for Puppet tasks that don't align with Puppet-Core, Puppet-Infrastructure, or Puppet CI. It is not specifically assigned to any team and can be used by any team to conveniently tag Puppet-related tasks.
See also:
I got another error at backup2002 (es5):
2024-06-26 17:07:31 [ERROR] - Could not read data from enwiki.blobs_cluster27: Lost connection to MySQL server during query
This doesn't seem to be reproducible; maybe it was related to cold caches after the reboot? Lowering the priority since it hasn't happened again, but I'd like to trace it at some point.
How about adding a MAILTO to the timer and mailing a specific list / team / group? I think alerting via IRC is becoming less reliable, and direct email would be more effective (or even automatic ticket creation).
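To sketch what that could look like: systemd timers have no built-in MAILTO the way cron does, so mail-on-failure would have to be wired up some other way (commonly an OnFailure= unit that sends the mail). The sketch below is purely illustrative — the resource title, command, and especially the email parameter are hypothetical and not the actual systemd::timer::job API:

```puppet
# Hypothetical sketch: route timer failures to a team mailbox instead of IRC.
# The on_failure_email parameter is invented for illustration; in practice
# this would likely be implemented via a systemd OnFailure= unit that mails.
systemd::timer::job { 'example-sync':          # hypothetical job name
    ensure           => present,
    description      => 'Example sync job with mail-on-failure',
    command          => '/usr/local/bin/example-sync',  # hypothetical path
    user             => 'root',
    interval         => { 'start' => 'OnCalendar', 'interval' => '*:0/30' },
    on_failure_email => 'sre-team@example.org',  # hypothetical parameter
}
```

Automatic ticket creation could hang off the same OnFailure= hook by calling a ticket-filing script instead of (or in addition to) sending mail.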
I think the last step to do here is to validate that any rsync failures will get reported on IRC. Then we can consider all the immediate followups of this incident done, and more slowly continue on with the larger work at T367119: Install a default timeout for systemd::timer::jobs.
One other option: add a separate wrapper define, systemd::timer::job_capped, which takes the timeout as a mandatory argument (without a default), and then reach out to SRE teams to migrate the jobs based on what the respective use cases need.
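A minimal sketch of that wrapper idea, assuming the underlying systemd::timer::job define accepts a timeout parameter (the parameter name and types below are illustrative, not the real API):

```puppet
# Hypothetical sketch: a thin wrapper around systemd::timer::job that
# forces callers to state a timeout explicitly, with no default.
define systemd::timer::job_capped (
    String[1] $exec_timeout,       # mandatory on purpose: no default value
    Hash      $job_params = {},    # everything else passes straight through
) {
    systemd::timer::job { $title:
        # exec_timeout's real name in systemd::timer::job is an assumption here
        * => $job_params + { 'exec_timeout' => $exec_timeout },
    }
}
```

Because the timeout has no default, catalog compilation fails for any migrated job that doesn't declare one, which is exactly the forcing function being proposed.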
Alternatives to consider:
Change #1041760 merged by CDanis:
[operations/puppet@production] puppetserver syncs: also add monitoring + timeout
Change #1041760 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] puppetserver syncs: also add monitoring + timeout
Change #1041217 merged by CDanis:
[operations/puppet@production] enable monitoring+logging for puppetmaster syncs
Change #1041217 had a related patch set uploaded (by CDanis; author: CDanis):
[operations/puppet@production] enable monitoring+logging for puppetmaster syncs
Perfect, @Lucas_Werkmeister_WMDE! Glad to have this all cleared up :)
I think we can resolve both.
Moving this to verification given the work in T364965. Thanks for all of this, @Lucas_Werkmeister_WMDE! Maybe we can resolve this and leave T364965 until stat1007 is deprecated, or resolve both?
In T351072#9817102, @AndrewTavis_WMDE wrote:
> So basically removing the wdcm.pp related file on GitHub and its Puppet workflows will close both tasks :)
So basically removing the wdcm.pp related file on GitHub and its Puppet workflows will close both tasks :)
Ah looking at this, I'm realizing I restated myself a bit as the work that's left in T364965: stat1007 to stat1011 migration pipeline output check is a duplicate of what's left to do here :) Each task had some other work that was different, but that work is now done for both.
Hey @Arian_Bozorg 👋 Yes, we do still need to check this out. I was thinking that @Lucas_Werkmeister_WMDE and I could discuss this when we chat about what else is needed in T364965: stat1007 to stat1011 migration pipeline output check. In that one we've now confirmed that the data is coming in from stat1011, so at this point it'd be good to delete statistics/manifests/wmde/wdcm.pp and also remove its workflow from Puppet (I'm just not quite sure whether I have access or how to go about the Puppet work).
Is this still required here? Trying to find a good spot for this task on the board during triage.
Change #1021875 merged by Muehlenhoff:
[operations/puppet@production] Remove obsolete script to detect ever-changing puppet runs
Change #1021875 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):
[operations/puppet@production] Remove obsolete script to detect ever-changing puppet runs