MoritzMuehlenhoff (Moritz Mühlenhoff)
User

User Details

User Since
Apr 1 2015, 4:33 PM (483 w, 3 d)
Availability
Busy until Aug 31.
LDAP User
Moritz Mühlenhoff
MediaWiki User
MMuhlenhoff (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 28

MoritzMuehlenhoff added a comment to T364416: Q4:rack/setup/install deploy1003.

Let's directly install this server with Puppet 7, there should be no issues in the deployment-server manifests in terms of Puppet 5/7 compat at this point.

Fri, Jun 28, 10:34 AM · SRE, serviceops, ops-eqiad, DC-Ops

Thu, Jun 27

MoritzMuehlenhoff updated the task description for T368288: Integrate Bullseye 11.10 point update.
Thu, Jun 27, 9:29 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T343377: Grant slightly broader access to Klaxon.

And FYI, https://gerrit.wikimedia.org/g/operations/software/bitu-ldap is a wrapper for simplifying LDAP operations within Wikimedia (originally written for Bitu, but other Python code also uses it). Should be helpful for writing the dump script.

Thu, Jun 27, 7:25 PM · Stewards-Onboarding-Tool, Sustainability (Incident Followup), Incident Tooling, SRE-OnFire, SRE
MoritzMuehlenhoff added a comment to T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy.

This issue is biting us again, the time between a puppet run that fails to start ferm.service and the subsequent run is enough to cause issues with eventgate, leading to flurries of Can't enqueue job errors in mediawiki. It is in particular triggered when we add new kubernetes nodes, because that causes all other kubernetes nodes in the cluster to update their ferm rules.

Would adding Restart=on-failure to ferm.service for kubernetes workers be a possible short-term solution @MoritzMuehlenhoff ?

Thu, Jun 27, 5:11 PM · Patch-For-Review, Infrastructure-Foundations, serviceops, SRE
MoritzMuehlenhoff added a comment to T343377: Grant slightly broader access to Klaxon.

Using ldap-maint1001 has the benefit that it already does r/w changes to the r/w slapd servers. Currently we don't restrict that, but we've been gradually shifting r/o access only to the replicas and I'd like to come to a state where the only r/w changes to our LDAP are coming from Horizon (for cloud VPS access management), ldap-maint and Bitu and then all other hosts in production get access denied via firewall rules.

Thu, Jun 27, 3:35 PM · Stewards-Onboarding-Tool, Sustainability (Incident Followup), Incident Tooling, SRE-OnFire, SRE
MoritzMuehlenhoff added a comment to T343377: Grant slightly broader access to Klaxon.

One thing that we could do is to

Thu, Jun 27, 3:32 PM · Stewards-Onboarding-Tool, Sustainability (Incident Followup), Incident Tooling, SRE-OnFire, SRE
MoritzMuehlenhoff added a comment to T352245: Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI.

Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.

The TLS proxy will go away with the v3 migration, since its primary use case will be absorbed into etcd itself (role-based access control). Thus, I think the main question is whether the effort is worth it.

Thu, Jun 27, 9:22 AM · serviceops
MoritzMuehlenhoff claimed T355663: Allocate more available UNIX UIDs for human users.

I'll take care of this when I'm back from sabbatical

Thu, Jun 27, 9:08 AM · User-MoritzMuehlenhoff, Bitu, Infrastructure-Foundations, cloud-services-team, LDAP
MoritzMuehlenhoff added a project to T345070: Attach opencontainers image metadata to docker images: User-MoritzMuehlenhoff.
Thu, Jun 27, 9:07 AM · User-MoritzMuehlenhoff, User-Elukey, Release-Engineering-Team, serviceops, docker-pkg
MoritzMuehlenhoff assigned T368597: Decommission ganeti1019 to Jclark-ctr.
Thu, Jun 27, 9:04 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff updated the task description for T368597: Decommission ganeti1019.
Thu, Jun 27, 9:04 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff triaged T368597: Decommission ganeti1019 as Medium priority.
Thu, Jun 27, 8:56 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff created T368597: Decommission ganeti1019.
Thu, Jun 27, 8:46 AM · DC-Ops, ops-eqiad, Infrastructure-Foundations, SRE, decommission-hardware
MoritzMuehlenhoff closed T331702: Migrate mw_rc_irc servers to Bullseye as Resolved.

The old nodes have been decommissioned, all done.

Thu, Jun 27, 8:39 AM · Patch-For-Review, Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
MoritzMuehlenhoff closed T331702: Migrate mw_rc_irc servers to Bullseye, a subtask of T291916: Tracking task for Bullseye migrations in production, as Resolved.
Thu, Jun 27, 8:38 AM · User-Elukey, Epic, Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T368503: Update CAS to 6.6.15.2 as Resolved.
Thu, Jun 27, 7:44 AM · CAS-SSO, Infrastructure-Foundations
MoritzMuehlenhoff triaged T368503: Update CAS to 6.6.15.2 as High priority.
Thu, Jun 27, 7:39 AM · CAS-SSO, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Thu, Jun 27, 7:32 AM · Patch-For-Review, Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T332016: Migrate docker registry hosts to bookworm.

Why bullseye, this should be bookworm? docker-registry is packaged in Debian, so we can simply use bookworm and use the package from it. In fact, we are already using the bookworm package on the existing registry hosts (2.8.2+ds1-1)

Thu, Jun 27, 7:24 AM · serviceops

Wed, Jun 26

MoritzMuehlenhoff updated the task description for T368288: Integrate Bullseye 11.10 point update.
Wed, Jun 26, 2:57 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff renamed T367981: Update Proton to include Chromium 126.0.6478.126 from Update Proton to include Chromium 126.0.6478.114 to Update Proton to include Chromium 126.0.6478.126.
Wed, Jun 26, 11:21 AM · Content-Transform-Team-WIP, Essential-Work, Proton
MoritzMuehlenhoff added a comment to T367981: Update Proton to include Chromium 126.0.6478.126.

New release:
https://lists.debian.org/debian-security-announce/2024/msg00131.html

Wed, Jun 26, 11:20 AM · Content-Transform-Team-WIP, Essential-Work, Proton

Tue, Jun 25

MoritzMuehlenhoff added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

Indeed, CGO_ENABLED=0 rings a bell.

Tue, Jun 25, 3:49 PM · Patch-For-Review, SRE Observability, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T365165: Q4:rack/setup/install krb1002.
Tue, Jun 25, 2:39 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
MoritzMuehlenhoff added a comment to T365165: Q4:rack/setup/install krb1002.

@MoritzMuehlenhoff would you be able to update site.pp file for this server?

Tue, Jun 25, 2:39 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops
MoritzMuehlenhoff updated the task description for T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Tue, Jun 25, 1:23 PM · Patch-For-Review, Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T368088: upgrade prometheus-ipmi-exporter to 1.8.0.

The dependency is added because some feature in the compiled Go code uses syscalls which were only wired up in 2.34 (maybe openat() et al). We ran into this problem before and there was a Go build flag to force it to use a fallback. I can't find a reference currently, but maybe Filippo remembers when he's back.

Tue, Jun 25, 10:56 AM · Patch-For-Review, SRE Observability, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Tue, Jun 25, 10:48 AM · Patch-For-Review, Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T367554: Cloud VPS "sso" project Buster deprecation as Resolved.

The Buster instances have been removed.

Tue, Jun 25, 8:43 AM · Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff updated the task description for T367554: Cloud VPS "sso" project Buster deprecation.
Tue, Jun 25, 8:41 AM · Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff added a comment to T367554: Cloud VPS "sso" project Buster deprecation.

Hi all, I wanted to say that the sso project is used so that users have an SSO testing infrastructure to use in cloud services. Originally this was also used to provide SSO to production-like services in cloud services; however, this latter functionality has been moved.

If there is still a desire to keep a development environment, then we will still need all these machines:

  • puppetprimary.sso.eqiad1.wikimedia.cloud: The project uses its own puppet master as we have secrets
  • sso-db.sso.eqiad1.wikimedia.cloud: This is a mysql db used to store e.g. mfa keys
  • sso-pdb.sso.eqiad1.wikimedia.cloud: A puppet db instance, not sure why this is used but guessing the idp classes somehow need some puppetdb functionality.
Tue, Jun 25, 8:26 AM · Cloud-VPS (Debian Buster Deprecation)

Mon, Jun 24

MoritzMuehlenhoff triaged T368288: Integrate Bullseye 11.10 point update as Medium priority.
Mon, Jun 24, 7:11 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T368288: Integrate Bullseye 11.10 point update.
Mon, Jun 24, 3:39 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T367757: Request to add mnz to analytics-research-admins.

Then the next steps will be creating an SSH key and signing the access agreement. Here are the details:

Mon, Jun 24, 3:36 PM · Patch-For-Review, SRE, SRE-Access-Requests
MoritzMuehlenhoff added a comment to T367757: Request to add mnz to analytics-research-admins.

@KFrancis can you please make sure @MunizaA's NDA is signed? Thank you!

Mon, Jun 24, 3:02 PM · Patch-For-Review, SRE, SRE-Access-Requests
MoritzMuehlenhoff added a project to T368088: upgrade prometheus-ipmi-exporter to 1.8.0: SRE Observability.
Mon, Jun 24, 2:30 PM · Patch-For-Review, SRE Observability, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff triaged T368088: upgrade prometheus-ipmi-exporter to 1.8.0 as Medium priority.

Given that this is a Go static ELF we can also simply build on bookworm and copy over the deb to bullseye-wikimedia; we're doing this for other exporters as well. buster might be tricky due to its old libc6, but we can also ignore it: there are fewer than 150 hosts left and they can simply live with the old IPMI monitoring.

Mon, Jun 24, 2:30 PM · Patch-For-Review, SRE Observability, Packaging, Infrastructure-Foundations
MoritzMuehlenhoff triaged T368023: Move the private Puppet repository to puppetserver1001 as High priority.
Mon, Jun 24, 2:11 PM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff triaged T367861: Migrate ldap-ro and ldap-ro-ssl to IPIP encapsulation as Medium priority.
Mon, Jun 24, 1:03 PM · Infrastructure-Foundations, Traffic
MoritzMuehlenhoff triaged T367487: Update CAS to 7.0 as Medium priority.
Mon, Jun 24, 1:03 PM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T350567: Migrate Cassandra to Java 11.

Very nice!

Mon, Jun 24, 11:53 AM · Cassandra, Data-Persistence, SRE
MoritzMuehlenhoff updated the task description for T273950: Modernise memcached systemd unit / sync, and make it presentable.
Mon, Jun 24, 11:25 AM · Cloud-Services, serviceops, User-jijiki, SRE
MoritzMuehlenhoff added a comment to T273950: Modernise memcached systemd unit / sync, and make it presentable.

CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be needed anymore for the idp servers, I'll tick them off.

Mon, Jun 24, 11:25 AM · Cloud-Services, serviceops, User-jijiki, SRE

Thu, Jun 20

MoritzMuehlenhoff added a project to T310087: Advance declaration of query parameters: User-MoritzMuehlenhoff.
Thu, Jun 20, 2:05 PM · User-MoritzMuehlenhoff, SRE, Traffic, MediaWiki-General
MoritzMuehlenhoff added a comment to T367399: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one.

Did one of these changes possibly break PCC here?
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/3739/console

Thu, Jun 20, 12:50 PM · Patch-For-Review, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T368023: Move the private Puppet repository to puppetserver1001.

And prior to the migration, puppetserver1001 needs to be allowed in profile::tcpircbot

Thu, Jun 20, 11:54 AM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Thu, Jun 20, 11:04 AM · Patch-For-Review, Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T352647: Move Cassandra clusters to PKI.

The task can be closed, or is there anything still open?

Thu, Jun 20, 10:04 AM · Patch-For-Review, Data-Persistence, Cassandra
MoritzMuehlenhoff added a comment to T368023: Move the private Puppet repository to puppetserver1001.

After the change gets made, kerberos_kadmin_keytabs_repo in hieradata/common.yaml needs to be adapted to point to puppetserver1001

Thu, Jun 20, 9:43 AM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff created T368023: Move the private Puppet repository to puppetserver1001.
Thu, Jun 20, 8:47 AM · Patch-For-Review, User-Elukey, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

I've pointed the irc.w.o CNAME to irc1002 and rebooted irc1001 to force a failover to 1002. All bots which were connected to #de.wikipedia reconnected to the new node.

Thu, Jun 20, 8:15 AM · Patch-For-Review, Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
MoritzMuehlenhoff added a comment to T367981: Update Proton to include Chromium 126.0.6478.126.

New release:
https://lists.debian.org/debian-security-announce/2024/msg00126.html

Thu, Jun 20, 6:39 AM · Content-Transform-Team-WIP, Essential-Work, Proton
MoritzMuehlenhoff renamed T367981: Update Proton to include Chromium 126.0.6478.126 from Update Proton to include Chromium 126.0.6478.56-1 to Update Proton to include Chromium 126.0.6478.114.
Thu, Jun 20, 6:39 AM · Content-Transform-Team-WIP, Essential-Work, Proton

Wed, Jun 19

MoritzMuehlenhoff created T367981: Update Proton to include Chromium 126.0.6478.126.
Wed, Jun 19, 2:20 PM · Content-Transform-Team-WIP, Essential-Work, Proton
MoritzMuehlenhoff triaged T367970: Update pxelinux in tftpboot environment as Medium priority.
Wed, Jun 19, 12:44 PM · User-Elukey, SRE, Infrastructure-Foundations
MoritzMuehlenhoff created T367970: Update pxelinux in tftpboot environment.
Wed, Jun 19, 12:44 PM · User-Elukey, SRE, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T365799: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7.
Wed, Jun 19, 8:36 AM · Patch-For-Review, Traffic, Acme-chief, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001.

However, when joining the #en.wikipedia channel on irc1001 (current) and irc1002 (new) and comparing the events, there's less output for the latter. So there seems to be some issue with broadcasted mediawiki events (or the way these are ingested on the new installations)

Wed, Jun 19, 7:57 AM · Patch-For-Review, Wikimedia-IRC-RC-Server, SRE-Unowned, SRE

Tue, Jun 18

Dzahn awarded T331706: Migrate Mailman/lists to Bullseye/Bookworm an Orange Medal token.
Tue, Jun 18, 3:29 PM · Patch-For-Review, collaboration-services, Wikimedia-Mailing-lists, SRE
MoritzMuehlenhoff updated the task description for T367544: Cloud VPS "packaging" project Buster deprecation.
Tue, Jun 18, 11:17 AM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff added a comment to T331702: Migrate mw_rc_irc servers to Bullseye.

Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001.

Tue, Jun 18, 10:16 AM · Patch-For-Review, Wikimedia-IRC-RC-Server, SRE-Unowned, SRE
MoritzMuehlenhoff added a comment to T348730: DRBD kernel error on ganeti2031 led to kernel hang.

Happened once more on ganeti2029 today. We're gradually moving nodes to Bookworm (the routed cluster and magru cluster are already running it and the next refreshes in codfw/eqiad will also immediately be added with Bookworm), so hopefully the more recent kernel/DRBD addresses this bug.

Tue, Jun 18, 9:41 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T367071: ganeti1019 is down.

@MoritzMuehlenhoff after replacing the failed drive it looked like it might boot, but it still fails. It might need to be reimaged. I do not have root access, so I am unable to proceed past this:

You are in emergency mode. After logging in, type "journalctl -xGive root passwe
(or press Control-D to continue):

Tue, Jun 18, 7:10 AM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti

Mon, Jun 17

MoritzMuehlenhoff added a comment to T366766: Phabricator profile "LDAP User" link goes to Wikitech, but users no longer need to have a Wikitech account.

What to do? Remove the linking from the "LDAP User" entry on a Phab user profile and make that plain text instead?

Mon, Jun 17, 3:13 PM · Phabricator (2024-06-25), Wikimedia-Phabricator-Extensions, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T367119: Install a default timeout for systemd::timer::jobs.

One other option: Add a separate wrapper define systemd::timer::job_capped which has the timeout as a mandatory argument (but without a default). And then reach out to SRE teams to migrate the jobs based on what the respective use cases need.

Mon, Jun 17, 2:57 PM · Infrastructure-Foundations, Puppet
MoritzMuehlenhoff triaged T367287: Update Wikitech's LDAP credentials to be read-only as Medium priority.
Mon, Jun 17, 2:06 PM · Patch-For-Review, Infrastructure-Foundations, cloud-services-team, LDAP, wikitech.wikimedia.org
MoritzMuehlenhoff added a comment to T365074: Requesting access to cassandra-staging-devs for milimetric.

@Milimetric also, you provided an SSH key fingerprint, we need the public key. It should start with something like ssh-rsa AAAA...

Mon, Jun 17, 12:38 PM · SRE, SRE-Access-Requests
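
The distinction made in the T365074 comment above (a key fingerprint versus the public key itself) can be checked mechanically. The following is only a minimal sketch, not part of the access request process: the function name looks_like_openssh_public_key is hypothetical, and the check only assumes the usual OpenSSH one-line format "<key-type> <base64-blob> [comment]", where the decoded blob starts with a length-prefixed copy of the key type. That prefix is also why such lines start with something like "ssh-rsa AAAA".

```python
import base64
import binascii
import struct


def looks_like_openssh_public_key(text: str) -> bool:
    """Return True if text has the shape '<key-type> <base64-blob> [comment]'.

    Hypothetical helper for illustration only; it checks the format and
    does not validate the key material itself.
    """
    parts = text.strip().split()
    if len(parts) < 2:
        # A bare fingerprint such as 'SHA256:...' is a single token.
        return False
    key_type, blob_b64 = parts[0], parts[1]
    try:
        blob = base64.b64decode(blob_b64, validate=True)
    except (binascii.Error, ValueError):
        return False
    if len(blob) < 4:
        return False
    # The blob begins with a 4-byte big-endian length, then the key type name.
    (name_len,) = struct.unpack(">I", blob[:4])
    return blob[4:4 + name_len] == key_type.encode()


# Synthetic example data, not a real key.
fake_blob = struct.pack(">I", 7) + b"ssh-rsa" + b"\x00" * 16
fake_line = "ssh-rsa " + base64.b64encode(fake_blob).decode() + " user@example"
```

With this sketch, fake_line is accepted, while a fingerprint string (a single "SHA256:..." token, or the "2048 SHA256:... comment (RSA)" form) is rejected because it does not contain a valid base64 blob.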
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Mon, Jun 17, 12:06 PM · Patch-For-Review, Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T367544: Cloud VPS "packaging" project Buster deprecation.

packager02.packaging.eqiad1.wikimedia.cloud

According to the etherpad upgrade docs this host is used to build the etherpad Debian package. I also used the host in the past to build the etherpad package. The dedicated host is used because "etherpad builds fetches npm modules during the build time".

Mon, Jun 17, 10:11 AM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff edited projects for T367544: Cloud VPS "packaging" project Buster deprecation, added: collaboration-services; removed Infrastructure-Foundations.
Mon, Jun 17, 10:09 AM · collaboration-services, Cloud-VPS (Debian Buster Deprecation)
MoritzMuehlenhoff added a comment to T367487: Update CAS to 7.0.

I've run a test build, Java 21 is a hard requirement, it cannot be older or newer.
Otherwise the overlay upgrade contains only minor changes. I have not tested the functionality.

Mon, Jun 17, 9:31 AM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T367490: Split out ldap management from mwmaint as Resolved.

The LDAP management parts have been split off to the new ldap-maint1001/ldap-maint2001 hosts.

Mon, Jun 17, 8:50 AM · LDAP, Infrastructure-Foundations, SRE

Fri, Jun 14

MoritzMuehlenhoff added a comment to T367487: Update CAS to 7.0.

I'll look into a Java 21 backport for Bookworm.

Fri, Jun 14, 12:14 PM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T352245: Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI.

It seems there are two parts to this migration: The etcd internal cert which will move to PKI along with the v3 migration, but there's also the cert used for the TLS termination. For the latter John made a patch back in 2022 to move it to the discovery PKI cert (https://gerrit.wikimedia.org/r/c/operations/puppet/+/790657) and it seems to me that part could already be addressed independently by rebasing/adapting/merging John's patch? (Given this is a core service one could also adapt the patch and introduce an additional Hiera flag that allows initially moving only one of the conf* nodes instead of the full cluster)

Fri, Jun 14, 11:27 AM · serviceops
MoritzMuehlenhoff closed T366695: Move the ping* servers to Bookworm as Resolved.

The old ping servers have been decommed, closing.

Fri, Jun 14, 7:59 AM · Patch-For-Review, Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T367490: Split out ldap management from mwmaint.
Fri, Jun 14, 7:51 AM · LDAP, Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T367487: Update CAS to 7.0.
Fri, Jun 14, 6:40 AM · Patch-For-Review, CAS-SSO, Infrastructure-Foundations, SRE

Thu, Jun 13

MoritzMuehlenhoff created T367399: Default to the Puppet 7 PCC CI test, make it voting and eventually remove the Puppet 5 one.
Thu, Jun 13, 11:51 AM · Patch-For-Review, Puppet-Infrastructure, SRE, Infrastructure-Foundations
MoritzMuehlenhoff updated the task description for T360636: Phase out cergen for ServiceOps services.
Thu, Jun 13, 10:32 AM · Patch-For-Review, serviceops, Epic, SRE
MoritzMuehlenhoff updated the task description for T360636: Phase out cergen for ServiceOps services.
Thu, Jun 13, 10:32 AM · Patch-For-Review, serviceops, Epic, SRE
MoritzMuehlenhoff updated the task description for T357750: Phase out cergen.
Thu, Jun 13, 10:31 AM · Patch-For-Review, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T356174: Connection errors to some hosts from cumin1002 as Resolved.

We can close this, the new established procedure is that all servers which get moved from ferm to nftables are rebooted.

Thu, Jun 13, 7:36 AM · Patch-For-Review, Infrastructure-Foundations, SRE

Wed, Jun 12

MoritzMuehlenhoff added a comment to T366695: Move the ping* servers to Bookworm.

The routers in eqiad have been reconfigured to use ping1004 (confirmed with tcpdump) instead of ping1003. I'll decom the old nodes on Friday.

Wed, Jun 12, 3:08 PM · Patch-For-Review, Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T350567: Migrate Cassandra to Java 11.

Can you please uninstall openjdk-8-* on the migrated clusters? (simply run dpkg --remove openjdk-8-jdk openjdk-8-jre openjdk-8-jdk-headless openjdk-8-jre-headless via Cumin). This avoids confusion the next time we roll out new Java security updates.

Wed, Jun 12, 1:18 PM · Cassandra, Data-Persistence, SRE
MoritzMuehlenhoff added a comment to T367071: ganeti1019 is down.

@MoritzMuehlenhoff after replacing the failed drive it looked like it might boot, but it still fails. It might need to be reimaged. I do not have root access, so I am unable to proceed past this.

Wed, Jun 12, 6:53 AM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti

Tue, Jun 11

MoritzMuehlenhoff added a comment to T367071: ganeti1019 is down.

@MoritzMuehlenhoff Can I take the server down to replace the DIMM?

Tue, Jun 11, 6:26 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
MoritzMuehlenhoff added a comment to T366695: Move the ping* servers to Bookworm.

The routers in codfw have been reconfigured to use ping2004 (confirmed with tcpdump) instead of ping2003.

Tue, Jun 11, 2:56 PM · Patch-For-Review, Infrastructure-Foundations, SRE

Mon, Jun 10

MoritzMuehlenhoff added a project to T367071: ganeti1019 is down: ops-eqiad.
Mon, Jun 10, 6:02 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
MoritzMuehlenhoff assigned T367071: ganeti1019 is down to Jclark-ctr.

All VMs moved off the server. DC ops, can you please have a look? Not sure what "unsupported event" means, never seen that before? Server can be powered off for analysis any time.

Mon, Jun 10, 6:02 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
MoritzMuehlenhoff triaged T367071: ganeti1019 is down as Medium priority.
Mon, Jun 10, 3:24 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
MoritzMuehlenhoff created T367071: ganeti1019 is down.
Mon, Jun 10, 3:24 PM · DC-Ops, ops-eqiad, SRE, Infrastructure-Foundations, Ganeti
MoritzMuehlenhoff added a comment to T366797: Add option to exclude nodes from reboot by uptime or last reboot date.

I think we should rather base this on a given kernel version? Seems more robust than a given date.

Mon, Jun 10, 2:09 PM · User-Elukey, Infrastructure-Foundations, SRE-tools
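
The suggestion in the T366797 comment above (select hosts by kernel version rather than by date) could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual reboot tooling: the function names, the host names and the kernel strings are all hypothetical, and the parser only assumes Debian-style `uname -r` output such as "5.10.0-30-amd64".

```python
import re


def kernel_version_key(release: str) -> tuple[int, ...]:
    """Turn a `uname -r` style string into a sortable tuple of integers.

    Assumes the form 'X.Y.Z-ABI-flavour'; a missing ABI number counts as 0.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:-(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(group or 0) for group in match.groups())


def hosts_to_reboot(running_kernels: dict[str, str], target: str) -> list[str]:
    """Return the hosts whose running kernel is older than the target kernel."""
    target_key = kernel_version_key(target)
    return sorted(
        host
        for host, release in running_kernels.items()
        if kernel_version_key(release) < target_key
    )


# Hypothetical inventory: host name -> running kernel release.
example_hosts = {
    "host1001": "5.10.0-28-amd64",
    "host1002": "5.10.0-30-amd64",
    "host1003": "6.1.0-21-amd64",
}
pending = hosts_to_reboot(example_hosts, "5.10.0-30-amd64")
```

Comparing against a target kernel means a host that was already rebooted into the new kernel is skipped regardless of when that reboot happened, which is the robustness argument made in the comment.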
MoritzMuehlenhoff closed T366465: Extend puppet ipresolve() to support SRV records as Declined.

Given an alternative solution was found for etcd, closing this one (after checking with Janis)

Mon, Jun 10, 2:07 PM · Puppet-Core, serviceops, Infrastructure-Foundations

Fri, Jun 7

MoritzMuehlenhoff triaged T366900: Test Puppet 8 readiness as Medium priority.
Fri, Jun 7, 12:13 PM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
MoritzMuehlenhoff created T366900: Test Puppet 8 readiness.
Fri, Jun 7, 12:12 PM · Infrastructure-Foundations, SRE, Puppet-Infrastructure
MoritzMuehlenhoff closed T343515: Migrate bastions to Bookworm as Resolved.

All bastions are on bookworm now

Fri, Jun 7, 9:39 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated the task description for T343515: Migrate bastions to Bookworm.
Fri, Jun 7, 9:38 AM · Infrastructure-Foundations, SRE

Jun 5 2024

MoritzMuehlenhoff updated the task description for T273950: Modernise memcached systemd unit / sync, and make it presentable.
Jun 5 2024, 2:33 PM · Cloud-Services, serviceops, User-jijiki, SRE
MoritzMuehlenhoff updated the task description for T349619: Migrate roles to puppet7.
Jun 5 2024, 12:12 PM · Patch-For-Review, Data-Platform-SRE (2024.06.17 - 2024.07.07), serviceops, collaboration-services, SRE-tools, Puppet-Core, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
MoritzMuehlenhoff triaged T366695: Move the ping* servers to Bookworm as Medium priority.
Jun 5 2024, 11:49 AM · Patch-For-Review, Infrastructure-Foundations, SRE