User Details
- User Since
- Apr 1 2015, 4:33 PM (483 w, 3 d)
- Availability
- Busy until Aug 31.
- LDAP User
- Moritz Mühlenhoff
- MediaWiki User
- MMuhlenhoff (WMF)
Fri, Jun 28
Let's directly install this server with Puppet 7, there should be no issues in the deployment-server manifests in terms of Puppet 5/7 compat at this point.
Thu, Jun 27
And FYI, https://gerrit.wikimedia.org/g/operations/software/bitu-ldap is a wrapper for simplifying LDAP operations within Wikimedia (originally written for Bitu, but other Python tools also use it). It should be helpful for writing the dump script.
Using ldap-maint1001 has the benefit that it already does r/w changes to the r/w slapd servers. Currently we don't restrict that, but we've been gradually shifting r/o access only to the replicas and I'd like to come to a state where the only r/w changes to our LDAP are coming from Horizon (for cloud VPS access management), ldap-maint and Bitu and then all other hosts in production get access denied via firewall rules.
One thing that we could do is to
I'll take care of this when I'm back from sabbatical.
The old nodes have been decommissioned, all done.
Why bullseye, this should be bookworm? docker-registry is packaged in Debian, so we can simply use bookworm and use the package from it. In fact, we are already using the bookworm package on the existing registry hosts (2.8.2+ds1-1)
Wed, Jun 26
Tue, Jun 25
Indeed, CGO_ENABLED=0 rings a bell.
The dependency is added because some feature in the compiled Go code uses syscalls which were only wired up in 2.34 (maybe openat() et al.). We ran into this problem before and there was a Go build flag to force it to use a fallback. I can't find a reference currently, but maybe Filippo remembers when he's back.
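To illustrate the version constraint described above, here is a minimal Python sketch that checks which Debian releases ship a glibc new enough for a dynamically linked binary requiring 2.34. The 2.34 threshold comes from the comment; the per-release glibc versions and all function names are assumptions added for illustration, not taken from the thread.

```python
# Sketch only: the release-to-glibc mapping below is an assumption added
# for illustration (stock Debian versions), not data from the thread.
REQUIRED_GLIBC = "2.34"  # minimum version named in the comment

DISTRO_GLIBC = {
    "buster": "2.28",
    "bullseye": "2.31",
    "bookworm": "2.36",
}


def parse_version(text):
    """Turn a dotted version string such as '2.34' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_glibc(distro, required=REQUIRED_GLIBC):
    """Return True if the release's glibc is at least the required version."""
    return parse_version(DISTRO_GLIBC[distro]) >= parse_version(required)


if __name__ == "__main__":
    for name in DISTRO_GLIBC:
        print(name, satisfies_glibc(name))
```

Comparing parsed tuples rather than raw strings avoids the pitfall where "2.4" would sort after "2.34" lexically.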
The Buster instances have been removed.
Mon, Jun 24
Given that this is a Go static ELF we can also simply build on bookworm and copy over the deb to bullseye-wikimedia; we're doing this for other exporters as well. buster might be tricky due to its old libc6, but we can also ignore it: there are fewer than 150 hosts left and they can simply live with the old IPMI monitoring.
Very nice!
CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be needed anymore for the idp servers, I'll tick them off.
Thu, Jun 20
Did one of these changes possibly break PCC here?
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/3739/console
And prior to the migration, puppetserver1001 needs to be allowed in profile::tcpircbot
The task can be closed, or is there anything still open?
After the change gets made, kerberos_kadmin_keytabs_repo in hieradata/common.yaml needs to be adapted to point to puppetserver1001
I've pointed the irc.w.o CNAME to irc1002 and rebooted irc1001 to force a failover to 1002. All bots which were connected to #de.wikipedia reconnected to the new node.
Wed, Jun 19
Tue, Jun 18
Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001.
Happened once more on ganeti2029 today. We're gradually moving nodes to Bookworm (the routed cluster and magru cluster are already running it and the next refreshes in codfw/eqiad will also immediately be added with Bookworm), so hopefully the more recent kernel/DRBD addresses this bug.
Mon, Jun 17
One other option: Add a separate wrapper define systemd::timer::job_capped which has the timeout as a mandatory argument (without a default). And then reach out to SRE teams to migrate the jobs based on what the respective use cases need.
The LDAP management parts have been split off to the new ldap-maint1001/ldap-maint2001 hosts.
Fri, Jun 14
I'll look into a Java 21 backport for Bookworm.
It seems there are two parts to this migration: The etcd internal cert which will move to PKI along with the v3 migration, but there's also the cert used for the TLS termination. For the latter John made a patch back in 2022 to move it to the discovery PKI cert (https://gerrit.wikimedia.org/r/c/operations/puppet/+/790657) and it seems to me that part could already be addressed independently by rebasing/adapting/merging John's patch? (Given this is a core service one could also adapt the patch and introduce an additional Hiera flag to allow initially moving only one of the conf* nodes instead of the full cluster)
The old ping servers have been decommed, closing.
Thu, Jun 13
We can close this, the new established procedure is that all servers which get moved from ferm to nftables are rebooted.
Wed, Jun 12
The routers in eqiad have been reconfigured to use ping1004 (confirmed with tcpdump) instead of ping1003. I'll decom the old nodes on Friday.
Can you please uninstall openjdk-8-* on the migrated clusters? (simply run dpkg --remove openjdk-8-jdk openjdk-8-jre openjdk-8-jdk-headless openjdk-8-jre-headless via Cumin). This avoids confusion the next time we roll out new Java security updates.
Tue, Jun 11
The routers in codfw have been reconfigured to use ping2004 (confirmed with tcpdump) instead of ping2003.
Mon, Jun 10
All VMs moved off the server. DC ops, can you please have a look? Not sure what "unsupported event" means, never seen that before? Server can be powered off for analysis any time.
I think we should rather base this on a given kernel version? Seems more robust than a given date.
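A kernel-version check of the kind suggested above could look roughly like the following Python sketch. It is only an illustration: the function names and the example threshold are hypothetical, and the thread does not say which kernel version or which tool the check would actually live in.

```python
import re
from typing import Tuple


def parse_kernel_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a release string like '6.1.0-21-amd64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '4.19') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def kernel_at_least(release: str, minimum: Tuple[int, int, int]) -> bool:
    """Return True if the running kernel is at or above the minimum version."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    import platform

    # Hypothetical threshold, chosen only for the example.
    print(kernel_at_least(platform.release(), (6, 1, 0)))
```

Keying the check on the parsed version tuple means a host qualifies as soon as it runs a new enough kernel, regardless of when it was installed or rebooted.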
Given an alternative solution was found for etcd, closing this one (after checking with Janis)
Fri, Jun 7
All bastions are on bookworm now