Let's install this server directly with Puppet 7; at this point there should be no Puppet 5/7 compatibility issues in the deployment-server manifests.
Fri, Jun 28
Thu, Jun 27
And FYI, https://gerrit.wikimedia.org/g/operations/software/bitu-ldap is a wrapper for simplifying LDAP operations within Wikimedia (originally written for Bitu, but other Python tools use it as well). It should be helpful for writing the dump script.
In T354855#9931347, @Clement_Goubert wrote: This issue is biting us again; the time between a puppet run that fails to start ferm.service and the subsequent run is enough to cause issues with eventgate, leading to flurries of "Can't enqueue job" errors in mediawiki. It is in particular triggered when we add new kubernetes nodes, because that causes all other kubernetes nodes in the cluster to update their ferm rules.
Would adding Restart=on-failure to ferm.service for kubernetes workers be a possible short-term solution, @MoritzMuehlenhoff?
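A sketch of what that could look like in Puppet (an assumption, not a reviewed change; it presumes the systemd::unit define from operations/puppet supports drop-in overrides via override => true):

```
# Drop-in override restarting ferm automatically when it fails to start.
# Sketch only; resource name and define behaviour are assumptions.
systemd::unit { 'ferm.service':
    override => true,
    content  => @(OVERRIDE),
        [Service]
        Restart=on-failure
        RestartSec=5s
        | OVERRIDE
}
```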
Using ldap-maint1001 has the benefit that it already does r/w changes to the r/w slapd servers. Currently we don't restrict that, but we've been gradually shifting r/o access only to the replicas and I'd like to come to a state where the only r/w changes to our LDAP are coming from Horizon (for cloud VPS access management), ldap-maint and Bitu and then all other hosts in production get access denied via firewall rules.
One thing that we could do is to
In T352245#9894693, @Scott_French wrote: Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.
The TLS proxy will go away with the v3 migration, since its primary use case will be absorbed into etcd itself (role-based access control). Thus, I think the main question is whether the effort is worth it.
I'll take care of this when I'm back from sabbatical.
The old nodes have been decommissioned, all done.
Why bullseye? This should be bookworm. docker-registry is packaged in Debian, so we can simply use bookworm and the package from it. In fact, we are already using the bookworm package on the existing registry hosts (2.8.2+ds1-1).
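For reference, verifying the candidate version on a bookworm host is a one-liner; 2.8.2+ds1-1 is the version already running on the existing registry hosts:

```
# Show which docker-registry version plain bookworm ships:
apt-cache policy docker-registry
```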
Wed, Jun 26
Tue, Jun 25
Indeed, CGO_ENABLED=0 rings a bell.
In T365165#9921708, @Jclark-ctr wrote: @MoritzMuehlenhoff would you be able to update the site.pp file for this server?
The dependency is added because some feature in the compiled Go code uses syscalls which were only wired up in glibc 2.34 (maybe openat() et al). We ran into this problem before, and there was a Go build flag to force a fallback. I can't find a reference currently, but maybe Filippo remembers when he's back.
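For illustration, with the CGO_ENABLED=0 flag mentioned above the build would look roughly like this (the binary name is a placeholder):

```
# Disabling cgo makes the Go toolchain use its pure-Go syscall fallbacks
# instead of linking against the host's glibc wrappers.
CGO_ENABLED=0 go build -o prometheus-foo-exporter .
file prometheus-foo-exporter   # should report "statically linked"
```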
The Buster instances have been removed.
In T367554#9899621, @jbond wrote: Hi all, I wanted to say that the sso project is used so that users have an SSO testing infrastructure to use in cloud services. Originally this was also used to provide SSO to production-like services in cloud services, however this latter functionality has been moved.
If there is still a desire to keep a development environment then we will still need all these machines:
- puppetprimary.sso.eqiad1.wikimedia.cloud: The project uses its own puppet master as we have secrets
- sso-db.sso.eqiad1.wikimedia.cloud: This is a mysql db used to store e.g. mfa keys
- sso-pdb.sso.eqiad1.wikimedia.cloud: A PuppetDB instance; not sure why this is used, but guessing the idp classes somehow need some puppetdb functionality.
Mon, Jun 24
In T367757#9905191, @Dzahn wrote: Then the next steps will be creating an SSH key and signing the access agreement. Here are the details:
In T367757#9913302, @kamila wrote:
Given that this is a static Go ELF, we can also simply build on bookworm and copy the deb over to bullseye-wikimedia; we're doing this for other exporters as well. buster might be tricky due to its old libc6, but we can also ignore it: there are fewer than 150 hosts left and they can simply live with the old IPMI monitoring.
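Assuming the apt repo is managed with reprepro and using a placeholder package name, the import would be along these lines:

```
# Build once on bookworm, then import the same .deb into the
# bullseye-wikimedia suite; the static ELF carries no libc6 dependency.
sudo reprepro includedeb bullseye-wikimedia \
    prometheus-foo-exporter_1.0-1_amd64.deb
```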
Very nice!
CAS 7.0 (what we are currently migrating to) removed the memcached backend. As such, this change won't be needed anymore for the idp servers; I'll tick them off.
Thu, Jun 20
Did one of these changes possibly break PCC here?
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/3739/console
And prior to the migration, puppetserver1001 needs to be allowed in profile::tcpircbot.
The task can be closed, or is there anything still open?
After the change gets made, kerberos_kadmin_keytabs_repo in hieradata/common.yaml needs to be adapted to point to puppetserver1001.
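For illustration, the adapted entry might look like this (the exact FQDN is an assumption):

```
# hieradata/common.yaml (sketch; FQDN assumed)
kerberos_kadmin_keytabs_repo: 'puppetserver1001.eqiad.wmnet'
```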
I've pointed the irc.w.o CNAME to irc1002 and rebooted irc1001 to force a failover to irc1002. All bots which were connected to #de.wikipedia reconnected to the new node.
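A quick check that the failover took effect (expected answer assumed from the change described above):

```
# Verify where the alias points after the switch:
dig +short CNAME irc.wikimedia.org
# expected: irc1002.wikimedia.org.
```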
Wed, Jun 19
In T331702#9902698, @MoritzMuehlenhoff wrote: Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001.
However, when joining the #en.wikipedia channel on irc1001 (current) and irc1002 (new) and comparing the events, there's less output on the latter. So there seems to be some issue with the broadcasted mediawiki events (or with the way these are ingested on the new installations).
Tue, Jun 18
Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001.
Happened once more on ganeti2029 today. We're gradually moving nodes to Bookworm (the routed cluster and the magru cluster are already running it, and the next refreshes in codfw/eqiad will also immediately be added with Bookworm), so hopefully the more recent kernel/DRBD addresses this bug.
In T367071#9882394, @Jclark-ctr wrote: @MoritzMuehlenhoff after replacing the failed drive it looked like it might boot, but it still fails. Might need to be reimaged; I do not have root access so I am unable to proceed past this:
You are in emergency mode. After logging in, type "journalctl -xb" to view system logs, "systemctl reboot" to reboot, "systemctl default" or "exit" to boot into default mode.
Give root password for maintenance
(or press Control-D to continue):
Mon, Jun 17
In T366766#9895937, @Aklapper wrote: What to do? Remove the linking from the "LDAP User" entry on a Phab user profile and make that plain text instead?
One other option: add a separate wrapper define systemd::timer::job_capped which has the timeout as a mandatory argument (without a default), and then reach out to SRE teams to migrate the jobs based on what the respective use cases need.
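A rough sketch of the wrapper, with parameter names modelled on systemd::timer::job (assumptions, not the final interface):

```
# Wrapper define: same interface as systemd::timer::job, but the runtime
# cap is mandatory and has no default. Parameter names are assumptions.
define systemd::timer::job_capped (
    String                     $description,
    String                     $command,
    Variant[Hash, Array[Hash]] $interval,
    String                     $user,
    # The point of the wrapper: callers must state a cap explicitly.
    Integer[1]                 $max_runtime_seconds,
) {
    systemd::timer::job { $title:
        description         => $description,
        command             => $command,
        interval            => $interval,
        user                => $user,
        # Hypothetical passthrough; the timeout parameter in
        # systemd::timer::job may be named differently.
        max_runtime_seconds => $max_runtime_seconds,
    }
}
```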
In T365074#9898255, @kamila wrote: @Milimetric also, you provided an SSH key fingerprint, but we need the public key. It should start with something like ssh-rsa AAAA...
In T367544#9897657, @Jelto wrote: packager02.packaging.eqiad1.wikimedia.cloud
According to the etherpad upgrade docs this host is used to build the etherpad Debian package. I also used the host in the past to build the etherpad package. The dedicated host is used because "etherpad builds fetches npm modules during the build time".
In T367487#9891993, @SLyngshede-WMF wrote: I've run a test build; Java 21 is a hard requirement, it cannot be older or newer.
Otherwise the overlay upgrade contains only minor changes. I have not tested the functionality.
The LDAP management parts have been split off to the new ldap-maint1001/ldap-maint2001 hosts.
Fri, Jun 14
I'll look into a Java 21 backport for Bookworm.
It seems there are two parts to this migration: the etcd-internal cert, which will move to PKI along with the v3 migration, and the cert used for TLS termination. For the latter, John made a patch back in 2022 to move it to the discovery PKI cert (https://gerrit.wikimedia.org/r/c/operations/puppet/+/790657), and it seems to me that part could already be addressed independently by rebasing/adapting/merging John's patch. (Given this is a core service, one could also adapt the patch and introduce an additional Hiera flag to initially move only one of the conf* nodes instead of the full cluster.)
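The per-node opt-in could be as small as a host-level Hiera key (flag name and host file are hypothetical):

```
# hieradata/hosts/conf1007.yaml (sketch; flag name made up)
profile::etcd::tlsproxy::use_discovery_cert: true
```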
The old ping servers have been decommed, closing.
Thu, Jun 13
Jun 13 2024
We can close this, the new established procedure is that all servers which get moved from ferm to nftables are rebooted.
Jun 12 2024
The routers in eqiad have been reconfigured to use ping1004 (confirmed with tcpdump) instead of ping1003. I'll decom the old nodes on Friday.
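For reference, the confirmation amounts to watching for the echo requests arriving on the new host (interface name is a placeholder):

```
# Incoming ICMP echo requests on the new ping host:
sudo tcpdump -ni eno1 'icmp and icmp[icmptype] == icmp-echo'
```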
Can you please uninstall openjdk-8-* on the migrated clusters? (Simply run dpkg --remove openjdk-8-jdk openjdk-8-jre openjdk-8-jdk-headless openjdk-8-jre-headless via Cumin.) This avoids confusion the next time we roll out new Java security updates.
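A hedged sketch of the Cumin invocation; the host selector is a placeholder, adjust it to the migrated clusters:

```
sudo cumin 'A:hadoop-worker' \
    'dpkg --remove openjdk-8-jdk openjdk-8-jre openjdk-8-jdk-headless openjdk-8-jre-headless'
```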
Jun 11 2024
In T367071#9881345, @Jclark-ctr wrote: @MoritzMuehlenhoff Can I take the server down to replace the DIMM?
The routers in codfw have been reconfigured to use ping2004 (confirmed with tcpdump) instead of ping2003.
Jun 10 2024
All VMs have been moved off the server. DC ops, can you please have a look? Not sure what "unsupported event" means; I've never seen that before. The server can be powered off for analysis at any time.
I think we should rather base this on a given kernel version; that seems more robust than a given date.
Given that an alternative solution was found for etcd, I'm closing this one (after checking with Janis).
Jun 7 2024
All bastions are on bookworm now.