Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI
Open, Needs Triage, Public

Description

The etcd-v3.[eqiad|codfw].wmnet certs used by Nginx on the conf* hosts are currently signed by the old Puppet 5 CA, via the sslcert::certificate() define and cergen. They need to be moved to the PKI before the conf servers can be migrated to Puppet 7.
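As a rough way to confirm which CA issued the cert a given endpoint serves (before and after a move to PKI), something along these lines could work. This is only a sketch: the host, port, and CA bundle path passed to `fetch_peer_cert` are up to the caller and nothing here is taken from our Puppet code. Note that Python only returns populated certificate fields when the cert validates against the supplied bundle.

```python
import socket
import ssl
from typing import Optional


def fetch_peer_cert(host: str, port: int, cafile: str) -> dict:
    """Connect over TLS and return the validated peer certificate as a dict.

    cafile must contain a CA able to validate the served certificate;
    getpeercert() returns an empty dict for unvalidated certificates.
    """
    ctx = ssl.create_default_context(cafile=cafile)
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def issuer_common_name(cert: dict) -> Optional[str]:
    """Extract the issuer commonName from a getpeercert()-style dict."""
    # 'issuer' is a tuple of RDNs, each a tuple of (key, value) pairs.
    for rdn in cert.get("issuer", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```

Usage would be `issuer_common_name(fetch_peer_cert(host, port, cafile))`, comparing the result against the expected issuing CA name.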

profile::etcd::v3 needs to switch to PKI certs as well; there is a Hiera flag, use_pki_certs, for that purpose (already in use by the other etcd clusters we run).

Event Timeline

When we make the change, it will require a restart of etcd on the nodes.

We will need to perform the change and then issue a restart of all pybals connected to the specific server, and when we're done, also restart all confd instances.

After doing the first server, I'd keep an eye on errors from mediawiki as well.

To be clear: this isn't a small change with limited impact, it should be done outside of change freeze periods.
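The ordering described above can be sketched as a small plan generator. This is illustrative only: the server and pybal host names in the example are hypothetical placeholders, and the real mapping of pybals to conf servers would need to come from the actual configuration.

```python
from typing import Dict, List


def rollout_steps(pybals_by_server: Dict[str, List[str]]) -> List[str]:
    """Return the ordered steps for a server-by-server rollout.

    pybals_by_server maps each conf server to the pybal hosts connected to it.
    """
    steps = []
    for index, (server, pybals) in enumerate(pybals_by_server.items()):
        steps.append(f"switch {server} to the new cert and restart etcd")
        for pybal in pybals:
            steps.append(f"restart pybal on {pybal} (connected to {server})")
        if index == 0:
            # After the first server, pause and look at mediawiki errors.
            steps.append("check mediawiki errors before continuing")
    # confd is restarted once, after every server has been done.
    steps.append("restart all confd instances")
    return steps


# Hypothetical host names, for illustration only.
plan = rollout_steps({"conf-a": ["lvs-a"], "conf-b": []})
```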

MoritzMuehlenhoff renamed this task from Migrate etcd::tlsproxy Nginx certs to PKI to Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI. Nov 30 2023, 2:09 PM
Scott_French subscribed.

My initial plan was to move etcd to PKI as part of the v3 API migration (T350565), which is also likely to do away with the TLS proxy.

However, the ETA for that is likely measured in "months from now" so I can explore this earlier if needed.

Took a closer look at this today: This should be trivial in the case where we turn up the new v3-API-only etcd cluster using PKI from day 1.

Naively, my main concern would be support for runtime cert reload (for rotation w/o node restarts), but that's been supported since v3.2 [0] (we're on v3.3.25, as are the etcd clusters that support various k8s deployments, which are already on PKI, so no surprise there).

The other concern that comes to mind is eventual rotation of the intermediate CA certs, as etcd is known to have issues with non-disruptive reloading of trusted CA bundles to support rotation [1] (relevant to client auth for peer-peer communication in our case).

Looking more closely at what we actually do, this should not be a problem either: in profile::etcd::v3, the trusted CA cert is simply the internal root CA cert, while the client (peer) certs are chained (i.e., including the intermediate).
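To illustrate why trusting only the root makes intermediate rotation a non-issue, here is a toy model. It matches issuer and subject names only and performs no signature or expiry checks, so it is not how etcd (or Go's TLS stack) validates anything; all cert names are made up.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Cert:
    subject: str
    issuer: str


def chain_is_trusted(chain: List[Cert], trusted: Set[str]) -> bool:
    """chain is leaf first, then any intermediates the peer presents."""
    if not chain:
        return False
    # Each presented cert must be issued by the next one in the chain.
    for child, parent in zip(chain, chain[1:]):
        if child.issuer != parent.subject:
            return False
    # The last presented cert must be issued by a trusted CA.
    return chain[-1].issuer in trusted


# Hypothetical names. Peers present leaf + intermediate; we trust the root.
old_chain = [Cert("peer", "intermediate-old"), Cert("intermediate-old", "root")]
new_chain = [Cert("peer", "intermediate-new"), Cert("intermediate-new", "root")]

# Root-only trust: both chains validate, no trust bundle reload needed.
root_only_ok = (chain_is_trusted(old_chain, {"root"}),
                chain_is_trusted(new_chain, {"root"}))

# If the intermediate itself were the trust anchor and peers sent only the
# leaf, a rotated intermediate would fail until the bundle was reloaded.
pinned_ok = chain_is_trusted([Cert("peer", "intermediate-new")],
                             {"intermediate-old"})
```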

Open questions:

  • Is there any value in creating a new intermediate, separate from "etcd" used by clusters supporting k8s?
  • Is there any value in creating distinct certs for client vs. peer connections? (analogous to what we have now)

[0] https://etcd.io/docs/v3.3/op-guide/security/#notes-for-tls-authentication

[1] https://github.com/etcd-io/etcd/issues/11555

Is there any value in creating a new intermediate, separate from "etcd" used by clusters supporting k8s?

The main benefit here would be decoupling between rather different use cases.

For example, this would make it easy to have different (default) signing policies (e.g., expiry). Then again, the latter doesn't really require a new intermediate - we could just add a new signing profile.

The other possible benefit is providing a boundary for client auth between etcd peers. Whether that's really meaningful or not given the realities of our environment I'll need to explore a bit more.

Is there any value in creating distinct certs for client vs. peer connections? (analogous to what we have now)

Note: "analogous to what we have now" is referring specifically to the main etcd cluster (i.e., we have clients connect via the nginx TLS proxy, which uses distinct certs). In k8s, it's the same certs.

The main benefit of separating them would again be different signing policies. IMO, that's not something we need from day 1.

In any case, I think at this point the PoR is to migrate to PKI as part of the v3 API migration, possibly using a different intermediate (this decision has no bearing on the timeline, though).

It seems there are two parts to this migration: the etcd internal cert, which will move to PKI along with the v3 migration, and the cert used for TLS termination. For the latter, John made a patch back in 2022 to move it to the discovery PKI cert (https://gerrit.wikimedia.org/r/c/operations/puppet/+/790657), and it seems to me that part could already be addressed independently by rebasing/adapting/merging John's patch? (Given this is a core service, one could also adapt the patch and introduce an additional Hiera flag so that initially only one of the conf* nodes is moved instead of the full cluster.)

Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.

The TLS proxy will go away with the v3 migration, since its primary use case will be absorbed into etcd itself (role-based access control). Thus, I think the main question is whether the effort is worth it.

I don't have a great answer for that, but if it would unblock some part of your puppet 5 deprecation work (which is not also blocked on moving etcd to PKI), then I'm happy to explore doing this sooner. Also, if the still-loose timeline for the v3 migration is a problem, I can look into fully migrating off cergen sooner (i.e., etcd included).

If we were to migrate the TLS proxy on its own, then adapting John's patch looks like a good way to go.

A couple of thoughts (mainly notes to myself):

  • My assumption is that, in order to get certain long-lived clients onto the new cert and validate their behavior, we'll need to restart either those clients or nginx (i.e., shut all connections). pybal and confd come to mind. etcd-mirror is an interesting case, as it actually does not reuse connections across watch cycles (urllib3 will drop them upon hitting the socket read timeout).
  • etcd itself should not know or care about the change, as it should never be directly accessing the (TLS proxy) advertised client URLs (it does connect to itself for the gRPC gateway, but always using the "true" client port).
  • conf1008 may be a good canary host, as it's not currently used by any pybal or etcd-mirror (replicating from conf1009 in eqiad).
  • conf1009 will require some extra care, verifying not just pybals (drmrs, esams, magru) but also etcd-mirror (which will exit if the upstream connection is broken and, IIRC, will need to be restarted manually).
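The canary reasoning in the notes above boils down to "start with the host that has the fewest long-lived clients". A minimal sketch, using only the two hosts and clients named above (other conf hosts are deliberately left out, since their clients are not listed here):

```python
from typing import Dict, List


def rollout_order(clients_by_host: Dict[str, List[str]]) -> List[str]:
    """Order hosts by number of long-lived clients, fewest first."""
    return sorted(clients_by_host,
                  key=lambda host: (len(clients_by_host[host]), host))


# Long-lived clients per host, as described in the notes above.
clients_by_host = {
    "conf1009": ["pybal drmrs", "pybal esams", "pybal magru", "etcd-mirror"],
    "conf1008": [],
}

order = rollout_order(clients_by_host)
to_verify_after_conf1009 = clients_by_host["conf1009"]
```

The first entry in `order` is the canary; the client list for each later host doubles as the checklist of things to verify (and possibly restart) after switching it.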

Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.

The TLS proxy will go away with the v3 migration, since its primary use case will be absorbed into etcd itself (role-based access control). Thus, I think the main question is whether the effort is worth it.

Ah, thanks. I wasn't aware of that.

I don't have a great answer for that, but if it would unblock some part of your puppet 5 deprecation work (which is not also blocked on moving etcd to PKI), then I'm happy to explore doing this sooner. Also, if the still-loose timeline for the v3 migration is a problem, I can look into fully migrating off cergen sooner (i.e., etcd included).

No, I think let's avoid this intermediate step, then. Ideally the migration to etcd v3 would happen by the end of August; not sure how realistic that is? The reliance on the old certs blocks the shutdown of the legacy Puppet 5 servers (which at this point also still have a few other users, like the remaining mediawiki buster nodes), so if this were cleared up in the next two months, we could aim to shut down the Puppet 5 servers in September.