Grant slightly broader access to Klaxon
Open, Needs Triage, Public

Description

Currently, Klaxon is available to LDAP members of wmf, wmde, nda, and ops. When anyone else needs to manually page the SRE team, they have to find someone in one of those groups and ask for help.

It's intentional that access is limited: Klaxon can be used to page us when we're not working, and that should be available only to people we trust to use it properly. Even well-intended but incorrect usage, like paging SRE for something that another team needs to handle, would be serious. SREs are able to react urgently to paging alerts because they're rare. If we got spurious pages more than very occasionally, it would tend to create alert fatigue and reduce responsiveness for critical issues, on top of being an unfair intrusion into non-working hours.

But for appropriately trusted community members, Klaxon access would make it easier to get hold of us when we do need to act. Granting access to stewards might be a reasonable place to start, particularly because some stewards have had an easy time getting NDA access and getting access to the tool that way (but full NDA access isn't strictly necessary here).

Separately, we may also want to be able to add trusted individuals such as wiki admins ad hoc, while those individuals are working on active abuse incidents that might require them to page us.

Things to do here, as far as I can tell:

  • Find out whether there's agreement in principle, probably via discussion at the SRE meeting, that we're comfortable with a larger number of trusted community members being able to reach us directly in an emergency
  • Create at least an LDAP group for individual Klaxon access (klaxon-users?) and possibly also a more descriptive group like stewards for role-based access
  • Grant those groups access to Klaxon in puppet and update Klaxon docs
  • Update SRE clinic duty docs so that clinicians can handle steward access requests
  • Communicate to stewards that they can use Klaxon to page us, and what they should use it for; include instructions to create a developer account and get it added

Event Timeline

Only two blockers were raised at the August 7 SRE meeting:

  • Training/docs: We should make sure that anyone who can use Klaxon fully understands what situations it should and shouldn't be used for. The existing language at wikitech:Klaxon and on Klaxon's UI itself is intended to cover this, but we'll make sure it's still sufficiently clear to a broader audience.
  • Technical: If we can control access directly via steward usergroup membership, then we won't have to manually sync access and add a bunch of clinic duty tickets. The downside is that Klaxon currently only uses developer SSO accounts, so some extra engineering might be required. (This might increase Klaxon's dependency on the rest of our infrastructure, but as previously documented (§Klaxon is hosted on...) that's okay; existing monitoring will still automatically alert us to the kinds of sweeping outages that would threaten Klaxon.)

Both of those should be workable. Any further SRE feedback is still welcome (including feedback like "I think we shouldn't do this at all, because...") and I'm happy to proxy it anonymously on behalf of anyone who wants to share it with me confidentially.

One issue that I raised, but that perhaps was not captured anywhere, is adding some guidance to the documentation on how the folks being paged can communicate with the person who used Klaxon. For instance, should I assume the person using Klaxon is on IRC and that there is a channel we can chat in about the incident?

Is keeping an LDAP group up-to-date with all the stewards something the new IDM could possibly do in the future?

> One issue that I raised, but that perhaps was not captured anywhere, is adding some guidance to the documentation on how the folks being paged can communicate with the person who used Klaxon. For instance, should I assume the person using Klaxon is on IRC and that there is a channel we can chat in about the incident?

Oops, sorry for missing that. The form does say "include how to best get in touch with you," and it's probably already going to be on the user's mind too -- they're there because they want to hear from you, pretty urgently!

But if they forget, their LDAP email address is automatically included in the page, so you could get hold of them that way. (Depending on how we set this up, "LDAP email address" might be replaced with something else, but your point is a good one, and we should make sure to replace it with some sort of direct contact information, not just a username.)

> Is keeping an LDAP group up-to-date with all the stewards something the new IDM could possibly do in the future?

Definitely plausible. But it won't be a feature of the new IDM at launch, and we don't want the new IDM to be a blocker for any progress on this task, so we should probably save that for later and build something without it in the nearer term even if it's imperfect. (Adding @SLyngshede-WMF and @joanna_borun to confirm that sounds right.)

Email doesn't seem like a great way to communicate for page-worthy incidents. Would it be possible to insist on IRC and have them enter their handle as part of the Klaxon flow, since we already recommend they chat with us in #wikimedia-sre?

I'd want @CDanis to weigh in on that, since it's really a Klaxon design decision, but personally I don't think a required field is the right solution. Email isn't our usual medium, but it's better than nothing: if you imagine a trusted user who needs to report a genuine emergency, but doesn't use IRC, they wouldn't be able to contact us at all. Either they'd just give up, or they'd have to go learn what an IRC client is before even being able to tell us something is wrong.

We wouldn't want to exclusively use email during an incident, but I think it's a reasonable way to get in touch and bootstrap to something snappier ("hi, I got your message, please join us in #-sre") -- and I also think it'll be rare that we get a page from someone who doesn't mention it in the first place.

But I get that there are conflicting priorities here ("it should be easy to reach us" vs. "it should be easy for us to reply", both of which are right!) and I'm open to discussing -- and also open to revisiting later if it turns out we get more pages like that than I thought.

> Definitely plausible. But it won't be a feature of the new IDM at launch, and we don't want the new IDM to be a blocker for any progress on this task, so we should probably save that for later and build something without it in the nearer term even if it's imperfect. (Adding @SLyngshede-WMF and @joanna_borun to confirm that sounds right.)

If it's just a matter of managing an LDAP group, then that's perfectly within scope of the IDM. It's one of the features that already have a first-attempt implementation. How users are added to the LDAP group until this feature is released is less important, as we're importing/reading data from LDAP.

Yeah, for sure -- I meant specifically managing an LDAP group by syncing it from a MediaWiki usergroup, in this case the stewards global group. Is anything like that on the roadmap?

(I understand that's not a fully-specified feature request, since some important details like account mapping are left out -- if we go that route we can discuss more in another task.)

I'd like to highlight that I filed T344164: VMs requested for stewards a few days ago. My goal is to automate steward/functionaries onboarding needs with a script on that future production VM (there's a whole lot of credentials to issue to new stewards, unfortunately). In theory, that script could have LDAP write permissions and handle the LDAP part as well. Alternatively, it could expose a list of stewards internally to the cluster, and IDM (or whatever) could read it and automatically give stewards additional privileges. Would that be helpful?

As a first step, I don't think we need to give all stewards access to Klaxon immediately. This is already the case for technical-ish permissions (such as access to the Toolforge tool for stewards). Since stewards have a paging mechanism of their own, a non-privileged steward who needs to use Klaxon could page other stewards and escalate the problem that way.

Certainly, that is not an ideal long-term solution, but it is definitely an improvement over the status quo. Assuming SRE clinic docs get updated, the group of stewards with access could grow organically for now. Once an automation system for onboarding exists (from T344164 or elsewhere) that we can wire into, we would be able to easily expand access to all stewards. Would that make sense?

@Urbanecm what's the status of the stewards VM / onboarding tool?

The VM is up and running, and we're currently experimenting with automatically managing stewards-l in Mailman with it (as of now, it runs in a dry-run mode). If that sounds good to you, we can start with something in LDAP as well. What do you think?

That sounds good to me, maybe check in with @SLyngshede-WMF about the LDAP sync (assuming he's taken that on while @MoritzMuehlenhoff is out?)

@SLyngshede-WMF curious to hear what possibilities we have for automatically granting LDAP access from stewards1001. Would it be helpful if we generated a list of developer accounts somewhere on that machine? Or should we do something similar?

One thing that we could do is to

  • Write a script which parses the complete stewards list from $DATASOURCE and retrieves the shell UIDs of their Wikimedia Developer accounts
  • Add a systemd timer on ldap-maint1001/ldap-maint2001 (we already have a flag to only run timers on the active node) which updates cn=klaxon-users based on this list
  • ldap-maint already has r/w access to LDAP via cn=scriptuser (credentials can be read from /etc/ldapvi.conf on ldap-maint)

Using ldap-maint1001 has the benefit that it already does r/w changes to the r/w slapd servers. Currently we don't restrict that, but we've been gradually shifting r/o access only to the replicas and I'd like to come to a state where the only r/w changes to our LDAP are coming from Horizon (for cloud VPS access management), ldap-maint and Bitu and then all other hosts in production get access denied via firewall rules.

Thanks Moritz, that sounds great to me.

@Urbanecm are you interested in writing some patches if I do code reviews?

And FYI, https://gerrit.wikimedia.org/g/operations/software/bitu-ldap is a wrapper for simplifying LDAP operations within Wikimedia (originally written for Bitu, but other Python code also uses it). Should be helpful for writing the dump script.

> One thing that we could do is to
>
>   • Write a script which parses the complete stewards list from $DATASOURCE and retrieves the shell UIDs of their Wikimedia Developer accounts

No problem! I added support for LDAP into the onboarder system I'm creating. I have the following ready:

[urbanecm@stewards1001 ~]$ ls /srv/exports
[urbanecm@stewards1001 ~]$ python3 /srv/repos/onboarding-system/onboarder.py update
== Updating mailman_list
== Updating ldap_group
[urbanecm@stewards1001 ~]$ ls /srv/exports
ldap_group  mailman_list
[urbanecm@stewards1001 ~]$ cat /srv/exports/ldap_group/steward.txt 
anticomposite
base
deltaquad
ep1c
h2o
hasley
hoo
melos
radimer
superpes
urbanecm
[urbanecm@stewards1001 ~]$

Design-wise, this is similar to automatically managing mailman lists (which generates datasets into /srv/exports/mailman_list, which is then rsynced to the lists hosts and processed). The list of UIDs is not yet complete, but as the datasource (stewards1001:/srv/repos/users-db) grows, it would be.

This is generated via https://gitlab.wikimedia.org/repos/stewards/onboarding-system; config for who gets access to what is at https://gitlab.wikimedia.org/repos/stewards/onboarding-system/-/blob/main/config/roles.yaml?ref_type=heads.
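
Because the export files are picked up by rsync from another host, it may help to replace each file in a single step so that a sync never sees a half-written list. The following is only a sketch of that idea and is not taken from onboarder.py: the function name `write_export` is hypothetical, and the file mode assumes the reader runs as a different user.

```python
import os
import tempfile


def write_export(path, uids):
    """Write one UID per line (sorted, de-duplicated) via an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    body = "".join(f"{uid}\n" for uid in sorted(set(uids)))
    # Create the temporary file next to the target so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body)
        # mkstemp creates files as 0600; widen the mode so that another
        # user (assumed here: whatever serves the directory) can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


# Demonstration in a throwaway directory (not the real /srv/exports).
with tempfile.TemporaryDirectory() as export_dir:
    target = os.path.join(export_dir, "steward.txt")
    write_export(target, ["hoo", "base", "hoo"])
    with open(target, encoding="utf-8") as handle:
        written = handle.read()

print(written)
```

A consumer that syncs the directory mid-update would then see either the old list or the new one, never a truncated file.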

>   • Add a systemd timer on ldap-maint1001/ldap-maint2001 (we already have a flag to only run timers on the active node) which updates cn=klaxon-users based on this list

I'm not sure how updating cn=klaxon-users would work. We can certainly automatically add all new accounts from the txt file I'm generating, but how would removal work? If someone is in cn=klaxon-users, but not in the txt file, how would we know whether they were added to klaxon-users by virtue of being a steward, or because of some other process (that is not related to stewards onboarding at all)? Wouldn't we need to create cn=stewards instead, which would be fully managed by the syncing process, and leave klaxon-users for manual additions instead? Or am I misunderstanding some of this?

> @Urbanecm are you interested in writing some patches if I do code reviews?

I wrote some code for the stewards end of things, but I'm lost as to what I would need to do to actually get the changes synced to LDAP. If you can offer some guidance, I can try though :).

This is great, thanks. One remaining piece here is to figure out how best to get this data to a context where we can edit LDAP like ldap-maint1001. Maybe the simplest way is using rsync::server::module to expose the /srv/exports/ldap_group or /srv/exports directory?

> I'm not sure how updating cn=klaxon-users would work. We can certainly automatically add all new accounts from the txt file I'm generating, but how would removal work? If someone is in cn=klaxon-users, but not in the txt file, how would we know whether they were added to klaxon-users by virtue of being a steward, or because of some other process (that is not related to stewards onboarding at all)? Wouldn't we need to create cn=stewards instead, which would be fully managed by the syncing process, and leave klaxon-users for manual additions instead? Or am I misunderstanding some of this?

Your proposal is better, let's do that.
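
To make the reasoning concrete: with a group that only the sync process writes to, removal is unambiguous, because anyone in the group but missing from the export must have been added by the sync. A minimal sketch of that set logic follows. The names `plan_managed_sync` and `GroupChanges` and the sample data are hypothetical, and the empty-export guard is an extra precaution that was not discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupChanges:
    to_add: frozenset
    to_remove: frozenset


def plan_managed_sync(exported_uids, current_members):
    """Return the changes that make a fully managed group match the export."""
    wanted = frozenset(exported_uids)
    current = frozenset(current_members)
    if not wanted:
        # An empty export more likely means a broken data source than
        # "there are no stewards", so refuse instead of emptying the group.
        raise ValueError("export is empty; refusing to remove every member")
    return GroupChanges(to_add=wanted - current, to_remove=current - wanted)


# Hypothetical data: the export file contents and the current group members.
exported = ["base", "hoo", "urbanecm"]
stewards_now = ["base", "former-steward"]

changes = plan_managed_sync(exported, stewards_now)
print(sorted(changes.to_add))     # ['hoo', 'urbanecm']
print(sorted(changes.to_remove))  # ['former-steward']
```

The same function applied to a shared group like klaxon-users would wrongly remove manually added members, which is why the managed group is kept separate.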

> I wrote some code for the stewards end of things, but I'm lost as to what I would need to do to actually get the changes synced to LDAP. If you can offer some guidance, I can try though :).

A Python script that is able to read a text file and perform LDAP group membership updates is probably enough. There are a few examples in our puppet repo -- one simple example is rewrite-group-for-memberof.py. And offboard_user is a more complicated one. And actually, sync_ldap_project_group() in ldapgroups.py is quite close to the logic you want (just again more complicated -- here we shouldn't create the group if it doesn't already exist).

If you aren't already familiar with LDAP this is probably too much to ask, though, and I could find some time for it soon.
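
As a rough illustration of the shape such a script could take, the sketch below reads the exported list and renders the membership change as an LDIF modify record, which could then be reviewed or fed to ldapmodify. It deliberately avoids bitu-ldap and the puppet-repo helpers mentioned above, whose APIs are not shown here. The DN layout, the `member` attribute, and the function names are assumptions to check against the real directory. A modify record only changes an existing entry, so it fails rather than creating a missing group.

```python
import re

# Assumed DN layout, for illustration only; verify against the real directory.
PEOPLE_BASE = "ou=people,dc=wikimedia,dc=org"
GROUP_DN = "cn=stewards,ou=groups,dc=wikimedia,dc=org"

UID_RE = re.compile(r"[a-z0-9][a-z0-9._-]*")


def read_uid_list(text):
    """Parse an export file: one shell UID per line, blank lines ignored."""
    uids = [line.strip() for line in text.splitlines() if line.strip()]
    for uid in uids:
        # Reject anything that would need escaping inside a DN.
        if not UID_RE.fullmatch(uid):
            raise ValueError(f"unexpected characters in UID: {uid!r}")
    return uids


def render_group_ldif(add_uids, remove_uids,
                      group_dn=GROUP_DN, people_base=PEOPLE_BASE):
    """Return an LDIF modify record for the group, or '' if nothing changes."""
    lines = []
    for op, uids in (("add", add_uids), ("delete", remove_uids)):
        if not uids:
            continue
        lines.append(f"{op}: member")
        lines.extend(f"member: uid={uid},{people_base}" for uid in sorted(uids))
        lines.append("-")  # terminates this modification
    if not lines:
        return ""
    return "\n".join([f"dn: {group_dn}", "changetype: modify", *lines]) + "\n"


# Hypothetical inputs: export file contents and the group's current members.
exported = set(read_uid_list("base\nhoo\n\nurbanecm\n"))
current = {"base", "former-steward"}

ldif = render_group_ldif(add_uids=exported - current,
                         remove_uids=current - exported)
print(ldif)
```

Emitting LDIF first also gives a natural dry-run mode: the timer could log the record for a while before anything is applied.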

> This is great, thanks. One remaining piece here is to figure out how best to get this data to a context where we can edit LDAP like ldap-maint1001. Maybe the simplest way is using rsync::server::module to expose the /srv/exports/ldap_group or /srv/exports directory?

FWIW, the whole /srv/exports directory is already exported via rsync::server::module (for lists servers). I think we only need to add ldap-maint1001 to hosts_allow and add an rsync timer to ldap-maint, but I might be missing something here.

> Your proposal is better, let's do that.

Ack.

> If you aren't already familiar with LDAP this is probably too much to ask, though, and I could find some time for it soon.

Unfortunately, my LDAP experience is very limited – I know how to make basic searches using ldapsearch, but I never tried updating LDAP groups via a script. So, not sure if I'm the best choice for this.