User Details
- User Since: Jan 18 2024, 5:33 PM (25 w, 1 d)
- Availability: Available
- LDAP User: Scott French
- MediaWiki User: SFrench-WMF
Thu, Jul 11
For the record: v1.0.3 is live in staging only (production is untouched), after it became apparent that additional changes are needed. If a 1.0.4 with an updated swagger spec becomes available, let me know and I'll be happy to assist.
Ah, great - thanks for confirming those older docs will go away, @mforns.
+1 to using a more descriptive name for the resource operated on.
Wed, Jul 10
Cool, it sounds like the conversation has evolved to using a dedicated schema, and we're on the same page that a multi-value set should work (to accommodate reason).
Alright, good(er) news: the service is now live at /api/rest_v1/metrics/commons-analytics.
Great, thank you very much @dcausse for cleaning up the old config and @Clement_Goubert for confirming.
Tue, Jul 9
Ah, thanks for surfacing that, @mforns.
In short, and I realize this doesn't help much, my understanding is that what makes sense as an object name vs. an object tag is really up to you (e.g., ergonomics of tag selectors for common operations).
Found my way here via the highly informative "wdqs-streaming-updater-test-T361935/WMF" user-agent you've used. Thanks for that!
Ah, interesting - I wasn't aware of the prior art with dnsbox. Indeed, reusing node for a fundamentally "host-shaped" thing, where (1) you anticipate eventually using as-yet-unused fields and (2) don't anticipate ever needing to enrich node with new fields, seems less concerning.
@SGupta-WMF - thanks for documenting the API at [0]. One thing I noticed while updating wikitech: it looks like the examples assume the service is reachable at /api/rest_v1/metrics/commons-impact-analytics rather than /api/rest_v1/metrics/commons-impact.
Alright, good news: /api/rest_v1/metrics/commons-impact should now be publicly available.
Thanks for the excellent / detailed write-up, @ssingh!
@mforns - The v1.0.2 image is now live in staging. Please take a look when you get a chance, and let me know if / when you'd like me to proceed with the remaining steps.
Mon, Jul 8
@mforns - The v1.0.1 image is now live in staging. As before, it can be reached internally at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443/ from any production host.
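In case it's useful, here's roughly how I've been poking at it (a minimal sketch; skipping TLS verification is an assumption for staging, and the service may well want a more specific path than the base URL):

```python
# Minimal reachability check for the staging deployment, run from a
# production host. Assumes plain GETs at the base URL are acceptable
# and skips TLS verification (staging cert is internal).
import requests

BASE = "https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443"

resp = requests.get(BASE, timeout=10, verify=False)
print(resp.status_code, resp.text[:200])
```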
Tue, Jul 2
@mforns sure, that's no problem at all! Just let me know when the image is ready.
Thanks for taking a look, @xcollazo. I'll defer to @mforns and @SGupta-WMF here, as my quick check was only based on comparison with [0] (which uses timestamp in the public API).
Thanks for the sample data, @xcollazo.
Mon, Jul 1
The service is up and running in staging, and can be reached at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443 internally.
Fri, Jun 28
I came across this today while looking for prior art on a semi-related theme (invariants to assert on etcd key structure).
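To make the theme concrete, here's the kind of invariant I have in mind (a sketch only; the key schema below is hypothetical, not our actual conftool layout):

```python
# Sketch: asserting structural invariants on etcd-style keys.
# The /conftool/v1/... schema here is hypothetical, purely illustrative.
import re

KEY_PATTERN = re.compile(
    r"^/conftool/v1/(?P<kind>pools|nodes)"
    r"/(?P<dc>[a-z]+)/(?P<cluster>[a-z0-9-]+)/(?P<name>[a-z0-9.-]+)$"
)

def assert_key_shape(key: str) -> dict:
    """Raise ValueError if `key` violates the expected structure."""
    m = KEY_PATTERN.fullmatch(key)
    if m is None:
        raise ValueError(f"malformed key: {key!r}")
    return m.groupdict()

print(assert_key_shape("/conftool/v1/pools/eqiad/appserver/mw1234.eqiad.wmnet"))
```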
Thanks so much, @SGupta-WMF.
Thu, Jun 27
Thanks for giving that a try, @mforns!
Also, I see you hit retry on the failed pipeline. Unfortunately, I just checked in with Ahmon, and a retry is not going to work (sorry about that), as it will only re-run the failed job as-is rather than the full pipeline. It's the latter (full pipeline) that's required after
mw-debug and canaries were updated around 17:23 UTC (alas, I hit enter before adding the message on that scap invocation) and main releases around 17:45. About 40m on, things continue to look good - general service health looks fine, slow-logs work, and dashboards picked up the envoy metrics with the change in cluster name.
Wed, Jun 26
@Clement_Goubert FYI, this was released last week.
Alright, this is now live, and should be sufficient to catch future instances of the scenario that originally motivated this task (missing quota), as well as more general scenarios that manifest as persistent pod unavailability.
@SGupta-WMF - Ahmon merged [0] this morning, so you should be good to go. If you could please hit Retry on the failed pipeline for v1.0.0 [1] that would be greatly appreciated (I can't, since I'm not a member of the project).
@VirginiaPoundstone - I believe there was one tick mark missing (fixed), but otherwise yes, those are up to date. Once an image is available, we should be able to work through the remainder in short order (the patches are already prepared).
Tue, Jun 25
Checking in on the last week of alerts, I'm seeing:
- Occasional unavailability of a zotero pod in eqiad, as described in T366932#9894216 - i.e., one pod gets very busy (high CPU, etc.) and fails readiness checks for 30-40m (while passing liveness checks).
- A trickle of mw pods in eqiad, especially apparent on mw-jobrunner (peaking at 3 pods), failing mediawiki-main-httpd readiness probes (healthz) for ~ 18h after the first of the two (6/22, 6/24) commons-dumps incidents.
Ah, perfect! Thank you @JMeybohm - search-grafana-dashboards.js uncovered one more dashboard to migrate, and an older one that could be deleted (apple-search). I also decided to go ahead and update "mw on k8s - WIP ServiceOps" since it's easy to stumble upon by accident.
Thanks, @SGupta-WMF! Ahmon tends to be quite responsive to reviewing these, and I see he's already on it. I'll keep an eye out for it to be merged today.
Mon, Jun 24
I've manually updated the Prometheus queries that previously limited envoy_cluster_name to "local_service" so they're compatible with the new naming scheme (see the sketch after the list) on the following MW-related dashboards:
- SRE Service Operations > mw-on-k8s Overview
- Service > MediaWiki on k8s
- Service > mw-api-ext
- Service > mw-api-int
- Service > mw-jobrunner
- Service > mw-parsoid
- Service > mw-web
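For illustration, the shape of the change looks roughly like this (a sketch; the mw-.* matcher is a stand-in for the actual new cluster names, and the Prometheus endpoint is hypothetical):

```python
# Sketch of the query change, exercised via the standard Prometheus
# HTTP API. Old queries pinned envoy_cluster_name to "local_service";
# the new matcher also accepts the (hypothetical) mw-* scheme.
import requests

PROM = "https://prometheus.example.wmnet"  # hypothetical endpoint

old = 'envoy_cluster_upstream_rq_total{envoy_cluster_name="local_service"}'
new = 'envoy_cluster_upstream_rq_total{envoy_cluster_name=~"local_service|mw-.*"}'

resp = requests.get(f"{PROM}/api/v1/query", params={"query": new}, timeout=10)
print(resp.json()["status"])
```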
I just spot-checked some of the runPrimaryTransactionIdleCallbacks logs that made it through the 99% throttle in logstash during the "precursor" event starting shortly after 19:30 UTC (logs during the "main" event starting shortly after 20:00 are pretty incomplete, at least at the moment).
Alright, first the good news: I was able to deploy the mediawiki changes to mw-debug and canary releases for one service (mw-api-int), and confirmed that (1) slow logs still work and (2) there's no obvious increase in error rates, etc.
Fri, Jun 21
If we expect that this particularly expensive dumps run is going to take a while, and as a result will cause db1206 to lag behind significantly, would it be possible / make sense to drop the nominal-weight on the instance to zero?
I've added a Maintenance section to the conftool wikitech page, which includes basic build / deploy guidance.
Thanks, @SGupta-WMF - So, it looks like the build-and-publish-production-image jobs are timing out. I suspect this is because they cannot be scheduled to the trusted runners.
Thu, Jun 20
Both patches to remove initialDelaySeconds have been applied without issue.
I believe that should be everything now. I'll follow up in T368096 for items related to migrating mediawiki to data-gateway.
@Eevans - Can you think of other blockers before mediawiki migrates?
For the actual migration:
Tue, Jun 18
Forgot to mention: there's still one straggler on 2.3.3 per debmonitor: elastic2099, which is down (T367598).
Well, that was a blast!
Mon, Jun 17
From T366851: We now understand the slow-client-startup issue to be the result of connection timeouts when new(er) versions of gocql attempt to connect to the full Cassandra cluster, while network policies prevent cross-DC connectivity. After some discussion about anticipated cross-DC demand, the latter have been relaxed and we should be good to remove the initialDelaySeconds.
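For context, the client-side alternative would have been pinning drivers to the local DC so they never attempt cross-DC connections. In the Python driver that looks roughly like this (a sketch; our services use gocql, and the contact point is hypothetical):

```python
# Sketch: a DC-aware Cassandra client that only routes queries to the
# local DC. Our services actually use gocql (Go); this shows the same
# idea via the Python driver. Hostname is hypothetical.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="eqiad"),
)
cluster = Cluster(
    contact_points=["cassandra-a.example.wmnet"],
    execution_profiles={EXEC_PROFILE_DEFAULT: profile},
)
session = cluster.connect()
```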
@SGupta-WMF - Thanks for letting me know. Given your CI configuration [0], if you could please tag a commit that exists in main (I'm assuming that's your only protected branch) as your initial release, that will produce an image we can use to move forward with turnup in staging.
The network policy and client configuration changes were applied over the course of a 90m window earlier today.
Fri, Jun 14
Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.
Alert works:
Thu, Jun 13
While the only significant functional changes in this release are for dbctl and requestctl (installed to a small number of hosts), I of course still need to update all hosts that have conftool installed.
Next steps:
- Let this soak for a bit and check the noise level of KubernetesDeploymentUnavailableReplicas [0] (currently warning, so it matches no receivers).
- Follow up with o11y about patterns for alert duplication across teams - i.e., currently this is applied to k8s-mlserve as well, but those should ideally be routed to team: ml rather than sre.
- Tune if needed and (eventually) promote to critical.
After chatting with I/F folks about the ~480 Kbps cross-DC flow, it seems this is small enough not to worry about.
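(For scale, assuming the flow is sustained: 480 Kbps is 60 KB/s, or roughly 5 GB/day.)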
Alas, this is unlikely to happen this week, given scheduling constraints. Now aiming for Tuesday the 18th (next week), starting at 14:00 UTC. I'll advertise this in IRC on Monday and again on Tuesday.
Jun 12 2024
Well, nothing says "progress" like a revert.
Jun 7 2024
Cool, this is probably a good opportunity to verify / fix the use of the datacentre values too (i.e., local_dc).
Nice find @brouberol and nice test @Eevans!
Jun 6 2024
Packages for buster, bullseye, and bookworm are ready, and have been copied over to apt1002.
Many thanks to @Eevans for humoring my experiments.
Based on the feedback so far, I believe we're done with the documentation changes then, so I'll resolve this :)
Adding an initialDelaySeconds (30) seems to have done the trick.
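For reference, the probe shape, expressed via the Kubernetes Python client (a sketch; the healthz path and port are assumptions on my part):

```python
# Sketch of a readiness probe with initialDelaySeconds, built with the
# official Kubernetes Python client. Path and port are assumptions.
from kubernetes import client

probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
    initial_delay_seconds=30,  # give the service time to finish startup
    period_seconds=10,
    failure_threshold=3,
)
```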
I just had a very interesting conversation with @Sfaci about the initialDelaySeconds recently added to AQS 2.0 services.
Thanks for taking a look, @hnowlan!
Jun 5 2024
The service is turned up in staging and was verified against the commons impact metrics dataset present in cassandra staging at the time (subsequently dropped to facilitate T364583).
Many thanks for reviewing, @Marostegui and @ABran-WMF. Also, neat idea writing a script to automate the steps, Arnaud!
Jun 4 2024
Thanks for taking a look, @Marostegui!
Thanks for the update, @SGupta-WMF - that's great!
Jun 3 2024
Added k8s secret for the commons_impact_analytics role to private puppet in 19d63430dbe1d6f4651bab4fe4162bbf3462e97e.
May 31 2024
All three items have been updated. Two points of note: