Scott_French (Scott French)
User

User Details

User Since
Jan 18 2024, 5:33 PM (25 w, 1 d)
Availability
Available
LDAP User
Scott French
MediaWiki User
SFrench-WMF [ Global Accounts ]

Recent Activity

Fri, Jul 12

Scott_French triaged T369932: sextant: support module garbage collection as Low priority.
Fri, Jul 12, 6:53 PM · serviceops
Scott_French created T369932: sextant: support module garbage collection.
Fri, Jul 12, 6:53 PM · serviceops
Scott_French triaged T369921: Support warmup for local caches in mw-on-k8s as Low priority.
Fri, Jul 12, 5:13 PM · Patch-For-Review, serviceops
Scott_French created T369921: Support warmup for local caches in mw-on-k8s.
Fri, Jul 12, 4:36 PM · Patch-For-Review, serviceops

Thu, Jul 11

Scott_French added a comment to T369745: Fix errors in Commons Analytics OpenAPI spec.

For the record: v1.0.3 is live in staging only (production is untouched), after it became apparent that additional changes are needed. If a 1.0.4 is available with an updated swagger spec, let me know and I'm happy to assist.

Thu, Jul 11, 7:29 PM · Patch-For-Review, Data Products (Data Products Sprint 16), Commons-Impact-Metrics, Documentation, AQS2.0
Scott_French closed T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production as Resolved.

Ah, great - thanks for confirming those older docs will go away, @mforns.

Thu, Jul 11, 5:04 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl.

+1 to using a more descriptive name for the resource operated on.

Thu, Jul 11, 3:44 PM · Patch-For-Review, SRE, Traffic

Wed, Jul 10

Scott_French added a comment to T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl.

Cool, it sounds like the conversation has evolved to using a dedicated schema, and we're on the same page that a multi-value set should work (to accommodate reason).

Wed, Jul 10, 9:30 PM · Patch-For-Review, SRE, Traffic
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Alright, good(er) news: the service is now live at /api/rest_v1/metrics/commons-analytics.

Wed, Jul 10, 4:24 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs.

Great, thank you very much @dcausse for cleaning up the old config and @Clement_Goubert for confirming.

Wed, Jul 10, 3:53 PM · Discovery-Search (Current work), Wikidata

Tue, Jul 9

Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Ah, thanks for surfacing that, @mforns.

Tue, Jul 9, 11:48 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl.

In short, and I realize this doesn't help much, my understanding is that what makes sense as an object name vs. an object tag is really up to you (e.g., ergonomics of tag selectors for common operations).

Tue, Jul 9, 9:49 PM · Patch-For-Review, SRE, Traffic
Scott_French updated subscribers of T361935: Adapt the WDQS Streaming Updater to update multiple WDQS subgraphs.

Found my way here via the highly informative "wdqs-streaming-updater-test-T361935/WMF" user-agent you've used. Thanks for that!

Tue, Jul 9, 8:01 PM · Discovery-Search (Current work), Wikidata
Scott_French added a comment to T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl.

Ah, interesting - I wasn't aware of the prior art with dnsbox. Indeed, reusing node for a fundamentally "host shaped thing" where (1) you anticipate eventually using as-yet unused fields and (2) do not anticipate ever needing to enrich node with new fields, seems less concerning.

Tue, Jul 9, 6:27 PM · Patch-For-Review, SRE, Traffic
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@SGupta-WMF - thanks for documenting the API at [0]. One thing I noticed while updating wikitech: it looks like the examples assume the service is reachable at /api/rest_v1/metrics/commons-impact-analytics rather than /api/rest_v1/metrics/commons-impact.

Tue, Jul 9, 5:33 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Tue, Jul 9, 5:28 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Alright, good news: /api/rest_v1/metrics/commons-impact should now be publicly available.

Tue, Jul 9, 5:22 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Tue, Jul 9, 5:14 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Tue, Jul 9, 4:30 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T369366: Migrate DNS depooling of sites from operations/dns (git) to confctl.

Thanks for the excellent / detailed write-up, @ssingh!

Tue, Jul 9, 3:58 PM · Patch-For-Review, SRE, Traffic
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@mforns - The v1.0.2 image is now live in staging. Please take a look when you get a chance, and let me know if / when you'd like me to proceed with the remaining steps.

Tue, Jul 9, 2:57 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Mon, Jul 8

Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@mforns - The v1.0.1 image is now live in staging. As before, it can be reached internally at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443/ from any production host.

Mon, Jul 8, 3:54 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Tue, Jul 2

Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@mforns sure, that's no problem at all! Just let me know when the image is ready.

Tue, Jul 2, 7:59 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks for taking a look, @xcollazo. I'll defer to @mforns and @SGupta-WMF here, as my quick check was only based on comparison with [0] (which uses timestamp in the public API).

Tue, Jul 2, 5:01 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks for the sample data, @xcollazo.

Tue, Jul 2, 3:52 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Mon, Jul 1

Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

The service is up and running in staging, and can be reached at https://commons-impact-analytics.k8s-staging.discovery.wmnet:30443 internally.

Mon, Jul 1, 5:57 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Mon, Jul 1, 5:27 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Fri, Jun 28

Scott_French added a comment to T350656: dbconfig bug - "2 instances found for query ...".

I came across this today while looking for prior art on a semi-related theme (invariants to assert on etcd key structure).

Fri, Jun 28, 7:43 PM · Data-Persistence, Patch-For-Review, conftool
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Fri, Jun 28, 4:12 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks so much, @SGupta-WMF.

Fri, Jun 28, 4:11 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Thu, Jun 27

Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks for giving that a try, @mforns !

Thu, Jun 27, 9:05 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Also, I see you hit retry on the failed pipeline. Unfortunately, I just checked in with Ahmon, and a retry is not going to work (sorry about that), as it will only re-run the failed job as-is rather than the full pipeline. It's the latter (full pipeline) that's required after

Thu, Jun 27, 7:14 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T362978: Update all helm modules and charts to be compatible with the restricted PSS.
Thu, Jun 27, 6:24 PM · Patch-For-Review, serviceops, Prod-Kubernetes
Scott_French added a comment to T362978: Update all helm modules and charts to be compatible with the restricted PSS.

mw-debug and canaries were updated around 17:23 UTC (alas, I hit enter before adding the message on that scap invocation) and main releases around 17:45. About 40m on, things continue to look good - general service health looks fine, slow-logs work, and dashboards picked up the envoy metrics with the change in cluster name.

Thu, Jun 27, 6:24 PM · Patch-For-Review, serviceops, Prod-Kubernetes

Wed, Jun 26

Scott_French added a comment to T355256: requestctl should fail with error if fails parsing yaml file.

@Clement_Goubert FYI, this was released last week.

Wed, Jun 26, 10:14 PM · conftool, serviceops
Scott_French closed T366932: Alerting on under-scaled deployments as Resolved.

Alright, this is now live, and should be sufficient to catch future instances of the scenario that originally motivated this task (missing quota) and more general scenarios that manifest as persistent pod unavailability.

Wed, Jun 26, 10:11 PM · serviceops
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@SGupta-WMF - Ahmon merged [0] this morning, so you should be good to go. If you could please hit Retry on the failed pipeline for v1.0.0 [1] that would be greatly appreciated (I can't, since I'm not a member of the project).

Wed, Jun 26, 6:55 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@VirginiaPoundstone - I believe there was one tick mark missing (fixed), but otherwise yes, those are up to date. Once an image is available, we should be able to work through the remainder in short order (the patches are already prepared).

Wed, Jun 26, 3:38 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Wed, Jun 26, 3:34 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Tue, Jun 25

Scott_French added a comment to T366932: Alerting on under-scaled deployments.

Checking in on the last week of alerts, I'm seeing:

  1. Occasional unavailability of a zotero pod in eqiad, as described in T366932#9894216 - i.e., one pod gets very busy (high CPU, etc.) and fails readiness checks for 30-40m (while passing liveness checks).
  2. A trickle of mw pods in eqiad, especially apparent on mw-jobrunner (peaking at 3 pods), failing mediawiki-main-httpd readiness probes (healthz) for ~ 18h after the first of the two (6/22, 6/24) commons-dumps incidents.
Tue, Jun 25, 7:40 PM · serviceops
Scott_French added a comment to T362978: Update all helm modules and charts to be compatible with the restricted PSS.

Ah, perfect! Thank you @JMeybohm - search-grafana-dashboards.js uncovered one more dashboard to migrate, and an older one that could be deleted (apple-search). I also decided to go ahead and update "mw on k8s - WIP ServiceOps" since it's easy to stumble upon by accident.

Tue, Jun 25, 4:31 PM · Patch-For-Review, serviceops, Prod-Kubernetes
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks, @SGupta-WMF! Ahmon tends to be quite responsive to reviewing these, and I see he's already on it. I'll keep an eye out for it to be merged today.

Tue, Jun 25, 2:40 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Mon, Jun 24

Scott_French added a comment to T362978: Update all helm modules and charts to be compatible with the restricted PSS.

I've manually updated prometheus queries that previously limited envoy_cluster_name to "local_service" to be compatible with the new naming scheme on the following MW-related dashboards:

  • SRE Service Operations > mw-on-k8s Overview
  • Service > MediaWiki on k8s
  • Service > mw-api-ext
  • Service > mw-api-int
  • Service > mw-jobrunner
  • Service > mw-parsoid
  • Service > mw-web
Mon, Jun 24, 11:48 PM · Patch-For-Review, serviceops, Prod-Kubernetes
Scott_French added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

I just spot-checked some of the runPrimaryTransactionIdleCallbacks logs that made it through the 99% throttle in logstash during the "precursor" event starting shortly after 19:30 UTC (logs during the "main" event starting shortly after 20:00 are pretty incomplete, at least at the moment).

Mon, Jun 24, 10:48 PM · Data Products (Data Products Sprint 16), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Data-Engineering, Dumps-Generation, SRE
Scott_French added a comment to T362978: Update all helm modules and charts to be compatible with the restricted PSS.

Alright, first the good news: I was able to deploy the mediawiki changes to mw-debug and canary releases for one service (mw-api-int), and confirmed that (1) slow logs still work and (2) no obvious increase in error rates etc.

Mon, Jun 24, 6:41 PM · Patch-For-Review, serviceops, Prod-Kubernetes

Fri, Jun 21

Scott_French added a comment to T368098: Dumps generation without prefetch cause disruption to the production environment.

If we expect that this particularly expensive dumps run is going to take a while, and as a result will cause db1206 to lag behind significantly, would it be possible / make sense to drop the nominal-weight on the instance to zero?

Fri, Jun 21, 11:19 PM · Data Products (Data Products Sprint 16), MW-1.43-notes (1.43.0-wmf.11; 2024-06-25), Patch-For-Review, Data-Engineering, Dumps-Generation, SRE
Scott_French added a comment to T367921: Improve build / release procedure for conftool.

I've added a Maintenance section to the conftool wikitech page, which includes basic build / deploy guidance.

Fri, Jun 21, 6:19 PM · conftool
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks, @SGupta-WMF - So, it looks like the build-and-publish-production-image jobs are timing out. I suspect this is because they cannot be scheduled to the trusted runners.

Fri, Jun 21, 3:20 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Thu, Jun 20

Scott_French added a comment to T366851: gocql startup times have increased between v1.2.0 and v1.6.0.

Both patches to remove initialDelaySeconds have been applied without issue.

Thu, Jun 20, 9:29 PM · Cassandra
Scott_French closed T364921: Commons Impact Metrics: Data Gateway endpoints as Resolved.

I believe that should be everything now. I'll follow up in T368096 for items related to migrating mediawiki to data-gateway.

Thu, Jun 20, 6:47 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French triaged T368096: mediawiki: migrate from image-suggestion to data-gateway as Low priority.
Thu, Jun 20, 6:46 PM · Cassandra, serviceops
Scott_French updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Thu, Jun 20, 6:46 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French closed T364921: Commons Impact Metrics: Data Gateway endpoints, a subtask of T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production, as Resolved.
Thu, Jun 20, 6:45 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

@Eevans - Can you think of other blockers before mediawiki migrates?

Thu, Jun 20, 6:43 PM · Cassandra, serviceops
Scott_French added a comment to T368096: mediawiki: migrate from image-suggestion to data-gateway.

For the actual migration:

Thu, Jun 20, 6:39 PM · Cassandra, serviceops
Scott_French created T368096: mediawiki: migrate from image-suggestion to data-gateway.
Thu, Jun 20, 6:37 PM · Cassandra, serviceops
Scott_French updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Thu, Jun 20, 6:34 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French closed T365123: Make dbctl check for depooled future masters as Resolved.

Alright, this should all be complete. I'll follow up in T367921 about the build / release instructions, and keep an eye on T367598 for the return of elastic2099. Many thanks to @Joe for already having taken care of T367919.

Thu, Jun 20, 4:35 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool
Scott_French triaged T367921: Improve build / release procedure for conftool as Low priority.
Thu, Jun 20, 4:35 PM · conftool
Scott_French claimed T367921: Improve build / release procedure for conftool.
Thu, Jun 20, 4:34 PM · conftool
Scott_French closed T365123: Make dbctl check for depooled future masters, a subtask of T362786: Enable dbctl for parsercache, as Resolved.
Thu, Jun 20, 4:34 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool

Wed, Jun 19

Scott_French committed rOSCTb06ae2cc1ca7: drivers/etcd: only attempt to load existing configs.
drivers/etcd: only attempt to load existing configs
Wed, Jun 19, 6:26 AM
Scott_French updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Wed, Jun 19, 1:40 AM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE

Tue, Jun 18

Scott_French added a comment to T365123: Make dbctl check for depooled future masters.

Forgot to mention: there's still one straggler on 2.3.3 per debmonitor: elastic2099, which is down (T367598).

Tue, Jun 18, 8:38 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool
Scott_French created T367921: Improve build / release procedure for conftool.
Tue, Jun 18, 8:27 PM · conftool
Scott_French created T367919: Avoid error logging while searching configs during normal operation.
Tue, Jun 18, 8:08 PM · Data-Persistence, conftool
Scott_French added a comment to T365123: Make dbctl check for depooled future masters.

Well, that was a blast!

Tue, Jun 18, 8:07 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool

Mon, Jun 17

Scott_French updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Mon, Jun 17, 11:31 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

From T366851: We now understand the slow-client-startup issue to be the result of connection timeouts when new(er) versions of gocql attempt to connect to the full Cassandra cluster, while network policies prevent cross-DC connectivity. After some discussion about anticipated cross-DC demand, the latter have been relaxed and we should be good to remove the initialDelaySeconds.

Mon, Jun 17, 11:28 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

@SGupta-WMF - Thanks for letting me know. Given your CI configuration [0], if you could please tag a commit that exists in main (I'm assuming that's your only protected branch) as your initial release, that will produce an image we can use to move forward with turnup in staging.

Mon, Jun 17, 10:48 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T366851: gocql startup times have increased between v1.2.0 and v1.6.0.

The network policy and client configuration changes were applied over the course of a 90m window earlier today.

Mon, Jun 17, 8:38 PM · Cassandra

Fri, Jun 14

Scott_French added a comment to T352245: Migrate etcd::tlsproxy Nginx certs and etcd itself to PKI.

Thanks, @MoritzMuehlenhoff - you're absolutely right that we could decouple these.

Fri, Jun 14, 9:35 PM · serviceops
Scott_French added a comment to T366932: Alerting on under-scaled deployments.

Alert works:

Fri, Jun 14, 6:37 PM · serviceops

Thu, Jun 13

Scott_French added a comment to T365123: Make dbctl check for depooled future masters.

While the only significant functional changes in this release are for dbctl and requestctl (installed to a small number of hosts), I of course still need to update all hosts that have conftool installed.

Thu, Jun 13, 10:39 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool
Scott_French added a comment to T366932: Alerting on under-scaled deployments.

Next steps:

  • Let this soak for a bit and check the noise level of KubernetesDeploymentUnavailableReplicas [0] (currently warning, so it matches no receivers).
  • Follow up with o11y about patterns for alert duplication across teams - i.e., currently this is applied to k8s-mlserve as well, but those should ideally be routed to team: ml rather than sre.
  • Tune if needed and (eventually) promote to critical.
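
For illustration only, here is a minimal sketch of what "persistent pod unavailability" could mean as an alert condition: fire only when a deployment has reported unavailable replicas continuously for some hold period, so that short blips do not page anyone. The Sample type, the should_fire helper, and the 30-minute hold are hypothetical and are not taken from the actual KubernetesDeploymentUnavailableReplicas rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    minute: int       # minutes since the start of the observation window
    unavailable: int  # unavailable replicas reported at that time


def should_fire(samples: Iterable[Sample], hold_minutes: int = 30) -> bool:
    """Return True if unavailable > 0 has held continuously for at least
    hold_minutes, measured up to the most recent sample.

    hold_minutes is an assumed value chosen for this sketch.
    """
    ordered = sorted(samples, key=lambda s: s.minute)
    pending_since = None
    for s in ordered:
        if s.unavailable > 0:
            # Start the pending timer on the first bad sample of a streak.
            if pending_since is None:
                pending_since = s.minute
        else:
            # Any healthy sample resets the streak.
            pending_since = None
    if pending_since is None:
        return False
    return ordered[-1].minute - pending_since >= hold_minutes
```

A streak that is interrupted by a single healthy sample starts over, which is one way to keep the noise level down while the alert soaks at warning.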
Thu, Jun 13, 9:10 PM · serviceops
Scott_French added a comment to T366851: gocql startup times have increased between v1.2.0 and v1.6.0.

After chatting with I/F folks about the ~ 480Kbps cross-DC flow, it seems this is small enough not to worry about.

Thu, Jun 13, 7:06 PM · Cassandra
Scott_French added a comment to T365123: Make dbctl check for depooled future masters.

Alas, this is unlikely to happen this week within scheduling constraints. Now aiming for Tuesday the 18th (next week) starting at 14:00 UTC. I'll advertise this in IRC on Monday and again Tuesday.

Thu, Jun 13, 5:51 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool

Jun 12 2024

Scott_French added a comment to T366851: gocql startup times have increased between v1.2.0 and v1.6.0.

Well, nothing says "progress" like a revert.

Jun 12 2024, 9:52 PM · Cassandra

Jun 7 2024

Scott_French added a comment to T366851: gocql startup times have increased between v1.2.0 and v1.6.0.

Cool, this is probably a good opportunity to verify / fix the use of the datacentre values too (i.e., local_dc).

Jun 7 2024, 9:31 PM · Cassandra
Scott_French added a comment to T366851: gocql startup times have increased between v1.2.0 and v1.6.0.

Nice find @brouberol and nice test @Eevans!

Jun 7 2024, 8:57 PM · Cassandra
Scott_French changed the status of T366932: Alerting on under-scaled deployments from Open to In Progress.
Jun 7 2024, 8:11 PM · serviceops
Scott_French created T366932: Alerting on under-scaled deployments.
Jun 7 2024, 5:37 PM · serviceops

Jun 6 2024

Scott_French added a comment to T365123: Make dbctl check for depooled future masters.

Packages for buster, bullseye, and bookworm are ready, and have been copied over to apt1002.

Jun 6 2024, 10:52 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool
Scott_French updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Jun 6 2024, 9:08 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

Many thanks to @Eevans for humoring my experiments.

Jun 6 2024, 9:08 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French committed rOSCTde9ff969785d: Release 3.0.0.
Release 3.0.0
Jun 6 2024, 4:55 PM
Scott_French closed T365356: Document new parsercache failover process as Resolved.

Based on the feedback so far, I believe we're done with the documentation changes then, so I'll resolve this :)

Jun 6 2024, 4:41 PM · DBA
Scott_French closed T365356: Document new parsercache failover process, a subtask of T362786: Enable dbctl for parsercache, as Resolved.
Jun 6 2024, 4:41 PM · Patch-For-Review, Infrastructure-Foundations, Data-Persistence, conftool
Scott_French added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

Adding an initialDelaySeconds (30) seems to have done the trick.

Jun 6 2024, 3:46 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French updated subscribers of T364921: Commons Impact Metrics: Data Gateway endpoints.

I just had a very interesting conversation with @Sfaci about the initialDelaySeconds recently added to AQS 2.0 services.

Jun 6 2024, 3:21 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

Thanks for taking a look, @hnowlan!

Jun 6 2024, 2:54 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE

Jun 5 2024

Scott_French added a comment to T364921: Commons Impact Metrics: Data Gateway endpoints.

The service is turned up in staging and was verified against the commons impact metrics dataset present in cassandra staging at the time (subsequently dropped to facilitate T364583).

Jun 5 2024, 11:09 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French updated the task description for T364921: Commons Impact Metrics: Data Gateway endpoints.
Jun 5 2024, 10:48 PM · Data Products, Cassandra, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T365356: Document new parsercache failover process.

Many thanks for reviewing, @Marostegui and @ABran-WMF. Also, neat idea writing a script to automate the steps, Arnaud!

Jun 5 2024, 5:06 PM · DBA

Jun 4 2024

Scott_French added a comment to T365356: Document new parsercache failover process.

Thanks for taking a look, @Marostegui!

Jun 4 2024, 9:15 PM · DBA
Scott_French updated the task description for T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.
Jun 4 2024, 7:38 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE
Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Thanks for the update, @SGupta-WMF - that's great!

Jun 4 2024, 7:37 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

Jun 3 2024

Scott_French added a comment to T361835: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production.

Added k8s secret for the commons_impact_analytics role to private puppet in 19d63430dbe1d6f4651bab4fe4162bbf3462e97e.

Jun 3 2024, 7:08 PM · Data Products (Data Products Sprint 16), Patch-For-Review, serviceops, Service-deployment-requests, SRE

May 31 2024

Scott_French updated the task description for T359423: Migrate charts to Calico Network Policies.
May 31 2024, 10:48 PM · Patch-For-Review, Data-Platform-SRE, Prod-Kubernetes, Kubernetes, serviceops
Scott_French added a comment to T365356: Document new parsercache failover process.

All three items have been updated. Two points of note:

May 31 2024, 6:41 PM · DBA