User Details
- User Since
- Oct 8 2014, 5:48 PM (509 w, 2 d)
- Availability
- Available
- IRC Nick
- Milimetric
- LDAP User
- Milimetric
- MediaWiki User
- Milimetric (WMF) [ Global Accounts ]
Yesterday
OK, sent updated code. It's fast now thanks to a CACHE statement, but that doesn't change the query plan, which is still absolutely nuts. Check this out:
Quick spark-sql query to get link changes where someone tags a new wiki project on the talk page:
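The actual query isn't reproduced here, but as a rough sketch of the CACHE pattern mentioned above (table and column names below are hypothetical, not the real schema):

```sql
-- Sketch only: CACHE TABLE materializes an intermediate result so downstream
-- reads are fast, but it does not change the optimizer's plan for the queries
-- that consume it. Source table and columns are hypothetical.
CACHE TABLE talk_page_link_changes AS
SELECT page_id,
       revision_id,
       added_link
FROM link_changes              -- hypothetical source table
WHERE page_namespace = 1       -- talk pages
  AND added_link LIKE 'WikiProject%';

-- Subsequent queries read from the cached table.
SELECT added_link, COUNT(*) AS changes
FROM talk_page_link_changes
GROUP BY added_link;
```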
Thu, Jul 11
OK, so it seems most problems do indeed track back to not applying delete and restore events. It feels like we can mark this task complete. We can find a way to apply delete/restore/merge, and then run these queries again and see what we need to reconcile. The period I looked at above was 10 days of enwiki revisions. If anyone disagrees, do move this task back.
That one mismatch_page that has no other reason listed is apparently part of a merge, so if we're not following up on delete/restore properly then this makes perfect sense because merges are more complicated still. Here are the two pages involved and the logging table records for them:
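A lookup along these lines would pull those records (the page IDs below are placeholders; MediaWiki's logging table records delete, restore, and merge events per page):

```sql
-- Sketch only: placeholder page IDs stand in for the two pages involved.
-- log_type 'delete' covers both delete and restore actions; 'merge' covers
-- history merges.
SELECT log_type, log_action, log_timestamp, log_title, log_page
FROM logging
WHERE log_page IN (123, 456)          -- placeholder IDs
  AND log_type IN ('delete', 'merge')
ORDER BY log_timestamp;
```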
Ok, I think I got this query to make sense... the results:
Wed, Jul 10
I am still trying to find an elegant way to change the queries and show all this, but I just wanted to share results so far:
Apologies for the week delay here, I was out sick, picking it back up soon.
Tue, Jul 9
Wed, Jul 3
Quick summary of last meeting. Luke started working on a draft of what we were talking about (see the reconciliation flow on https://miro.com/app/board/uXjVNfaohl0=/).
Mon, Jul 1
My first hunch, that the revisions were coming from only specific pages, is wrong:
Wed, Jun 26
Are we literally saying that we should just change the value of statistics-users-active to Editors? Code here: https://gerrit.wikimedia.org/g/mediawiki/core/+/1aa990f1725bf81caaf44527b9e778b5a8fe7e4d/languages/i18n/en.json#1950
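If so, the change would be a one-line edit to that message value in languages/i18n/en.json, something like (exact wording TBD):

```json
"statistics-users-active": "Editors",
```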
Thanks for pinging us, we don't use abuse filter tables anywhere I'm aware of, so this shouldn't affect us.
Thanks for pinging us on this. The sqoop code should run without modification, so we're good downstream. Thank you!
Obligatory reference: https://www.mediawiki.org/wiki/Extension:NavigationTiming (is this roughly related?)
Tue, Jun 25
This is now done.
Great question, @mforns. This was mostly for performance reasons: I couldn't find a way to get Spark to work efficiently on the full day of pageviews without first aggregating it like this to > 250. But the execution plan I ended up with looks pretty wild. Let's talk tomorrow when you have some time. I'm attaching the change here.
Mon, Jun 24
I've migrated and shut off the old instances. I will delete them in a couple of days, just in case, but everything's working fine without them. I didn't know about the wmflabs -> wmcloud automatic redirect; that made everything very simple.
I grouped a couple of tasks under this so we're less likely to lose them in the fray.
Fri, Jun 21
The simpler way to do this, just two phases as opposed to progressive roll-up, gets us fairly similar results, with about 200 fewer rows, all of which detail specific browser versions.
We get a ton more detailed results this way, and the total coverage increases to 99.7%. Still not 99.9%, but I think we may have too much detail at some point. I'm fairly happy with these results, and I'm going to prepare the new browser general query as a Gerrit change. It'll be good to get some review.
This might affect some data we sqoop into HDFS and some of how we compute commons impact metrics or similar future metrics. We have to wait until a schema change is proposed to know for sure.
From a discussion with @Krinkle about the data, a preliminary idea of how to roll up is:
Thu, Jun 20
The long and the short of it is that we can get that "other" to about 2% if we simply roll up remaining data by browser family and os family. We could get fancier but let's see what folks think about just this approach.
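A hypothetical sketch of that roll-up: combinations above the reporting threshold keep full detail, and everything below it collapses to browser family and OS family (table and column names here are assumptions, not the actual schema):

```sql
-- Sketch only: rows at or above 0.1% of total views keep their detailed
-- browser/OS versions; everything else rolls up to family level.
WITH totals AS (
  SELECT SUM(view_count) AS total FROM daily_browser_counts
)
SELECT
  CASE WHEN view_count / total >= 0.001
       THEN browser_version ELSE browser_family END AS browser,
  CASE WHEN view_count / total >= 0.001
       THEN os_version ELSE os_family END AS os,
  SUM(view_count) AS views
FROM daily_browser_counts CROSS JOIN totals
GROUP BY 1, 2;
```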
Ok, so these rows represent more than 0.1% of all views for this day and they're aggregated all the way down, so this is what we were dumping before, followed by a big "Other" bucket:
OK, I have some results for us to peruse from rolling up in different ways. First of all, here's my query, so we can debate whether it's accurate.
Mon, Jun 17
Analysis available in this spreadsheet: https://docs.google.com/spreadsheets/d/1iSlH5XsRXV7mDoku0F5HbLNJmx1CMBm6ECakZMPUbU8/edit?usp=sharing
Made up some slides to help think about this data:
Jun 12 2024
Jun 11 2024
This looks like a great system to get started with. I can think of some potential snags that come up, so as we build it let's keep an eye out for these and similar:
Jun 7 2024
select day, http_status, count(*) as count_by_status
from pageview_actor
where year = 2024
  and month = 4
  and day in (19, 26)
  and geocoded_data['country_code'] = 'HK'
  and normalized_host.project_class = 'wikifunctions'
group by day, http_status
| day | http_status | count_by_status |
|-----|-------------|-----------------|
| 19  | 200         | 389             |
| 19  | 301         | 4311            |
| 19  | 302         | 931             |
| 26  | 200         | 198             |
| 26  | 301         | 133801          |
| 26  | 302         | 1028            |
Jun 6 2024
Jun 5 2024
Jun 4 2024
https://mpic.svc.eqiad.wmnet:30443/ is the endpoint we should use to talk to the mpic service, making sure we don't take unnecessary hops through DNS.
May 31 2024
Hypothesis so far: maybe some workers get MaxMind updates on a staggered schedule relative to others, so there's always some variation?
(and sub-country it's much worse)
May 29 2024
The two instances have been moved, docs updated on wiki and in code, and proxies have been moved. The only problem is the new proxies can't use the old wmflabs.org domain. For now, I left the old proxies up and additionally set up the new proxies. So, for example, both https://pingback.wmflabs.org/ and https://pingback.wmcloud.org/ work. Whenever the old instances are deleted, the old URLs will stop working. I guess part of sign-off will be to communicate this and maybe delete the old instances?
Keeping track of how I do this for future reference. (The previous task where I did this was T236586, and I failed to take good notes there.)
May 23 2024
https://gitlab.wikimedia.org/repos/data-engineering/mpic/-/merge_requests/31 includes both the Menu Button (sorry, I couldn't find a task for this) and the Multi-lookup, as well as using them from one field each. I'm happy to help with this more when I'm back next week.
May 16 2024
Quick update: I resolved with Eric to work on this as a separate component. Will start on a patch now, keeping it in the Codex sandbox for now with T363432 as the goal.
May 15 2024
May 9 2024
May 1 2024
OK, I didn't do much here: just provided a very short description and detailed the schemas as Marcel had them in the design doc. Please let me know if anyone was imagining something else.
Filling out the README right now, thanks Ben!
I am not sure this is 100% squashed because the behavior is so weird. Here's what I found, in short:
Apr 18 2024
It would be cool to do a quick spike into Scalar and the customization we'd need there. I'll abstain as a voter here; I like all the options just fine, and I have bad aesthetics when it comes to reading docs because I just start hacking and see what happens :)
+1 for Option 2. For what it's worth, when we initially put up the endpoint docs on wikitech, we were just doing so while we waited for a better end-user experience than the Swagger UI afforded us. I especially like the integration with wikitech described in Option 2 (the discovery pages that would lead wiki users to the docs).
Apr 16 2024
+1. SSR is kind of a pain if done in fancier ways, but this way you get a lot for free, and it even helps reduce code. As a bonus, the user gets a great experience.
Apr 15 2024
I've broken this down into subtasks, but I'm keeping it as something between an epic and an actual task: it's the coordinating piece and has all the acceptance criteria; it was just too big. So I'll leave the other two subtasks on the boards while I'm on vacation and put this in paused. This can be resumed whenever you'd like to continue work on coordination and deployment.