⚓ T264885 Investigate table schema / DB abstraction layer for section/comment watching/unwatching

Subject	Repo	Branch	Lines +/-
Topic subscription action API	mediawiki/extensions/DiscussionTools	master	+131 -3
Service to interact with topic subscriptions	mediawiki/extensions/DiscussionTools	master	+424 -0
Create table for topic subscriptions	mediawiki/extensions/DiscussionTools	master	+209 -0

Status	Assigned	Task
Open	None	T276497 Scale DiscussionTools to all projects
Open	None	T251207 [Epic] Scale DiscussionTools to all Wikipedias
Resolved	ppelberg	T284489 Deploy topic subscriptions (desktop) as opt-out feature at all projects
Resolved	ppelberg	T284490 Offer topic subscriptions as a Beta Feature at all projects
Resolved	ppelberg	T287276 Offer manual topic subscriptions as Beta Feature at all projects
Open	None	T233447 [OKR] Increase response rates
Open	None	T273920 [Interventions] Notifications
Open	None	T263820 [Intervention] Manual topic subscriptions
Resolved	• LZaman	T274188 Offer manual topic subscriptions as a Beta Feature at partner wikis
Resolved	ppelberg	T274216 Build topic subscriptions v1.0
Resolved	• Esanders	T264885 Investigate table schema / DB abstraction layer for section/comment watching/unwatching

• Esanders created this task.Oct 7 2020, 2:00 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 7 2020, 2:00 PM

• Esanders moved this task from Incoming to Ready to Be Worked On on the Editing-team (Kanban Board) board.Oct 7 2020, 2:00 PM

Having never had to really interact with table structure in mediawiki before, here's what I've worked out:

Mediawiki extension SQL tables

Create a SQL patch file containing all your CREATE TABLE / INDEX statements
In extension.json hook LoadExtensionSchemaUpdates( DatabaseUpdater $updater )
That hook should call $updater->addExtensionTable( 'your_table_name', $pathToYourSQLFile );
Make sure that update.php is run, which should create the table from that SQL
- I think this is part of the regular train deploy, but might need to double-check
- later tagging in of DBA to review the schema once we're ready would confirm that

If you're doing something complicated, $updater->getDB()->getType() can be used to provide database-specific SQL patch files. Core code is supposed to support MySQL, Postgres, and SQLite, but it looks like extensions might get some leeway. The production servers are running MySQL / MariaDB.

In the future, for any changes, be sure to follow the production schema update guide for added scrutiny. There are a lot of functions in DatabaseUpdater which can be used to describe a change you're going to make to your schema, which will run an appropriate SQL patch file if that change looks like it's not yet made.

DiscussionTools specific notification schema

This is a loosely sketched schema:

CREATE TABLE /*_*/discussiontools_subscription (
	-- Page ID of the page the user is subscribing to
	sub_page_id int unsigned not null,
	-- Reference to the specific section being subscribed to
	sub_page_section varchar(64) binary not null,
	-- Key to user_id
	sub_user int unsigned not null,
	-- Timestamp for when the subscription was added
	sub_timestamp binary(14) not null,
	PRIMARY KEY (sub_page_id, sub_page_section, sub_user)
) /*$wgDBTableOptions*/;

-- Index to get the user list to notify given a change to a specific section
CREATE INDEX /*i*/discussiontools_subscription_users_to_notify ON /*_*/discussiontools_subscription (sub_page_id, sub_page_section);
-- Index to find all subscriptions for a given page
CREATE INDEX /*i*/discussiontools_subscription_page_id ON /*_*/discussiontools_subscription (sub_page_id);

Thoughts:

I'm not sure how long a section name we should support -- though I guess we could just hash it into the varchar if need be.
Schema has a creation date, but should probably have an index on that if we want to list the users in that order. It feels a bit awkward to wind up with a table with an index that encompasses every column, though.
Expiration date: custom or fixed? If fixed, the existing creation date effectively counts.
If expiring, do we want to clean up the table of those entries, or leave them around?

Strongly related is the Echo schema.

I'm not sure how long a section name we should support -- though I guess we could just hash it into the varchar if need be.

For comment IDs we should be able to calculate a limit of 2 * ( max username length + timestamp length + spacers )

For heading IDs, there might be a limit somewhere on the length of the heading ID hash attribute (possibly because it would generate too-long URLs otherwise)?

Schema has a creation date, but should probably have an index on that if we want to list the users in that order.

I don't know what the best practice is here, but if we don't have a use case, maybe we should not add the index yet.

Expiration date: custom or fixed? If fixed, the existing creation date effectively counts.

I think for version 1 we can ignore expiration.

If expiring, do we want to clean up the table of those entries, or leave them around?

We should see what is happening with watchlist expiration

	-- Page ID of the page the user is subscribing to
	sub_page_id int unsigned not null,

I think this would be better as a page_namespace + page_title pair. page_id can change when a page is deleted and restored. Also, if the page is archived by moving it (https://en.wikipedia.org/wiki/Help:Archiving_a_talk_page/Other_procedures), the page_id will now refer to the archive, which is probably undesirable.

	-- Key to user_id
	sub_user int unsigned not null,

Might be better to make this a key to actor_id, in case we ever implement notifications for logged-out users?

@matmarex So, my logic on both of those fields was to look at the echo_event table and just copy how they were doing it. However, I think you're right that our case differs. Instead we should be copying watchlist:

CREATE TABLE /*_*/watchlist (
  wl_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- Key to user.user_id
  wl_user int unsigned NOT NULL,

  -- Key to page_namespace/page_title
  -- Note that users may watch pages which do not exist yet,
  -- or existed in the past but have been deleted.
  wl_namespace int NOT NULL default 0,
  wl_title varchar(255) binary NOT NULL default '',

  -- Timestamp used to send notification e-mails and show "updated since last visit" markers on
  -- history and recent changes / watchlist. Set to NULL when the user visits the latest revision
  -- of the page, which means that they should be sent an e-mail on the next change.
  wl_notificationtimestamp varbinary(14)

) /*$wgDBTableOptions*/;

@DLynch This looks good. Let's start on the "Create interfaces" part of this with the schema you and Bartosz described. We can tweak it as we go if necessary.

The combined patch file:

CREATE TABLE /*_*/discussiontools_subscription (
	-- Key to page_namespace/page_title
	-- Note that users may watch pages which do not exist yet,
	-- or existed in the past but have been deleted.
	sub_namespace int NOT NULL default 0,
	sub_title varchar(255) binary NOT NULL default '',	-- Reference to the specific section being subscribed to
	sub_section varchar(64) binary not null,
	-- Key to user_id
	sub_user int unsigned not null,
	-- Timestamp for when the subscription was added
	sub_timestamp varbinary(14) not null,
	PRIMARY KEY (sub_namespace, sub_title, sub_section, sub_user)
) /*$wgDBTableOptions*/;

-- Index to get the user list to notify given a change to a specific section
CREATE INDEX /*i*/discussiontools_subscription_users_to_notify ON /*_*/discussiontools_subscription (sub_namespace, sub_title, sub_section);
-- Index to find all subscriptions for a given page
CREATE INDEX /*i*/discussiontools_subscription_page ON /*_*/discussiontools_subscription (sub_namespace, sub_title);

I sort of feel that PRIMARY KEY (sub_namespace, sub_title, sub_section, sub_user) might be a bit much. That said, we'll definitely need that as an index at least, so it's maybe reasonable. Can let DBA comment on whether an auto-increment sub_id would be better later. 🤷🏻‍♂️

I think an auto-incrementer would be better, and is what watchlist is using. It has benefits for paging and possibly sharding.

ppelberg moved this task from Ready to Be Worked On to Doing on the Editing-team (Kanban Board) board.Nov 18 2020, 4:08 PM

ppelberg subscribed.

Change 659080 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/extensions/DiscussionTools@master] WIP Subscribe API

https://gerrit.wikimedia.org/r/659080

gerritbot added a project: Patch-For-Review.Jan 27 2021, 8:47 PM

Change 660471 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@master] WIP Create table for subscribing to sections

https://gerrit.wikimedia.org/r/660471

DLynch mentioned this in T260371: Develop an approach for storing thread subscriptions.Feb 9 2021, 5:40 PM

DLynch mentioned this in T263817: DBA review: conversation subscriptions.

ppelberg added a parent task: T274216: Build topic subscriptions v1.0 .Feb 9 2021, 5:43 PM

ppelberg mentioned this in T274834: Make it so comments are related to one another in talk page data structure.Feb 16 2021, 1:10 AM

Change 664498 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@master] WIP: Service to interact with topic subscriptions

https://gerrit.wikimedia.org/r/664498

ppelberg reassigned this task from DLynch to • Esanders.Feb 24 2021, 5:28 PM

ppelberg added a subscriber: DLynch.

ppelberg mentioned this in T275729: Comment permalinks: design and implement user experience.Feb 25 2021, 3:41 AM

24 February team meeting notes
Today, we met as a team to discuss what approaches are available for creating unique topic identifiers. The notes from that conversation, the conversation Ed and I had on 23 February, and the resulting actions from both, are below.

Actions

@ppelberg to consult with volunteers at Talk:Talk_pages_project/Notifications to learn about edge cases we ought to consider when deciding upon an approach to creating unique topic identifiers.
@Esanders: can you share more about the idea of enhancing signatures so they are precise to the second rather than the minute? This might exist in a ticket already, I'm just not sure which one.

Notes
Situation

For people to receive notifications when new comments are posted in a given topic, each topic needs to be assigned a unique identifier.
The "ingredients" we have available to make those unique identifiers are as follows:
- (1) Page title
- (2) Topic title
- (3) Author of the topic's username
- (4) The time the topic was published (to the nearest minute)
Broadly speaking, there are two approaches we can take for using the above "ingredients" to create unique identifiers.

Approaches

Approach A): use 1, 2 and 3 and/or 4.
- Values:
  - Reliability at the expense of flexibility.
- Advantages:
  - Low risk of people missing notifications or receiving notifications they are not expecting.
- Disadvatnages:
  - Limits our ability to offer editors comment permalinking (T275729)
- Failure case(s)
  - Subscriptions will break when a topic is renamed and/or moved.

Approach B): use 3 and 4 and decouple topics [and comments in the future] to be decoupled from the pages on which they exist.
- Values:
  - Flexibility at the expense of reliability.
- Advantages:
  - Subscriptions remain intact when a topic is renamed and/or moved.
- Disadvantages:
  - Potentially, less reliable in the interim
- Failure cases:
  - If Person A starts multiple topics within the same minute, it's possible for someone who subscribed to one of those topics (call them Person B) to start and stop receiving notifications without them explicitly deciding for this to happen.
    - Examples: RfCs and en.wiki's Request for adminship process where multiple [sub-] sections are often created within the same edit to help structured the conversation (see: /TJMSmith). //Note: this may not be problematic considering topic subscriptions are currently limited to ==H2==.
- Considerations
  - There is the potential to reduce the likelihood of the "failure case" above by adjusting signatures so they are precise to the second rather than to the minute, as they are currently.

In T264885#6859503, @ppelberg wrote:

Actions

@ppelberg to consult with volunteers at Talk:Talk_pages_project/Notifications to learn about edge cases we ought to consider when deciding upon an approach to creating unique topic identifiers.

In progress; see: https://w.wiki/32eJ.

matmarex mentioned this in T275881: Don't search for signatures in blockquote elements.Feb 26 2021, 8:05 PM

ppelberg mentioned this in T263820: [Intervention] Manual topic subscriptions.Mar 2 2021, 10:27 PM

I think we should move forward with the approach that allows topics, and comments in the future, to be decoupled from the pages on which they exist.

Reasons:

We think this approach will result in fewer false negatives (notifications not being sent when people expect them to be). [i]
- Underlying assumption: it will be more likely for someone to change the name of a section (thereby breaking existing subscriptions were we to take "Approach A)") than it will be for someone to create two == H2 == level sections in the same minute.
This approach will enable us to offer/expose links to topics (T273341) and comments (T274685) that remain functional even when said topics and comments are moved to different pages (read: archived).
Assuming I'm accurately understanding what @Esanders alluded to in the meeting we had on 24-February (see: T264885#6859503), this approach enables us to offer editors the ability to, at some not-yet-defined point in time, to make topic subscriptions more reliable/resilient by making comment signatures precise to the second.

i. Note: this approach could result in a greater number of false positives (notifications being sent when people do NOT expect them to be). An example of the kind of "false positive" the "decoupling approach" could cause is: "Person A starting Topic 1 and Topic 2 within the same minute. Person B subscribing to a Topic 1 and, unknowingly, becoming subscribed to Topic 2." We are okay with accepting this possibility considering if it were to occur, there are steps we can take to address it (see "3." above).

ppelberg mentioned this in T273342: Introduce a way to view and manage all of the topics you are subscribed to.Mar 5 2021, 11:36 PM

Next steps

1. Engineering to review and merge stack
2. Engineering to ping DBA with ask for review
- 2a. @ppelberg to link to DBA review ticket [if it exists]

In T264885#6961457, @ppelberg wrote:

Next steps

1. Engineering to review and merge stack

2. Engineering to ping DBA with ask for review

2a. @ppelberg to link to DBA review ticket [if it exists]