Investigate table schema / DB abstraction layer for section/comment watching/unwatching
Closed, ResolvedPublic

Description

Notes

  • Create table with user/title/sectionname
  • TBD: Expiration date of entries
  • Build an actual table schema based on this and the watchlist table (or Echo?)
  • Create interfaces (services, records?)
    • Watch this section [1]
    • Unwatch this section [1]
    • Check if I'm watching this section [1]
    • List all users watching this section
  1. Should also have a public-facing API at some point

Could create a special page to list all watched threads. Maybe not required for v1?

Done

  • Once this ticket is resolved, work on T263817 should be prioritized.

Related Objects

Event Timeline

Having never had to really interact with table structure in mediawiki before, here's what I've worked out:

Mediawiki extension SQL tables

  1. Create a SQL patch file containing all your CREATE TABLE / INDEX statements
  2. In extension.json hook LoadExtensionSchemaUpdates( DatabaseUpdater $updater )
  3. That hook should call $updater->addExtensionTable( 'your_table_name', $pathToYourSQLFile );
  4. Make sure that update.php is run, which should create the table from that SQL
    • I think this is part of the regular train deploy, but might need to double-check
    • later tagging in of DBA to review the schema once we're ready would confirm that

If you're doing something complicated, $updater->getDB()->getType() can be used to provide database-specific SQL patch files. Core code is supposed to support MySQL, Postgres, and SQLite, but it looks like extensions might get some leeway. The production servers are running MySQL / MariaDB.

In the future, for any changes, be sure to follow the production schema update guide for added scrutiny. There are a lot of functions in DatabaseUpdater which can be used to describe a change you're going to make to your schema, which will run an appropriate SQL patch file if that change looks like it's not yet made.

DiscussionTools specific notification schema

This is a loosely sketched schema:

CREATE TABLE /*_*/discussiontools_subscription (
	-- Page ID of the page the user is subscribing to
	sub_page_id int unsigned not null,
	-- Reference to the specific section being subscribed to
	sub_page_section varchar(64) binary not null,
	-- Key to user_id
	sub_user int unsigned not null,
	-- Timestamp for when the subscription was added
	sub_timestamp binary(14) not null,
	PRIMARY KEY (sub_page_id, sub_page_section, sub_user)
) /*$wgDBTableOptions*/;

-- Index to get the user list to notify given a change to a specific section
CREATE INDEX /*i*/discussiontools_subscription_users_to_notify ON /*_*/discussiontools_subscription (sub_page_id, sub_page_section);
-- Index to find all subscriptions for a given page
CREATE INDEX /*i*/discussiontools_subscription_page_id ON /*_*/discussiontools_subscription (sub_page_id);

Thoughts:

  • I'm not sure how long a section name we should support -- though I guess we could just hash it into the varchar if need be.
  • Schema has a creation date, but should probably have an index on that if we want to list the users in that order. It feels a bit awkward to wind up with a table with an index that encompasses every column, though.
  • Expiration date: custom or fixed? If fixed, the existing creation date effectively counts.
  • If expiring, do we want to clean up the table of those entries, or leave them around?

Strongly related is the Echo schema.

I'm not sure how long a section name we should support -- though I guess we could just hash it into the varchar if need be.

For comment IDs we should be able to calculate a limit of 2 * ( max username length + timestamp length + spacers )

For heading IDs, there might be a limit somewhere on the length of the heading ID hash attribute (possibly because it would generate too-long URLs otherwise)?

Schema has a creation date, but should probably have an index on that if we want to list the users in that order.

I don't know what the best practice is here, but if we don't have a use case, maybe we should not add the index yet.

Expiration date: custom or fixed? If fixed, the existing creation date effectively counts.

I think for version 1 we can ignore expiration.

If expiring, do we want to clean up the table of those entries, or leave them around?

We should see what is happening with watchlist expiration

	-- Page ID of the page the user is subscribing to
	sub_page_id int unsigned not null,

I think this would be better as a page_namespace + page_title pair. page_id can change when a page is deleted and restored. Also, if the page is archived by moving it (https://en.wikipedia.org/wiki/Help:Archiving_a_talk_page/Other_procedures), the page_id will now refer to the archive, which is probably undesirable.

	-- Key to user_id
	sub_user int unsigned not null,

Might be better to make this a key to actor_id, in case we ever implement notifications for logged-out users?

@matmarex So, my logic on both of those fields was to look at the echo_event table and just copy how they were doing it. However, I think you're right that our case differs. Instead we should be copying watchlist:

CREATE TABLE /*_*/watchlist (
  wl_id int unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  -- Key to user.user_id
  wl_user int unsigned NOT NULL,

  -- Key to page_namespace/page_title
  -- Note that users may watch pages which do not exist yet,
  -- or existed in the past but have been deleted.
  wl_namespace int NOT NULL default 0,
  wl_title varchar(255) binary NOT NULL default '',

  -- Timestamp used to send notification e-mails and show "updated since last visit" markers on
  -- history and recent changes / watchlist. Set to NULL when the user visits the latest revision
  -- of the page, which means that they should be sent an e-mail on the next change.
  wl_notificationtimestamp varbinary(14)

) /*$wgDBTableOptions*/;

@DLynch This looks good. Let's start on the "Create interfaces" part of this with the schema you and Bartosz described. We can tweak it as we go if necessary.

The combined patch file:

CREATE TABLE /*_*/discussiontools_subscription (
	-- Key to page_namespace/page_title
	-- Note that users may watch pages which do not exist yet,
	-- or existed in the past but have been deleted.
	sub_namespace int NOT NULL default 0,
	sub_title varchar(255) binary NOT NULL default '',	-- Reference to the specific section being subscribed to
	sub_section varchar(64) binary not null,
	-- Key to user_id
	sub_user int unsigned not null,
	-- Timestamp for when the subscription was added
	sub_timestamp varbinary(14) not null,
	PRIMARY KEY (sub_namespace, sub_title, sub_section, sub_user)
) /*$wgDBTableOptions*/;

-- Index to get the user list to notify given a change to a specific section
CREATE INDEX /*i*/discussiontools_subscription_users_to_notify ON /*_*/discussiontools_subscription (sub_namespace, sub_title, sub_section);
-- Index to find all subscriptions for a given page
CREATE INDEX /*i*/discussiontools_subscription_page ON /*_*/discussiontools_subscription (sub_namespace, sub_title);

I sort of feel that PRIMARY KEY (sub_namespace, sub_title, sub_section, sub_user) might be a bit much. That said, we'll definitely need that as an index at least, so it's maybe reasonable. Can let DBA comment on whether an auto-increment sub_id would be better later. 🤷🏻‍♂️

I think an auto-incrementer would be better, and is what watchlist is using. It has benefits for paging and possibly sharding.

Change 659080 had a related patch set uploaded (by Esanders; owner: Esanders):
[mediawiki/extensions/DiscussionTools@master] WIP Subscribe API

https://gerrit.wikimedia.org/r/659080

Change 660471 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@master] WIP Create table for subscribing to sections

https://gerrit.wikimedia.org/r/660471

Change 664498 had a related patch set uploaded (by DLynch; owner: DLynch):
[mediawiki/extensions/DiscussionTools@master] WIP: Service to interact with topic subscriptions

https://gerrit.wikimedia.org/r/664498

24 February team meeting notes
Today, we met as a team to discuss what approaches are available for creating unique topic identifiers. The notes from that conversation, the conversation Ed and I had on 23 February, and the resulting actions from both, are below.


Actions

  • @ppelberg to consult with volunteers at Talk:Talk_pages_project/Notifications to learn about edge cases we ought to consider when deciding upon an approach to creating unique topic identifiers.
  • @Esanders: can you share more about the idea of enhancing signatures so they are precise to the second rather than the minute? This might exist in a ticket already, I'm just not sure which one.

Notes
Situation

  • For people to receive notifications when new comments are posted in a given topic, each topic needs to be assigned a unique identifier.
  • The "ingredients" we have available to make those unique identifiers are as follows:
    • (1) Page title
    • (2) Topic title
    • (3) Author of the topic's username
    • (4) The time the topic was published (to the nearest minute)
  • Broadly speaking, there are two approaches we can take for using the above "ingredients" to create unique identifiers.

Approaches

  • Approach A): use 1, 2 and 3 and/or 4.
    • Values:
      • Reliability at the expense of flexibility.
    • Advantages:
      • Low risk of people missing notifications or receiving notifications they are not expecting.
    • Disadvatnages:
      • Limits our ability to offer editors comment permalinking (T275729)
    • Failure case(s)
      • Subscriptions will break when a topic is renamed and/or moved.
  • Approach B): use 3 and 4 and decouple topics [and comments in the future] to be decoupled from the pages on which they exist.
    • Values:
      • Flexibility at the expense of reliability.
    • Advantages:
      • Subscriptions remain intact when a topic is renamed and/or moved.
    • Disadvantages:
      • Potentially, less reliable in the interim
    • Failure cases:
      • If Person A starts multiple topics within the same minute, it's possible for someone who subscribed to one of those topics (call them Person B) to start and stop receiving notifications without them explicitly deciding for this to happen.
        • Examples: RfCs and en.wiki's Request for adminship process where multiple [sub-] sections are often created within the same edit to help structured the conversation (see: /TJMSmith). //Note: this may not be problematic considering topic subscriptions are currently limited to ==H2==.
    • Considerations
      • There is the potential to reduce the likelihood of the "failure case" above by adjusting signatures so they are precise to the second rather than to the minute, as they are currently.

Actions

In progress; see: https://w.wiki/32eJ.

I think we should move forward with the approach that allows topics, and comments in the future, to be decoupled from the pages on which they exist.

Reasons:

  1. We think this approach will result in fewer false negatives (notifications not being sent when people expect them to be). [i]
    • Underlying assumption: it will be more likely for someone to change the name of a section (thereby breaking existing subscriptions were we to take "Approach A)") than it will be for someone to create two == H2 == level sections in the same minute.
  2. This approach will enable us to offer/expose links to topics (T273341) and comments (T274685) that remain functional even when said topics and comments are moved to different pages (read: archived).
  3. Assuming I'm accurately understanding what @Esanders alluded to in the meeting we had on 24-February (see: T264885#6859503), this approach enables us to offer editors the ability to, at some not-yet-defined point in time, to make topic subscriptions more reliable/resilient by making comment signatures precise to the second.

i. Note: this approach could result in a greater number of false positives (notifications being sent when people do NOT expect them to be). An example of the kind of "false positive" the "decoupling approach" could cause is: "Person A starting Topic 1 and Topic 2 within the same minute. Person B subscribing to a Topic 1 and, unknowingly, becoming subscribed to Topic 2." We are okay with accepting this possibility considering if it were to occur, there are steps we can take to address it (see "3." above).

Next steps

  • 1. Engineering to review and merge stack
  • 2. Engineering to ping DBA with ask for review
    • 2a. @ppelberg to link to DBA review ticket [if it exists]

Next steps

  • 1. Engineering to review and merge stack
  • 2. Engineering to ping DBA with ask for review
    • 2a. @ppelberg to link to DBA review ticket [if it exists]

DBA review to happen in T263817: DBA review: conversation subscriptions.

Change 660471 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Create table for topic subscriptions

https://gerrit.wikimedia.org/r/660471

Change 664498 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Service to interact with topic subscriptions

https://gerrit.wikimedia.org/r/664498

Change 659080 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Topic subscription action API

https://gerrit.wikimedia.org/r/659080