Content Tagging Models: Prototype two
Closed, Resolved · Public

Description

Goal: prototype two content tagging models. Prototype is an ambiguous term and the two models are at different stages, but the goal is for each to have a fully-working language-agnostic model and a plan for additional improvements / language-specific tweaks.

These will be an article quality model and a geography model.

Event Timeline

Weekly update:

  • Ongoing discussion with MM about the design of a quality model and where it does / doesn't work well for the knowledge gaps metrics use-case.

Weekly update:

  • No progress this week

Weekly update: talked with Miriam; this work will move more slowly for the rest of the quarter while I focus on some other projects, but will pick back up in Q3

Weekly update: will be prioritizing in January

Weekly update:

  • Gathered new groundtruth data from Arabic/French/English Wikipedia for article quality to test / improve the model.
  • Working with Growth on using some of the preliminary geography data for an edit-a-thon.
  • Updated the geographic (and gender) data snapshot for some other projects (and in doing so, verified that it is still working well)

Weekly updates:

  • updated quality model features (added more features and made sure the model could still run simultaneously on all Wikipedia language editions)
  • continued reimplementing the data pipeline for the quality model to support evaluation data from multiple languages

Weekly updates:

  • Made some additional progress that I feel good about, so closing out this task and creating a new Q3-specific task
  • Switched the quality model to using wikitext only rather than the links tables. This will let us apply it easily to historical Wikipedia revisions and probably also speeds up / simplifies the data pipeline, because there is now one source of data that is processed once (wikitext) instead of many different tables that need to be joined. A sketch of what this can look like follows this list.
  • Waiting to hear from Newcomer Pilot folks about geography model
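
For illustration, here is a minimal sketch of wikitext-only feature extraction. The feature set and regexes are my own assumptions for the sketch, not the model's exact implementation:

```
import re

def extract_features(wikitext: str) -> dict:
    """Compute simple quality features from raw wikitext alone,
    with no joins against links tables, so the same code can be
    applied to any historical revision."""
    return {
        "page_length": len(wikitext),
        "refs": len(re.findall(r"<ref[ >/]", wikitext)),
        "wikilinks": len(re.findall(r"\[\[", wikitext)),
        "categories": len(re.findall(r"\[\[\s*Category:", wikitext, flags=re.IGNORECASE)),
        "headings": len(re.findall(r"(?m)^==.*==\s*$", wikitext)),
    }
```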

Hello Isaac, @diego,

I have the following queries about the V2 model:

  1. How are the features (e.g., page length, references, etc.) weighted in the model? Also, have they been computed on the basis of all wikis, or some specific wikis?
  2. How did you set the minimum wiki thresholds? I understand that these are determined empirically, but do they consider all the wikis, or a few selected ones?

-Best,
Paramita

Hey @paramita_das: you can see most of these details in the write-up and attached notebook. Pointers to your specific questions below:

How are the features (e.g., page length, references, etc.) weighted in the model?

The exact weights are in the meta page and the predictQuality function in the notebook. They were derived from a groundtruth dataset of English, French, and Arabic articles. I don't think I have that notebook public anywhere at the moment, but it's based on a sample of articles that were rated for quality in the last month, with a little bit of balancing so English doesn't dominate the sample.
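
For concreteness, the general shape of a predictQuality-style function is a weighted sum of capped, normalized features. A minimal sketch, with placeholder weights rather than the fitted values from the meta page:

```
# Placeholder weights for illustration only; the real values were
# derived from the English/French/Arabic groundtruth sample.
WEIGHTS = {
    "page_length": 0.4,
    "refs": 0.3,
    "wikilinks": 0.1,
    "categories": 0.1,
    "headings": 0.1,
}

def predict_quality(features: dict, min_thresholds: dict) -> float:
    """Normalize each feature by its minimum threshold, cap at 1,
    and combine via the weights, giving a score in [0, 1]."""
    return sum(
        weight * min(1.0, features[name] / min_thresholds[name])
        for name, weight in WEIGHTS.items()
    )
```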

How did you set the minimum wiki thresholds? I understand that these are determined empirically, but do they consider all the wikis, or a few selected ones?

This was based on eyeballing the data for all wikis. You can see some comments on this where the minimum thresholds are set in the notebook; later on, the raw data is under a cell labeled "Data to help in setting min thresholds" if you want to get a sense of the practical impact of these thresholds.
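
To make that practical impact concrete, here is a small worked example of how a minimum threshold caps a feature's contribution. The threshold values are hypothetical, not the ones in the notebook:

```
# Hypothetical thresholds: an article at or above a threshold gets
# full credit for that feature; below it, credit scales linearly.
MIN_THRESHOLDS = {"page_length": 10000, "refs": 10, "wikilinks": 50}

def normalized(value: float, threshold: float) -> float:
    return min(1.0, value / threshold)

# An article with 4 references gets 0.4 credit for the refs feature,
# while one with 25 references is capped at the same 1.0 as one with 10:
assert normalized(4, MIN_THRESHOLDS["refs"]) == 0.4
assert normalized(25, MIN_THRESHOLDS["refs"]) == 1.0
```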