Review model performance for ptwiki 'articlequality' and 'draftquality'
Closed, ResolvedPublic

Description

Both these models have been deployed into production in their first iteration and require feedback to further improve.

We have devised scripts that let ptwikipedians see the predictions made by the model on any ptwiki page. These will be circulated to them via @He7d3r and @GoEThe.

Event Timeline

Right now, we don't have a good way to surface draftquality. We wouldn't have to do anything special: we could just add a draftquality option to the articlequality model and then it would show up on any page. It could be confusing when loaded up on fully fleshed-out pages, though. I suspect we'll get some weird predictions for any page beyond quality level 2, because few if any drafts look like that.

Maybe only display the draftquality if the article was created no more than X days ago? Or has no more than Y revisions? I don't know if there is something like an "isNewArticle" flag available to us.

That sounds totally reasonable to me. I don't think there is any such flag, but I'd support any definition of "new" that you think makes sense.

Special:NewPages only lists articles under 30 days old. I would go with that.

If I'm not mistaken, we can get the earliest revision of the article using something like this:
https://pt.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=revisions&list=&titles=Gita%20Ramjee&rvlimit=1&rvdir=newer
and then compare the timestamp with some (configurable) delta from the current time.
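A minimal sketch of that check in Python (assuming the requests library; the function name is just illustrative, and the 30-day window from Special:NewPages is used as the configurable delta):

import datetime
import requests

API = "https://pt.wikipedia.org/w/api.php"

def is_new_article(title, max_age_days=30):
    """Return True if the page's first revision is younger than max_age_days."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvdir": "newer",   # oldest revision first
        "rvprop": "timestamp",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    if "revisions" not in page:  # e.g. missing page
        return False
    created = datetime.datetime.strptime(
        page["revisions"][0]["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    age = datetime.datetime.utcnow() - created
    return age.days <= max_age_days

print(is_new_article("Gita Ramjee"))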

I think @GoEThe's suggestion makes a lot of sense. That would cover a reasonable number of articles, and those would still be relevant candidates for draftquality.

That would be very handy; it would make the review process much faster for the ptwikipedians.

Also, it would be interesting to generate a list of articles where the ORES quality prediction differs from the current automatic assessment provided by the local Lua module:
https://pt.wikipedia.org/wiki/Module:Avalia%C3%A7%C3%A3o

One way to get the assessment would be to parse the wikitext of an invocation of the module such as

{{#invoke:Avaliação|qualidade|página=Gilching}}

by means of a query like this:
https://pt.wikipedia.org/w/api.php?action=parse&format=jsonfm&text=%7B%7B%23invoke%3AAvalia%C3%A7%C3%A3o%7Cqualidade%7Cp%C3%A1gina%3DGilching%7D%7D&contentmodel=wikitext

The result will contain something in the form

Qualidade $1 ($2)

where $1 is the quality and $2 is the reason for the article having that quality (and not a higher one).
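For completeness, a Python sketch of that approach (assuming the requests library; the regex is an assumption about the module's rendered output, which is expected to contain the literal text "Qualidade N (reason)"):

import re
import requests

API = "https://pt.wikipedia.org/w/api.php"

def get_local_assessment(title):
    """Parse the output of Module:Avaliação for a page and extract quality and reason."""
    data = requests.get(API, params={
        "action": "parse",
        "format": "json",
        "text": "{{#invoke:Avaliação|qualidade|página=%s}}" % title,
        "contentmodel": "wikitext",
    }).json()
    html = data["parse"]["text"]["*"]
    match = re.search(r"Qualidade\s+(\d)\s*\(([^)]*)\)", html)
    return (int(match.group(1)), match.group(2)) if match else None

print(get_local_assessment("Gilching"))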

Or, to make the result easier to parse, the Scribunto API could be used to run:

p=require('Module:Avaliação'); print(p._getClassForPage('Brasil')[1])

as in
https://pt.wikipedia.org/w/api.php?action=scribunto-console&format=jsonfm&title=Module%3AAvalia%C3%A7%C3%A3o&question=p%3Drequire(%27Module%3AAvalia%C3%A7%C3%A3o%27)%3B%20print(p._getClassForPage(%27Brasil%27)%5B1%5D)
which returns:

{
    "type": "normal",
    "print": "6\n",
    "return": "",
    "session": 1975352193,
    "sessionSize": 71,
    "sessionMaxSize": 500000,
    "sessionIsNew": ""
}
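A small Python sketch of that call (assuming the requests library, and that the endpoint accepts anonymous GET requests as the jsonfm URL above does):

import requests

API = "https://pt.wikipedia.org/w/api.php"

def get_quality_via_scribunto(title):
    """Run Module:Avaliação through the Scribunto console and return the quality level."""
    question = ("p=require('Module:Avaliação'); "
                "print(p._getClassForPage('%s')[1])" % title)
    data = requests.get(API, params={
        "action": "scribunto-console",
        "format": "json",
        "title": "Module:Avaliação",
        "question": question,
    }).json()
    # The level arrives in the "print" field, e.g. "6\n"
    return int(data["print"].strip())

print(get_quality_via_scribunto("Brasil"))  # 6 (featured article)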

Out of the 951 featured articles (quality 6) on ptwiki:

  • 3 (0.3%) are predicted as having quality 3
  • 37 (3.9%) are predicted as having quality 4
  • 138 (14.5%) are predicted as having quality 5 (Good article)
  • 773 (81.3%) are predicted as having quality 6 (Featured article)

This table shows the specific articles:
https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:P%C3%A1gina_de_testes/1&oldid=58092216

Out of the 974 good articles (quality 5) on ptwiki:

  • 12 (1.2%) are predicted as having quality 3
  • 32 (3.3%) are predicted as having quality 4
  • 796 (81.7%) are predicted as having quality 5 (Good article)
  • 134 (13.8%) are predicted as having quality 6 (Featured article)

This table shows the specific articles:
https://pt.wikipedia.org/w/index.php?title=Wikip%C3%A9dia:Artigos_bons/Conte%C3%BAdo&oldid=58092346

That sounds pretty decent. I looked at https://pt.wikipedia.org/wiki/Benjamin_Abrah%C3%A3o_Botto -- one of the articles that was predicted to be a 3. We can get the straight prediction from ORES with this: https://ores.wikimedia.org/v3/scores/ptwiki/57185234/articlequality

After a bit of fiddling around, I found that I could dramatically increase the predicted quality by increasing the # of refs. E.g., https://ores.wikimedia.org/v3/scores/ptwiki/57185234/articlequality?features&feature.wikitext.revision.ref_tags=50
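For anyone who wants to repeat that kind of experiment, here is a hedged Python sketch (assuming the requests library and the v3 response layout implied by the URLs above; the helper name is just illustrative):

import requests

ORES = "https://ores.wikimedia.org/v3/scores/ptwiki/{rev}/articlequality"

def score(rev_id, **injected):
    """Fetch an articlequality score, optionally injecting feature values."""
    params = {"features": ""}
    for name, value in injected.items():
        # e.g. ref_tags=50 becomes feature.wikitext.revision.ref_tags=50
        params["feature.wikitext.revision." + name] = value
    data = requests.get(ORES.format(rev=rev_id), params=params).json()
    return data["ptwiki"]["scores"][str(rev_id)]["articlequality"]["score"]

baseline = score(57185234)
boosted = score(57185234, ref_tags=50)
print(baseline["prediction"], boosted["prediction"])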

@He7d3r, what do you think about this? In your opinion, is this really a 6? Would it be improved with more references, or is the model expecting the wrong things from an article like this?

That one was evaluated in 2008, so standards probably have changed.

How should we interpret the different weighted sum values (shown in parentheses) for articles such as
https://pt.wikipedia.org/wiki/Ambientalismo
and
https://pt.wikipedia.org/wiki/Mesas_girantes
which have predictions 6 (5.74) and 6 (4.88) respectively?

Is it reasonable to say that the greater the difference between the prediction value and the weighted sum, the more uncertain/questionable the prediction is?

That could be one way to think of it, yes. This is because the prediction is just the label with the highest probability, whereas the weighted sum reflects the model's overall view of the article across all classes.

Hope this answers your question.

That one was evaluated in 2008, so standards probably have changed.

Looking at it, I think it wouldn't stand a chance of being considered a 6 right now. It lacks references, has short sections, and is probably incomplete.

That seems to make the model's prediction right, is that correct, @GoEThe?

That seems to make the model's prediction right, is that correct, @GoEThe?

I would definitely say so.

The model seems to be working at roughly 80% accuracy, judging from the numbers @He7d3r has reported. I think this review could benefit from extra pairs of eyes on it. Can @GoEThe and @He7d3r bring in more community members to check out https://etherpad.wikimedia.org/p/jsForPtwikiOres ?

Do you think this would be a good idea?

Yes, that's a good idea. You can post this message on the Esplanada, our version of the Village Pump (https://pt.wikipedia.org/wiki/Wikip%C3%A9dia:Esplanada/an%C3%BAncios). You can either post it in English or we can translate it to Portuguese before you post it.

Whichever is easiest. I would personally prefer that someone could translate it to Portuguese first though. Maybe you can add a Portuguese translation at the top of the same pad?

Once that is done, I would love to post a message on the Esplanada about our work!

The "weighted sum" is essentially the center of the probability distribution across classes. E.g.

Article A:

1: 10%
2: 10%
3: 10%
4: 10%
5: 10%
6: 50%

Article B:

1: 5%
2: 5%
3: 5%
4: 5%
5: 5%
6: 75%

For article A, we get a prediction of "6" with a weighted sum of 4.5. For article B, we get a prediction of "6" with a weighted sum of 5.25. For article A, we have a lot less certainty and the weighted sum reflects that. For article B, we have a lot more certainty and the weighted sum reflects that. Usually, certainty is high and we get a weighted sum that roughly matches the predicted (most likely) class. When we don't, that probably indicates that there are many articles with similar characteristics ("features" in the ML literature) whose labels fell across a wide spectrum of classes.
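In code, the weighted sum is just the expected class value under the predicted probability distribution; a minimal Python sketch reproducing the two examples above:

def weighted_sum(probabilities):
    """probabilities: mapping from class label (1-6) to predicted probability."""
    return sum(label * prob for label, prob in probabilities.items())

article_a = {1: 0.10, 2: 0.10, 3: 0.10, 4: 0.10, 5: 0.10, 6: 0.50}
article_b = {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.05, 6: 0.75}

print(weighted_sum(article_a))  # 4.5
print(weighted_sum(article_b))  # 5.25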

Whichever is easiest. I would personally prefer that someone could translate it to Portuguese first though. Maybe you can add a Portuguese translation at the top of the same pad?

Once that is done, I would love to post a message on the Esplanada about our work!

I added the translation. Great work, by the way. I just tested the script on a couple of random articles and the scores make a lot of sense so far.

The script is all credit to @Halfak 😄

Thank you for the translation. I will add my message on Esplanada now!

(...)
When we don't, that probably indicates that there are many articles with similar qualities ("features" in the ML literature) that fell across a wide spectrum of classes.

Makes sense. Thanks for the explanation.

@GoEThe I added my message on the Esplanada; do check it out and correct it if anything seems wrong about it.

I would just paste the content of the etherpad on the Esplanada. And don't forget to finish your message with four tildes ("~~~~"), so that your signature is added to the message.

@He7d3r has updated the message. What do you think about it now? I think we should see some input from the community on the model (articlequality) soon. In the meantime what can we achieve for draftquality?

That is fine. We had a bunch of comments on Telegram (https://t.me/wikipediapt), which are not easily recovered here. I think it might be easier to gather feedback on a wiki page than here.

@He7d3r has updated the message. What do you think about it now? I think we should see some input from the community on the model (articlequality) soon. In the meantime what can we achieve for draftquality?

See T246667#6079484.

See https://www.mediawiki.org/wiki/ORES/Issues/Draft_quality for a centralized place to report issues. I'd also be open to having a page on ptwiki that works roughly like this.

@Halfak: what should be considered a true positive in these multi-class classification problems? (when filling in the misclassification report template at mw:ORES/Issues/Article quality)
Would a "featured article" be a "positive" case for the articlequality model, and any other level a "negative"? Or something else?
What about the draftquality model? (in this case there does not even seem to be any implicit order between the classes, e.g. nothing like OK < spam < unsuitable)

Oh good question. If you don't set the "type" param on {{misclassification report}}, it just calls it a "misclassification".

@Halfak, do you have a quick way to get how many assessments were made by each user in the dataset ptwiki.balanced_labelings.*_2020.json which was used for the articlequality model? Are we getting labels from a diverse set of users or mostly from just a few?

Hmm. No quick way to do that... We could modify the label extractor to grab it though. It would require some refactoring. But I imagine this could be of general use.

The following pull request is related to improving the articlequality model: https://github.com/wikimedia/articlequality/pull/122

Could the number of labels per article have a negative impact on the quality of the model?
These are the frequencies of the number of labels per page in the full set and in the 9k sample:

$ cat ptwiki.labelings.20200301.json | json2tsv page_title | sort | uniq -c | cut -c-8 | sort |uniq -c
 181477       1 
   3042       2 
    517       3 
    100       4 
     19       5 
      2       6 
      2       7 

$ cat ptwiki.labeled_revisions.w_cache.9k_2020.json | json2tsv page_title | sort | uniq -c | cut -c-8 | sort |uniq -c
   6807       1 
    850       2 
    119       3 
     11       4 
      1       6 

(these numbers are from the datasets I got when I ran make models/ptwiki.wp10.gradient_boosting.model on my machine, with the patch above)

Halfak triaged this task as Medium priority. May 5 2020, 2:44 PM

Here are some graphs showing the evolution of the assessments extracted from ptwiki:

all-labelings-yearly-assessments-stacked.png (288×1 px, 35 KB)

all-labelings-monthly-assessments-stacked.png (360×1 px, 116 KB)

all-labelings-daily-assessments-cumulative.png (360×1 px, 56 KB)

It is not strictly related to the articlequality model performance, but you might find it interesting anyway.

This is the code used to generate them: https://gist.github.com/he7d3r/0889ef07e912c88daee8a197f1bba728

Wow. This is really awesome. I wonder what would happen if we retrained the models on recent data only. In enwiki we found that the definition of quality changed over time. There should be plenty of observations after 2014 to give us a good signal.

I think this might make some improvements to some of our other models too :)

After patching¹ the extractor to also collect user names, I found that these are the top 10 users who added/modified the most assessments:

FMTbot            91366
Rei-bot           25155
BotStats          15830
Fabiano Tatsch    14829
Leandro Drudo      4660
GoEThe             3172
Burmeister         3128
Rei-artur          2965
FilRBot            2444
VítoR Valente      1895

Then I produced² the following graphs showing the number of labels added/modified by bots³ by year, for each of the six quality levels. There are many quality 1 and 2 assessments made by bots.

It could also be interesting to retrain the model without bot assessments.

all-bot-labelings-yearly-assessments-stacked.png (2×936 px, 97 KB)

Notes
¹ I've only changed two lines, as shown in this patch: https://gist.github.com/he7d3r/f945adb3f47c2bd22d0e7589ef6afdce#file-user-patch
² Using this code:
https://gist.github.com/he7d3r/f945adb3f47c2bd22d0e7589ef6afdce#file-ptwiki-labelings-20200301-json-user-ipynb
³ User names containing "bot"

Wow. This is really awesome. I wonder what would happen if we retrained the models on recent data only. In enwiki we found that the definition of quality changed over time. There should be plenty of observations after 2014 to give us a good signal.

I've made two tests:

  1. keeping only bot assessments (but still using the whole period); (Note: By mistake, I forgot the -v flag in the grep below, so the results are inverted, that is, bots-only instead of no-bots)
  2. restricting the dataset to 2014-2020; (Note: this case didn't need the -v flag)
$ cat datasets/ptwiki.labelings.20200301.user.json | grep -P '"user": "[^"]*([Bb][Oo][Tt]|[Rr][Oo][Bb][ÔôOo])[^"]*"' > datasets/ptwiki.labelings.20200301.user.no_bots.json
$ cat datasets/ptwiki.labelings.20200301.user.no_bots.json | json2tsv wp10 | sort | uniq -c
 117182 1
  19351 2
    759 3
     20 4
     95 5
    203 6
$ make ptwiki_models

This is the resulting model info for this case: https://gist.github.com/he7d3r/1a617f50ab63ba57a9254377eddd42d1#file-ptwiki-wp10-v2-full-period-bots-removed-md
In particular, we have this:

accuracy (micro=0.98, macro=0.981):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.978  0.986  0.976  0.969  0.988  0.988

Now, for the second scenario:

$ cat datasets/ptwiki.labelings.20200301.json | grep -P '"timestamp": "20(1[4-9]|20)' > datasets/ptwiki.labelings.20200301.since_2014.json
$ cat datasets/ptwiki.labelings.20200301.since_2014.json | json2tsv wp10 | sort | uniq -c
   7516 1
   3314 2
   1247 3
    674 4
    630 5
    654 6
$ make ptwiki_models

This is the resulting model info for this case: https://gist.github.com/he7d3r/1a617f50ab63ba57a9254377eddd42d1#file-ptwiki-wp10-v3-since-2014-bots-included-md
In particular, we have this:

accuracy (micro=0.787, macro=0.868):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.772  0.791  0.861  0.914  0.934  0.935

PS: I didn't change the thresholds in the Makefile, so the samples were not as balanced as one might want:
(Note: By mistake, I forgot the -v flag in the grep above, so the results for the first case are inverted, that is, they contain only bot assessments instead of excluding them)

$ cat datasets/ptwiki.balanced_labelings.9k_2020.no_bots.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
    759 3
     20 4
     95 5
    203 6
$ cat datasets/ptwiki.balanced_labelings.9k_2020.since_2014.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
   1247 3
    674 4
    630 5
    654 6

Wow! Removing bot contributions ended up removing a ton of labels. What do you think explains that? Are there really only bots adding the level 4 label? Or do you think we messed up our extraction process?

Ultimately, I'm not surprised that we can build good models on a small amount of data. If this is a good filtering process, let's do it.

@Halfak: Oops... I missed the -v flag when I used grep to remove the bot assessments. So, instead of considering only human assessments, I extracted only the bot assessments! Once I add that flag, the number of assessments by humans seems more reasonable:

$ cat datasets/ptwiki.labelings.20200301.user.json |grep -v -P '"user": "[^"]*([Bb][Oo][Tt]|[Rr][Oo][Bb][ÔôOo])[^"]*"' | json2tsv wp10 | sort | uniq -c
  28403 1
  13343 2
   5329 3
   2209 4
   1458 5
   1281 6

In this case, the explanation for such a high accuracy is likely that the bot assessments are very predictable (it is hardcoded in their code ;-).

Updated info (as of commit c3a66b0 plus the specific changes which define each of the tests):

accuracy (micro=0.8, macro=0.861):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.781  0.827  0.877  0.899  0.875  0.908
$ cat datasets/ptwiki.labelings.20200301.remove_bots.json | json2tsv wp10 | sort | uniq -c
 145657 1
  32807 2
   6177 3
   2346 4
   1646 5
   1542 6
$ cat datasets/ptwiki.balanced_labelings.9k_2020.remove_bots.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
   1500 3
   1500 4
   1500 5
   1328 6
accuracy (micro=0.81, macro=0.875):
	    1      2      3      4      5      6
	-----  -----  -----  -----  -----  -----
	0.799  0.806  0.867  0.915  0.928  0.933
$ cat datasets/ptwiki.labelings.20200301.since_2014.json | json2tsv wp10 | sort | uniq -c
   7537 1
   3346 2
   1276 3
    690 4
    653 5
    684 6
$ cat datasets/ptwiki.balanced_labelings.9k_2020.since_2014.json | json2tsv wp10 | sort | uniq -c
   1500 1
   1500 2
   1276 3
    690 4
    653 5
    684 6

It looks like the data since 2014 is the most consistent, and that makes sense as the assessment process has been socialized and standardized. What do you think of submitting a PR with that model?

@GoEThe @He7d3r

We haven't received any misclassification reports from the ptwiki community and are planning to go ahead and deploy the current version of the model.

You can let us know here if there is anything you want us to look into before that; we'd be more than happy to do so.

@Danilo generated the following table comparing the articlequality scores for the latest version of all articles to the scores that would be produced by the Python script which is/was used to make the bot assessments:

MariaDB [s51206__ptwikis]> SELECT pe_qualidade, SUM(pe_qores = 0) ORES_0, SUM(pe_qores = 1) ORES_1, SUM(pe_qores = 2) ORES_2, SUM(pe_qores = 3) ORES_3, SUM(pe_qores = 4) ORES_4, SUM(pe_qores = 5) ORES_5, SUM(pe_qores = 6) ORES_6 FROM page_extra GROUP BY pe_qualidade ORDER BY pe_qualidade;
+--------------+--------+--------+--------+--------+--------+--------+--------+
| pe_qualidade | ORES_0 | ORES_1 | ORES_2 | ORES_3 | ORES_4 | ORES_5 | ORES_6 |
+--------------+--------+--------+--------+--------+--------+--------+--------+
|            0 |  68218 |      0 |      0 |      0 |      0 |      0 |      0 |
|            1 |      3 | 618819 | 204187 |  27523 |   1847 |   3562 |   3261 |
|            2 |      0 |   5565 |  69323 |  24777 |   1390 |   7496 |    350 |
|            3 |      0 |     71 |    472 |  14361 |   1861 |   7412 |    572 |
|            4 |      0 |      5 |     10 |   2948 |   2361 |   2978 |   1208 |
|            5 |      0 |      0 |     16 |     59 |    136 |   1056 |    161 |
|            6 |      0 |      0 |      0 |     35 |    190 |    188 |    782 |
+--------------+--------+--------+--------+--------+--------+--------+--------+
7 rows in set (3.70 sec)

(the label is set to zero if the quality is unknown, possibly due to the page being deleted)

The bot assessments were produced by this script, and they are also shown by the tool at https://ptwikis.toolforge.org/Avalia%C3%A7%C3%A3o:r123456 (the revid in the URL can be changed).
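For a quick read of the table above, here is a minimal Python sketch (counts copied from the query result; quality 0 is skipped because those labels are unknown) that computes the exact-agreement rate for each local quality level:

# Rows: local assessment (pe_qualidade); columns: ORES_0 .. ORES_6.
table = {
    1: [3, 618819, 204187, 27523, 1847, 3562, 3261],
    2: [0, 5565, 69323, 24777, 1390, 7496, 350],
    3: [0, 71, 472, 14361, 1861, 7412, 572],
    4: [0, 5, 10, 2948, 2361, 2978, 1208],
    5: [0, 0, 16, 59, 136, 1056, 161],
    6: [0, 0, 0, 35, 190, 188, 782],
}

for quality, counts in table.items():
    total = sum(counts)
    exact = counts[quality]  # column N holds the ORES_N count
    print(f"quality {quality}: {exact}/{total} = {exact / total:.1%} exact agreement")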