Research:Newsletter/2024/May: Difference between revisions

<noinclude>{{Signpost draft
|title = ChatGPT did not kill Wikipedia, but might have reduced its growth
|blurb = And various research findings about Wikidata and knowledge graphs.
|Ready-for-copyedit = Yes
|Copyedit-done = yes
|Final-approval = yes
|piccyfilename = File:The fundamental knowledge scaffolding model.png
|piccy-credits = Bagnoli et al.
|piccy-license = CC BY 4.0
|piccy-xoffset = 485
|piccy-yoffset = 213
|piccy-scaling = 800
}}
{{Wikipedia:Wikipedia Signpost/Templates/RSS description
|1=<!-- LEAVE BLANK to use "<title>: <blurb>" (using title and blurb from above), or replace with a custom description for the RSS feed -->
}}{{Wikipedia:Wikipedia Signpost/Templates/Signpost-header|||}}</noinclude>
 
{{Wikipedia:Wikipedia Signpost/Templates/Signpost-article-header-v2
|{{{1|ChatGPT did not kill Wikipedia, but might have reduced its growth}}}
|By [[User:HaeB|Tilman Bayer]]
}}
 
{{Wikipedia:Wikipedia Signpost/Templates/Signpost-block-start-v2|fullwidth=no<!--CHANGE TO YES FOR A 'FULLWIDTH' ARTICLE-->}}
 
 
{{WRN}}
 
===Actually, Wikipedia was not killed by ChatGPT – but it might be growing a little less because of it===
A preprint<ref>{{Cite| publisher = arXiv| doi = 10.48550/arXiv.2405.10205| last1 = Reeves| first1 = Neal| last2 = Yin| first2 = Wenjie| last3 = Simperl| first3 = Elena| title = Exploring the Impact of ChatGPT on Wikipedia Engagement| date = 2024-05-22| url = http://arxiv.org/abs/2405.10205}}</ref> by three researchers from King's College London tries to identify the impact of the November 2022 launch of [[ChatGPT]] on "Wikipedia user metrics across four areas: page views, unique visitor numbers, edit counts and editor numbers within twelve language instances of Wikipedia." The analysis concludes that
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"any impact has been limited and while [ChatGPT] may have led to lower growth in engagement [i.e. Wikipedia pageviews] within the territories where it is available, there has been no significant drop in usage or editing behaviours"
"At this time, there is limited published research which demonstrates how and why users have been engaging with ChatGPT, but early indications would suggest users are turning to it in place of other information gathering tools such as search engines [...]. Indeed, question answering, search and recommendation are key functionalities of large language models identified in within the literature [...]"
</blockquote>
However, like many other current concerns about AI, these have been speculative and anecdotal. Hence the value of a quantitative analysis that tries to identify the causal impact of ChatGPT on Wikipedia in a statistically rigorous manner. Without conducting experiments though, i.e. based on observational data alone, it is not easy to establish that a particular change or external event caused persistent increases or decreases in Wikipedia usage overall (as opposed to one-time spikes from particular events, or [[m:Research:Newsletter/2019/November#Seasonality_in_pageviews_reflects_plants_blooming_and_birds_migrating|recurring seasonal changes]]). The paper's literature review section cites only one previous publication which achieved that for Wikipedia pageviews: a 2019 paper by three authors from the Wikimedia Foundation (see our earlier coverage: [[m:Research:Newsletter/2019/December#An_awareness_campaign_in_India_did_not_affect_Wikipedia_pageviews,_but_a_new_software_feature_did|"An awareness campaign in India did not affect Wikipedia pageviews, but a new software feature did"]]). They had used a fairly sophisticated statistical approach ([[Bayesian structural time series]]) to first create a counterfactual forecast of Wikipedia traffic in a world where the event in question did not happen, and then interpret the difference between that forecast and the actual traffic as related to the event's impact. Their method successfully estimated the impact of a software change (consistent with the results of a previous randomized experiment conducted by this reviewer), as highlighted by the authors of the present paper: "Technological changes can [...] have significant and pervasive changes in user behaviour as demonstrated by the significant and persistent drop in pageviews observed in 2014 [sic, actually 2018] when Wikipedia introduced a page preview feature allowing desktop users to explore Wikipedia content without following links."
The WMF authors concluded their 2019 paper by expressing the hope that "it lays the groundwork for exploring more standardized methods of predicting trends such as page views on Wikipedia with the goal of understanding the effect of external events."
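
The counterfactual logic described above can be illustrated with a deliberately simplified sketch. It does not implement Bayesian structural time series; as a stand-in, it forecasts the post-event period from the average weekday profile of the pre-event period and reads the gap between forecast and actual traffic as the estimated effect. The function names and the toy traffic numbers are assumptions made for illustration only.

```python
from statistics import mean


def weekday_profile(pre):
    """Average value for each position in the 7-day cycle, from pre-event data."""
    return [mean(pre[d::7]) for d in range(7)]


def counterfactual_effect(pre, post):
    """Relative gap between actual post-event traffic and a naive forecast.

    The forecast simply repeats the pre-event weekday profile; a real
    analysis would use a proper time series model instead.
    """
    profile = weekday_profile(pre)
    offset = len(pre) % 7  # keep the weekly cycle aligned across the event
    forecast = [profile[(offset + i) % 7] for i in range(len(post))]
    gaps = [actual - expected for actual, expected in zip(post, forecast)]
    return mean(gaps) / mean(forecast)


# Toy data (hypothetical): eight identical weeks before the event,
# then four weeks in which every day is 5% lower.
weekly_pattern = [100, 110, 120, 115, 105, 80, 70]
pre_event = weekly_pattern * 8
post_event = [0.95 * v for v in weekly_pattern * 4]

effect = counterfactual_effect(pre_event, post_event)
print(f"estimated relative effect: {effect:+.1%}")
```

A forecast of this kind ignores long-term trends and holidays, which is precisely what the more elaborate model is meant to handle.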
 
In contrast, the present paper starts out with a fairly crude statistical method.
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"selected to ensure geographic diversity covering both the global north and south. When selecting languages, we looked at three key factors:
# The common crawl size of the [[GPT-3]] main training data as a proxy for the effectiveness of ChatGPT in that language.
# The number of Wikipedia articles in that language.
# The number of global first and second language speakers of that language.
Then, "[a]s a first step to assess any impact from the release of ChatGPT, we performed paired statistical tests comparing aggregated statistics for each language for a period before and after release" (the paper leaves it unclear how long these periods were). E.g.
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"For page views, we first performed a two-sided [[Wilcoxon rank-sum test|Wilcoxon Rank Sum test]] to identify whether there was a difference between the two periods (regardless of
directionality). We found a statistically significant different for five of the six languages where ChatGPT was available and two of the six languages where it was not. However, when repeating this test with a one-sided test to identify if views in the period after release were lower than views in the period before release, we identified a statistically significant result in Swedish, but not for the remaining 11 languages."
</blockquote>
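
To see how the two-sided and one-sided results quoted above can diverge, here is a minimal sketch of a rank-sum test using the normal approximation (with average ranks for ties but no tie correction in the variance). The function names and toy data are hypothetical, and the paper's own implementation has not been published.

```python
import math


def rank_sum_z(before, after):
    """z statistic for the rank sum of `after` (normal approximation)."""
    pooled = sorted((value, group) for group, values in enumerate((before, after))
                    for value in values)
    n = len(pooled)
    rank_sum_after = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the tied ranks i+1 .. j
        rank_sum_after += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 1)
        i = j
    n1, n2 = len(before), len(after)
    expected = n2 * (n1 + n2 + 1) / 2
    std_dev = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (rank_sum_after - expected) / std_dev


def p_two_sided(z):
    """Probability of a difference at least this large in either direction."""
    return math.erfc(abs(z) / math.sqrt(2))


def p_after_lower(z):
    """One-sided p-value for the alternative 'after is lower than before'."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


# Toy data (hypothetical): daily views that are clearly HIGHER after the event.
before = list(range(100, 120))
after = list(range(130, 150))

z = rank_sum_z(before, after)
p_two = p_two_sided(z)
p_lower = p_after_lower(z)
print(f"two-sided p = {p_two:.2g}, one-sided (lower) p = {p_lower:.2g}")
```

In this toy case the two-sided test flags a difference while the one-sided test for a decrease does not, which is the pattern to expect when the change is an increase.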
For the other three metrics (unique users, active editors, and edits) the results were similarly ambiguous, motivating the authors to resort to a somewhat more elaborate approach:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"While the [[Wilcoxon signed-rank test|Wilcoxon Signed-Rank test]]<!--sic--> provided weak evidence for changes among the languages before and after the release of ChatGPT, we note ambiguities in the findings and limited accounting for seasonality. To address this and better evaluate any impact, we performed a [[panel regression]] using data for each of the four metrics. Additionally, to account for longer-term trends, we expanded our sample period to cover a period of three years with data from the 1st of January in 2021 to the 1st of January 2024."
</blockquote>
 
While this second method accounts for weekly and yearly seasonality, it too does not attempt to disentangle the impact of ChatGPT from ongoing longer term trends. (While the given regression formula includes a language-specific [[fixed effect]], it doesn't have one for the availability of ChatGPT in that language, and also no slope term.) The usage of Wikipedia might well have been decreasing or increasing steadily during those three years for other reasons (say the basic fact that every year, the number of Internet users worldwide [https://ourworldindata.org/grapher/number-of-internet-users?country=~OWID_WRL increases by hundreds of millions]). Indeed, a naive application of the method would yield the counter-intuitive conclusion that ChatGPT ''increased'' Wikipedia traffic in those languages where it was available:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"For all six languages, [using panel regression] we found a statistically significant difference in page views associated with whether ChatGPT had launched when controlling for day of the week and week of the year. In five of the six languages, this was a positive effect with Arabic featuring the most significant rise (18.3%) and Swedish featuring the least (10.2%). The only language where a fall was observed was Swahili, where page views fell by 8.5% according to our model. However, Swahili page viewing habits were much more sporadic and prone to outliers perhaps due to the low number of visits involved."
</blockquote>
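
The confounding problem can be made concrete with a small sketch, under the assumption of a toy series that grows steadily and experiences no intervention at all. A regression with only a before/after indicator reports a large "effect", while adding a linear trend term removes it. The helper names are hypothetical, and the paper's seasonality controls are left out for brevity.

```python
from statistics import mean


def slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = mean(x), mean(y)
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    slope, intercept = slope_intercept(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Toy data (hypothetical): 200 days of steady growth, nothing happens on day 100.
days = list(range(200))
views = [1000 + 2 * t for t in days]
after_launch = [1 if t >= 100 else 0 for t in days]

# Indicator only: the coefficient equals mean(after) - mean(before).
naive_shift, _ = slope_intercept(after_launch, views)

# Indicator plus linear trend, computed by partialling out the trend from
# both the outcome and the indicator (Frisch-Waugh-Lovell).
adjusted_shift, _ = slope_intercept(residuals(days, after_launch),
                                    residuals(days, views))

print(f"without trend term: {naive_shift:.1f}, with trend term: {adjusted_shift:.1f}")
```

The indicator-only model attributes 200 extra daily views to an event that never occurred.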
 
To avoid this fallacy (and partially address the aforementioned lack of trend analysis), the authors apply the same method to their (so to speak) [[control group]], i.e. "the six language versions of Wikipedia where ChatGPT is was unavailable":
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"Once again, results showed a statistically significant rise across five of the six languages. However, in contrast with the six languages where ChatGPT was available, these rises were generally much more significant. For Farsi, for example, our model showed a 30.3% rise, while for Uzbek and Vietnamese we found a 20.0% and 20.7% rise respectively. In fact, four of the languages showed higher rises than all of the languages where ChatGPT was available except Arabic, while one was higher than all languages except Arabic and Italian."
In the "Conclusion" section, the authors summarize this as follows:
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
Our findings suggest an increase in page visits and visitor numbers [i.e. page views and unique devices] that occurred across languages regardless of whether ChatGPT was available or not, although the observed increase was generally smaller in languages from countries where it was available. Conversely,
we found little evidence of any impact for edits and editor numbers. We conclude any impact has been limited and while it may have led to lower growth in engagement within the territories where it is available, there has been no significant drop in usage or editing behaviours.
</blockquote>
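
The comparison against the control group amounts to a difference of estimated changes. The sketch below uses only the six percentages quoted in this review (three per group), so its output is illustrative and not a reproduction of the paper's full twelve-language result; the variable names are assumptions.

```python
from statistics import mean

# Estimated changes in page views (percent) as quoted above.
# Only three of the six languages per group are quoted in this review.
chatgpt_available = {"Arabic": 18.3, "Swedish": 10.2, "Swahili": -8.5}
chatgpt_unavailable = {"Farsi": 30.3, "Uzbek": 20.0, "Vietnamese": 20.7}

mean_available = mean(chatgpt_available.values())
mean_unavailable = mean(chatgpt_unavailable.values())

# A negative gap means traffic grew less where ChatGPT was available.
gap = mean_available - mean_unavailable
print(f"available: {mean_available:.1f}%, unavailable: {mean_unavailable:.1f}%, "
      f"gap: {gap:.1f} percentage points")
```

Such a gap is only interpretable as an effect of ChatGPT if both groups would otherwise have followed similar trends, an assumption the languages chosen may or may not satisfy.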
 
Unfortunately this preprint does not adhere to research best practices about providing replication data or code (let alone a [[preregistration]]), making it impossible to e.g. check whether the analysis of pageviews included automated traffic by spiders etc. (the default setting in the Wikimedia Foundation's [https://wikimedia.org/api/rest_v1/#/Pageviews%20data/get_metrics_pageviews_aggregate__project___access___agent___granularity___start___end_ Pageviews API]), which would considerably impact the interpretations of the results. The paper itself notes that such an attempt was made for edits ("we tried to limit the impact of bots by requesting only contributions from users") but doesn't address the analogous question for pageviews.
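
For readers who want to check this themselves, the following sketch builds request URLs for the aggregate endpoint linked above, once for all agents and once restricted to human users. The helper name and the chosen project and date range are assumptions; the code only constructs the URLs and does not send any request.

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"


def aggregate_pageviews_url(project, agent, start, end,
                            access="all-access", granularity="daily"):
    """Build a Pageviews API URL; timestamps use the YYYYMMDDHH format."""
    return "/".join([BASE, project, access, agent, granularity, start, end])


# Hypothetical example: Swedish Wikipedia over the paper's three-year window.
url_all_agents = aggregate_pageviews_url("sv.wikipedia.org", "all-agents",
                                         "2021010100", "2024010100")
url_users_only = aggregate_pageviews_url("sv.wikipedia.org", "user",
                                         "2021010100", "2024010100")
print(url_all_agents)
print(url_users_only)
```

Comparing the two series for the same language would indicate how much of any observed change is due to automated traffic.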
 
An earlier version of the paper as uploaded to arXiv had the title "'The Death of Wikipedia?' – Exploring the Impact of ChatGPT on Wikipedia Engagement", which was later shortened by removing the attention-grabbing "Death of Wikipedia". As explained in the paper itself, that term refers to "an anonymous Wikipedia editor's fears that generative AI tools may lead to the death of Wikipedia" – specifically, the essay [[User:Barkeep49/Death of Wikipedia]], via its mention in a ''New York Times'' article, see [[Wikipedia:Wikipedia Signpost/2023-08-01/In the media]]. While the paper's analysis conclusively disproves that Wikipedia has died as of May 2024, it is worth noting that Barkeep49 did not necessarily predict the kind of immediate, lasting drop that the paper's methodology was designed to measure. In fact, the aforementioned NYT article quoted him as saying (in July 2023) "It wouldn't surprise me if things are fine for the next three years [for Wikipedia] and then, all of a sudden, in Year 4 or 5, things drop off a cliff." Nevertheless, the paper's findings leave reason to doubt whether this will be the first of the many [[predictions of the end of Wikipedia]] to become true.
 
===Briefly===
 
===Other recent publications===
''Other recent publications that could not be covered in time for this issue include the items listed below. Contributions, whether reviewing or summarizing newly published research, [[m:Research:Newsletter#How to contribute|are always welcome]].''
 
===="Do We Trust ChatGPT as much as Google Search and Wikipedia?"====
"Among all three information sources, Google was the most trusted platform, favored by 57% of our participants, followed by Wikipedia, which was liked by 29% of our participants [...]. Four participants expressed that ChatGPT is less credible than Google because it does not disclose the original source of the information."
</blockquote>
It should be noted that the authors' relieved conclusion ("thankfully") is somewhat in contrast with the result of a larger scale blind experiment published last year in preprint form (see our coverage: "[[m:Research:Newsletter/2023/September#In_blind_test%2C_readers_prefer_ChatGPT_output_over_Wikipedia_articles_in_terms_of_clarity%2C_and_see_both_as_equally_credible|In blind test, readers prefer ChatGPT output over Wikipedia articles in terms of clarity, and see both as equally credible]]").
 
 
====WikiChat, "the first few-shot LLM-based chatbot that almost never hallucinates"====
{{Wikipedia:Wikipedia Signpost/Templates/Inline image
|image = File:All WikiChat components, and a sample conversation about an upcoming movie, edited for brevity.svg
|size = 650px
|align = center
|alt = A diagram.
|caption = "All WikiChat components, and a sample conversation about an upcoming movie [Oppenheimer], edited for brevity. The steps taken to generate a response include (1) generating a query to retrieve from Wikipedia, (2) summarizing and filtering the retrieved passages, (3) generating a response from an LLM, (4) extracting claims from the LLM response (5) fact-checking the claims in the LLM response using retrieved evidence, (6) drafting a response, and (7) refining the response." (Figure 1 from the paper)
}}
From the abstract of this paper (by three graduate students at Stanford University's computer science department and [[w:Monica S. Lam|Monica S. Lam]] as fourth author):<ref>{{Cite conference| publisher = Association for Computational Linguistics| doi = 10.18653/v1/2023.findings-emnlp.157| conference = [[EMNLP]] 2023| pages = 2387–2413| <!--editors = Houda Bouamor, Juan Pino, Kalika Bali (eds.)|--> last1 = Semnani| first1 = Sina| last2 = Yao| first2 = Violet| last3 = Zhang| first3 = Heidi| last4 = Lam| first4 = Monica| title = WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia| book-title = Findings of the Association for Computational Linguistics: EMNLP 2023| location = Singapore| date = December 2023| url = https://aclanthology.org/2023.findings-emnlp.157}} [https://github.com/stanford-oval/WikiChat Code]</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and has high conversationality and low latency. WikiChat is grounded on the English Wikipedia, the largest curated free-text corpus. WikiChat generates a response from an LLM, retains only the grounded facts, and combines them with additional information it retrieves from the corpus to form factual and engaging responses. We distill WikiChat based on GPT-4 into a 7B-parameter [[LLaMA]] model with minimal loss of quality, to significantly improve its latency, cost and privacy, and facilitate research and deployment. [...] we show that our best system achieves 97.3% factual accuracy in simulated conversations. It significantly outperforms all retrieval-based and LLM-based baselines, and by 3.9%, 38.6% and 51.0% on head, tail and recent knowledge compared to GPT-4. Compared to previous state-of-the-art retrieval-based chatbots, WikiChat is also significantly more informative and engaging, just like an LLM. WikiChat achieves 97.9% factual accuracy in conversations with human users about recent topics, 55.0% better than GPT-4, while receiving significantly higher user ratings and more favorable comments."
</blockquote>
An online demo is available at https://wikichat.genie.stanford.edu/ . The [https://github.com/stanford-oval/WikiChat code] underlying the paper has been released under an open source license, and [https://github.com/stanford-oval/WikiChat?tab=readme-ov-file#run-a-distilled-model-for-lower-latency-and-cost two distilled models] (for running the chatbot locally without relying e.g. on OpenAI's API) have been published on Hugging Face.
 
See also our review of a previous (preprint) version of this paper: "[[m:Research:Newsletter/2023/July#Wikipedia-based_LLM_chatbot_%22outperforms_all_baselines%22_regarding_factual_accuracy|Wikipedia-based LLM chatbot 'outperforms all baselines' regarding factual accuracy]]"
 
===="A Simple Model of Knowledge Scaffolding Applied to Wikipedia Growth"====
From the abstract:<ref>{{Cite journal| doi = 10.3390/fi15020067| issn = 1999-5903| volume = 15| issue = 2| pages = 67| last1 = Bagnoli| first1 = Franco| last2 = de Bonfioli Cavalcabo’| first2 = Guido| title = A Simple Model of Knowledge Scaffolding Applied to Wikipedia Growth| journal = Future Internet| date = February 2023| doi-access = free}}</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"We illustrate a simple model of knowledge scaffolding, based on the process of building a corpus of knowledge, each item of which is linked to “previous” ones. [...]. Our model can be used as a rough approximation to the asymptotic growth of Wikipedia, and indeed, actual data show a certain resemblance with our model. Assuming that the user base is growing, at beginning, in an exponential way, one can also recover the early phases of Wikipedia growth."
 
===="males outperform females" when navigating Wikipedia under time pressure====
From the abstract:<ref>{{Cite journal| doi = 10.1038/s41598-024-58305-2| issn = 2045-2322| volume = 14| issue = 1| pages = 8331| last1 = Zhu| first1 = Manran| last2 = Yasseri| first2 = Taha| last3 = Kertész| first3 = János| title = Individual differences in knowledge network navigation| journal = Scientific Reports| date = 2024-04-09| pmid = 38594309| arxiv = 2303.10036| bibcode = 2024NatSR..14.8331Z| url = https://www.nature.com/articles/s41598-024-58305-2}}</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"we conducted an online experiment where participants played a navigation game on Wikipedia and completed personal information questionnaires. Our analysis shows that age negatively affects knowledge space navigation performance, while multilingualism enhances it. Under time pressure, participants’ performance improves across trials and males outperform females, an effect not observed in games without time pressure. In our experiment, successful route-finding is usually not related to abilities of innovative exploration of routes."</blockquote>
From the abstract:<ref>{{Cite thesis| publisher = University of Southampton| last = Kaffee| first = Lucie-Aimée| title = Multilinguality in knowledge graphs| date = October 2021| url = https://eprints.soton.ac.uk/456783/}}</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"In this thesis, we present studies to assess and improve the state of labels and languages in knowledge graphs and apply multilingual information. We propose ways to use multilingual knowledge graphs to reduce gaps in coverage between languages. We explore the current state of language distribution in knowledge graphs by developing a framework - based on existing standards, frameworks, and guidelines - to measure label and language distribution in knowledge graphs. We apply this framework to a dataset representing the web of data, and to Wikidata. [...] Due to its multilingual editors, Wikidata has a better distribution of languages in labels. [...] A way of overcoming the lack of multilingual information in knowledge graphs is to transliterate and translate knowledge graph labels and aliases. We propose the automatic classification of labels into transliteration or translation in order to train a model for each task. [...] A use case of multilingual labels is the generation of article placeholders for Wikipedia using neural text generation in lower-resourced languages. On the basis of surveys and semi-structured interviews, we show that Wikipedia community members find the placeholder pages, and especially the generated summaries, helpful, and are highly likely to accept and reuse the generated text."
</blockquote>
''See also [[mw:Extension:ArticlePlaceholder]] and our coverage of a subsequent paper: [[m:Research:Newsletter/2023/November#"Using_natural_language_generation_to_bootstrap_missing_Wikipedia_articles:_A_human-centric_perspective"|"Using natural language generation to bootstrap missing Wikipedia articles: A human-centric perspective"]]''
 
 
From the paper:<ref>Cedric Möller, Jens Lehmann, Ricardo Usbeck: [http://www.semantic-web-journal.net/content/survey-english-entity-linking-wikidata-0 Survey on English Entity Linking on Wikidata]. In: Semantic Web Journal, Special issue: Latest Advancements in Linguistic Linked Data, 2021; also as: {{cite arXiv | eprint = 2112.01989| last1 = Möller| first1 = Cedric| last2 = Lehmann| first2 = Jens| last3 = Usbeck| first3 = Ricardo| title = Survey on English Entity Linking on Wikidata| date = 2021-12-03 | class = cs.CL}} [https://github.com/semantic-systems/ELEnglishWD Code]</ref>
<blockquote style="padding-left:1.0em; padding-right:1.0em; background-color:#eaf8f4;">
"[[Entity Linking]] (EL) is the task of connecting already marked mentions in an utterance to their corresponding entities in a knowledge graph (KG) [...]. In the past, this task was tackled by using popular knowledge bases such as DBpedia [67], Freebase [11] or Wikipedia. While the popularity of those is still imminent, another alternative, named Wikidata [120], appeared."
</blockquote>
From the abstract:
{{reflist|30em}}
 
{{Signpost draft helper}}
 
<!--END OF ARTICLE -->
{{Wikipedia:Wikipedia Signpost/Templates/Signpost-block-end-v2}}
{{Wikipedia:Wikipedia Signpost/Templates/Signpost-article-end-v2}}
<noinclude>{{Wikipedia:Wikipedia Signpost/Templates/Signpost-article-comments-end||2024-04-25|}}</noinclude>