Parsoid/Parser Unification

From mediawiki.org
Revision as of 16:41, 13 August 2018 by SSastry (WMF) (talk | contribs) (Initial draft of the parser unification project high level summary)

MediaWiki on the Wikimedia cluster (and on several third-party MediaWiki installations) currently uses two separate wikitext parsers. One is the original PHP parser; the other is Parsoid, which is currently written in node.js and runs as an independent service. The PHP parser is used for all desktop read views and for iOS Wikipedia app views. Parsoid serves all editing clients (VisualEditor, Structured Discussions, Content Translation), linting tools (Extension:Linter), the Android Wikipedia app, the Kiwix offline reader, and the Google knowledge graph project.

The goal of this project is to arrive at a single parser that supports all clients and use cases.

This page tracks the unification project at a high level and links to other pages with additional details. Updates will continue to be published here. The project is primarily driven by the Parsing team, with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Community Liaisons team, Wikimedia wiki editor communities, and third-party MediaWiki projects, since parser unification will touch them all.

Updates

As of July 1, 2018, this work will be undertaken as part of the Platform Evolution CDP.

Q1 2018-2019

In this quarter, we will be preparing the Parsoid codebase for prototyping a port. Specifically, here are a few things we'll be working towards.

  • Implement unit testing and performance testing features: These features let us port individual token and DOM transformers, verify their correctness, and test their performance without needing a fully functional port.
  • Migrate more promises in Parsoid to use newer async/yield code patterns: The benefit of this pattern is that the code reads as if it were synchronous and is readily migratable to PHP.
  • Explore migrating media processing to a post-processing step: This frees the core parsing step from blocking on database access.
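To illustrate why the second item matters for a port, here is a minimal sketch contrasting a promise chain with sequential-looking asynchronous code. It uses standard async/await as a stand-in for the async/yield pattern mentioned above; it is not Parsoid code. The functions `fetchSource`, `toHtml`, `renderWithChain`, and `renderWithAwait` are hypothetical names invented for this example.

```typescript
// Hypothetical stand-in for an asynchronous lookup (for example, fetching
// the wikitext of a page). Not a real Parsoid or MediaWiki API.
function fetchSource(title: string): Promise<string> {
  return Promise.resolve(`'''${title}'''`);
}

// Hypothetical stand-in for an asynchronous transformation step:
// converts wikitext bold markup into an HTML <b> element.
function toHtml(wikitext: string): Promise<string> {
  return Promise.resolve(wikitext.replace(/'''(.*?)'''/g, "<b>$1</b>"));
}

// Promise-chain style: the control flow is spread across callbacks.
function renderWithChain(title: string): Promise<string> {
  return fetchSource(title)
    .then((src) => toHtml(src))
    .then((html) => `<p>${html}</p>`);
}

// async/await style: same behavior, but the body reads top to bottom
// like synchronous code, which is the property that eases a PHP port.
async function renderWithAwait(title: string): Promise<string> {
  const src = await fetchSource(title);
  const html = await toHtml(src);
  return `<p>${html}</p>`;
}
```

In the second form, each `await` could in principle be dropped to obtain a plain synchronous function with the same statement order, which is roughly what a line-by-line translation into PHP would look like.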

January 2018 - June 2018

In this timeframe, we ran a number of early experiments to gauge the feasibility of a PHP port of Parsoid.

.... to be completed ....

Background

The two parsers use different internal processing models to convert wikitext to HTML.

The PHP parser is largely based on string manipulation via regular expressions with a goal of low latency conversion from wikitext to HTML.

Parsoid was born out of the VisualEditor project to support visual editing, which required bidirectional conversion between wikitext and HTML, with additional constraints on the wikitext generated from edited HTML. In 2012, when this project was in its infancy, it wasn't fully clear how viable it was or where it would go. Since then, Parsoid has proved to be a successful project in its own right and has supported a number of additional projects beyond VisualEditor.

Unification

Since around 2015, it has been clear that, in the long term, the two-parser situation is untenable and that we have to consolidate behind a single parser.

There are two aspects to arriving at a single parser.

  1. Bridging the differing processing models and consequent output and feature differences between the two parsers
  2. Addressing the language and architectural differences between the two parsers. The Parsing/Notes/Two Systems Problem page documents these differences and various possible scenarios for what the unified parser could look like; see that page for more details.

We are tackling these two aspects / work categories concurrently.

Replacing the HTML4-based Tidy with the HTML5-based RemexHtml was one of the biggest projects in the first work category, and it has independent utility and purpose beyond the parser unification project. In addition, we have been continuously addressing the long tail of incompatibilities between the two parsers while continuing to handle editing client features and requests.

As for the second work category, after a lot of internal debate and discussion, we have started evaluating and prototyping a port of Parsoid into PHP. Please check the Parsing/Notes/Moving Parsoid Into Core page for more details and background about this aspect of the parser unification project.