Identify a suitable PEG.js replacement in PHP
Closed, ResolvedPublic

Description

Parsoid uses a fork of PEG.js that @tstarling worked on. This fork adds some features to PEG.js to remove some JS / async related hacks to PEG.js and improve the tokenizer-generation performance of PEG.js.

To port Parsoid to PHP, we need a replacement for this PEG tokenizer.

Here are some options available to us.

  • There is phppegjs which is a plugin for PEG.js that generates a PHP tokenizer instead of a JS tokenizer. It also enables co-location of PHP and JS action code in the PEG tokenizer. But, this requires us to do one of the following:
    1. Abandon Tim's fork and adapt Parsoid-PHP to use this tokenizer. This is not a workable solution out of the box.
    2. Upstream some of Tim's changes to PEG.js, and then use the php-peg plugin. This requires us to separate out the necessary features and upstream them and for the maintainer to be interested in these changes.
    3. Implement the php-peg plugin on top of Tim's fork.
  • Evaluate PHP-PEG and see if our PEG grammar works with that
  • If performance of the tokenizer is a potential concern, evaluate C-PEG and see if our PEG grammar works with that.

This task is to evaluate our options and propose a suitable solution that meets our functional and performance requirements.

Event Timeline

ssastry moved this task from Backlog to Prototype / Evaluation on the Parsoid-PHP board.

I think option 3 (just rebase phppegjs on top of Tim's fork) is the shortest/safest way to go. There haven't been any commits to phppegjs since Jan 2018, so it's not like we're missing out on active upstream development if we do this. (But note https://github.com/nylen/phpegjs/pull/1 which we'll have to address).

Rather than try to have PHP and JS action rules co-exist, I think the quickest port will be to use our existing js2php tools to automatically convert our rules. This lets us automate the process so we can keep the rules in sync during a short transition window. My intuition is that our rules are fairly simple JS, and the phppegjs runtime should be similar enough to the JS runtime (or made to be so) that automatic conversion is feasible.

We would require a little source-to-source widget which would parse the pegjs file, extract the JS action rules, then pass those over to the js2php tool, and splice them together on output. Given that pegjs has a good & simple grammar to parse rules already, this should be a simple helper to write.

If there are difficulties here, we might consider some of the other ideas we've had previously to simplify the JS action rules in our tokenizer, for example to declaratively specify the cache management we do. That would be worthwhile iff it allows us to move difficult-to-automatically-port JS out of the action rule clauses.

I think option 3 (just rebase phppegjs on top of Tim's fork) is the shortest/safest way to go.

I am leaning towards a variant of this which is to simply port the ast-to-regalloc-js.js to generate PHP code. To do that, we'll benefit from the already existing phppegjs code to find equivalent PHP constructions.

But note https://github.com/nylen/phpegjs/pull/1 which we'll have to address

I don't think we have rules affected by this? In any case, this may not be such a serious issue in that as part of porting our tokenizer peg grammar, we can convert over any rules that might be affected ... this of course means that we will not have an exact 1-1 correspondence. But, that is where a testing framework will save us by catching problems early.

I've just been looking at the phpegjs code. Performance will be extremely sensitive to character class matching. Ideally, that should be inlined, instead of split out to a runtime library function peg_char_class_test(). If the class is purely ASCII, then it should be possible to consume text without doing UTF-8 parsing. For example, Parsoid has:

url_protocol =
    & { return Util.isProtocolValid(input.substr(endOffset()), env); }
    p:$( '//' / [A-Za-z] [-A-Za-z0-9+.]* ':' '//'? ) { return p; }

A reasonably efficient implementation of the second character class would be:

$ord = isset( $input[$pos] ) ? ord( $input[$pos] ) : -1;
$match = ( $ord === 45 ) || ( $ord >= 65 && $ord <= 90 ) || ( $ord >= 97 && $ord <= 122 ) || ( $ord === 43 ) || ( $ord === 46 );

Another possibility is to compile character classes to PCRE, then it becomes:

$match = !!preg_match( '/[-A-Za-z0-9+.]/A', $input, $m, 0, $pos );

In phpegjs it is:

peg_char_class_test( [[45, 45], [65, 90], [97, 122], [43, 43], [46, 46]], $this->input_substr( $pos, 1 ) );

Where peg_char_class_test() does UTF-8 interpretation of the input string and loops over the class array, and input_substr() is inexplicably slow and wrong.

I wonder if it would be better to port the PEG.js compiler to PHP? At least, once the migration is done, we won't want to have a node.js build step.

Maybe @ssastry is right that I should look at this myself, it is kind of fun to think about.

Thanks. Makes sense. That peg_char_class_test business seems broken. But, if we use ast-to-regalloc-js.js as the basis for porting.... that file has a ".test(." against a regexp which presumably would translate over directly to a preg_match in php-land with the regexp instead of the loop as in phppegjs.

I wonder if it would be better to port the PEG.js compiler to PHP? At least, once the migration is done, we won't want to have a node.js build step.

You mean port all of peg.js to php? That seems like a larger undertaking than necessary, no? The build step is entirely on the dev machines and the built php tokenizer can be checked into the git repo. Or at least, just porting ast-to-regalloc-js.js to ast-to-regalloc.php.js would be a good first step?

Maybe @ssastry is right that I should look at this myself, it is kind of fun to think about.

I won't complain. :-)

We have since settled on option 3. and rebranded the fork as WikiPeg.

WikiPeg work is tracked in T219337: Port Parsoid tokenizer to PHP, T219339: PHP generator backend for WikiPEG, and T219341: Cross-language test suite for WikiPEG.