Evaluate and document performance of one or two DOM transformers in node.js vs PHP
Closed, Resolved, Public

Description

To get a handle on the performance implications of porting Parsoid to PHP, we need to evaluate the different components of Parsoid.
One set of these components, in the wt -> html direction, is the DOM transformers.
Initial experiments in Feb 2018 indicated that DSR computation (after porting it to PHP) is faster in PHP, likely due to the C-backed DOM implementation.

This task is to do that evaluation more systematically on the DSR pass, and maybe one other pass, and to document the results here.

Event Timeline

Here are results for 3 different transforms on multiple pages, 50 iterations each. The per-pass numbers are totals over the 50 iterations; the "total for 1 iter" rows are the sum of the three passes divided by 50. The script that produced the raw numbers is included further below.

The test pages, and where the size of their Parsoid HTML output falls relative to all wt parses on the Wikimedia cluster: Skating (< p50), Hospet (p50 - p95), Hampi (> p95), Berlin (p99), Barack_Obama (> p99).

Using Skating HTML size as the base unit, the relative HTML sizes are as follows: Skating = 1; Hospet = 3; Hampi = 50; Berlin = 100; Barack_Obama = 200.

The net result seems to be that the PHP version is faster than the JS version on smaller pages, and the JS version is faster than the PHP version on larger pages. On a typical page, the PHP and JS versions might even out.

However, different passes yield different results. A compute-intensive pass like DSR computation, which visits every node and does a bunch of work, is slower in PHP. A compute-lean pass like section wrapping is always much faster in PHP, likely because the DOM API in PHP is C-backed. That is a good thing to know, and maybe a good reason to consider implementing the PEG tokenizer in C instead of PHP. (/cc @tstarling, @Anomie, @Tgr)

---------------------------------------------
Page         Pass            JS         PHP
---------------------------------------------
Skating (HTML-size  < p50; wt-size =~ p50)
             dsr           53.276     55.942
             pwrap         25.143      8.149
             sections      39.222      8.746
     ---------------------------------------
     total for 1 iter       2.353      1.457
     ---------------------------------------
                      
Hospet (p50 < HTML-size < p95; p50 < wt-size < p95)
             dsr           86.601    111.627
             pwrap         35.014     15.288
             sections      70.247     20.382
     ---------------------------------------
     total for 1 iter       3.837      2.946
     ---------------------------------------
                      
Hampi (p95 < HTML-size < p99; p95 < wt-size < p99) ... closer to p95
             dsr          718.266   1374.858
             pwrap        136.278    148.798
             sections     665.684    295.518
     ---------------------------------------
     total for 1 iter      30.404     36.383
     ---------------------------------------
                      
Berlin (p95 < HTML-size < p99; p95 < wt-size < p99) ... closer to p99
             dsr         1840.804   3494.870
             pwrap        245.237    303.308
             sections    1160.667    612.505
     ---------------------------------------
     total for 1 iter      64.934     88.214
     ---------------------------------------
                      
Barack_Obama (p99 < HTML-size; p95 < wt-size < p99)
             dsr         1904.729   3799.037
             pwrap        284.214    399.684
             sections    1749.297    976.605
     ---------------------------------------
     total for 1 iter      78.765    103.506
---------------------------------------------
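
As a quick sanity check on the table, the "total for 1 iter" rows can be reproduced from the per-pass totals, and a PHP/JS ratio makes the per-pass differences easier to compare. This is only a sketch: `per_iter_total` and `php_js_ratio` are hypothetical helper names, not part of Parsoid, and the inputs are copied from the table above. For some rows the last digit differs by one from the table, presumably because the table was computed from unrounded values.

```shell
#!/bin/bash

# Hypothetical helper (not part of Parsoid): per-iteration total across the
# three passes. Arguments: dsr pwrap sections iterationCount
per_iter_total() {
	awk -v a="$1" -v b="$2" -v c="$3" -v n="$4" \
		'BEGIN { printf "%.3f\n", (a + b + c) / n }'
}

# Hypothetical helper: ratio of PHP time to JS time for one pass.
# A value above 1 means PHP was slower; below 1 means PHP was faster.
php_js_ratio() {
	awk -v js="$1" -v php="$2" 'BEGIN { printf "%.2f\n", php / js }'
}

per_iter_total 53.276 25.143 39.222 50   # Skating, JS column  -> 2.353
per_iter_total 55.942 8.149 8.746 50     # Skating, PHP column -> 1.457
php_js_ratio 1840.804 3494.870           # Berlin, dsr         -> 1.90
php_js_ratio 39.222 8.746                # Skating, sections   -> 0.22
```

On these inputs, the dsr pass on Berlin takes about 1.9x as long in PHP as in JS, while section wrapping on Skating takes roughly a fifth of the JS time.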

The script below produced these results; it is run from a checkout of the php-prototype branch of Parsoid. @Sbailey worked on the domTests.js script and the --genTest option in parse.js, which I am using to generate the numbers here.

#!/bin/bash
# Usage: pass "js" or "php" as the first argument to select the implementation to time.
# The input files in /tmp/ are only generated in "js" mode, so run that mode first.

for page in Skating Hospet Hampi Berlin Barack_Obama
do
	if [ "$1" = "js" ]
	then
		# Generate the DOM test inputs for all three transformers for this page.
		echo "bin/parse.js --wrapSections --useBatchAPI --genTest dom:dsr,dom:pwrap,dom:sections --pageName $page --genDirectory /tmp/ < /dev/null > /dev/null"
		bin/parse.js --wrapSections --useBatchAPI --genTest dom:dsr,dom:pwrap,dom:sections --pageName "$page" --genDirectory /tmp/ < /dev/null > /dev/null
	fi
	for transformer in dsr pwrap sections
	do
		# Time 50 iterations of one transformer on the generated input.
		if [ "$1" = "js" ]
		then
			echo "bin/domTests.js --inputFile /tmp/$page --timingMode --iterationCount 50 --transformer $transformer"
			bin/domTests.js --inputFile "/tmp/$page" --timingMode --iterationCount 50 --transformer "$transformer"
		fi
		if [ "$1" = "php" ]
		then
			echo "php php/bin/domTests.php --inputFile /tmp/$page --timingMode --iterationCount 50 --transformer $transformer"
			php php/bin/domTests.php --inputFile "/tmp/$page" --timingMode --iterationCount 50 --transformer "$transformer"
		fi
	done
done