[Spike 6hrs] Investigate ability of vivliostyle to render single articles
Closed, Resolved · Public · Spike

Description

We would like to evaluate the ability of vivliostyle to render single-page articles. In particular, we want to answer the following questions:

  • How does it perform when rendering tables?
  • Can it provide page numbers?
  • Can it provide support for blue links/links to other articles/other hyperlinks?
  • Are there other noted edge cases where vivliostyle breaks?
  • Is there support for a two-column layout?
  • Can it use the same (core) CSS file for print styles?

Also, as a result of the spike, provide rendered articles.

Note:
Here's a list of test pages that use various templates/tables/scripts:
https://en.wikipedia.org/wiki/Berlin
https://en.wikipedia.org/wiki/Trigonometric_functions
https://en.wikipedia.org/wiki/Climate_of_Australia
https://en.wikipedia.org/wiki/Santiago
https://zh.wikipedia.org/wiki/%E5%9C%A3%E5%9C%B0%E4%BA%9A%E5%93%A5_(%E6%99%BA%E5%88%A9)
https://ar.wikipedia.org/wiki/%D8%B3%D8%A7%D9%86%D8%AA%D9%8A%D8%A7%D8%BA%D9%88
https://ru.wikipedia.org/wiki/%D0%A1%D0%B0%D0%BD%D1%82%D1%8C%D1%8F%D0%B3%D0%BE

Event Timeline

ovasileva moved this task from Incoming to 2014-15 Q4 on the Web-Team-Backlog board.

Here are some quick tests with the standalone vivlio viewer (http://vivliostyle.github.io/vivliostyle.js/viewer/vivliostyle-viewer.html#x=<url>; note that <url> must support CORS, so RESTBase works but plain Wikipedia doesn't) plus Chrome's "Save as PDF" feature (default settings). This is not necessarily indicative of all that's possible with vivliostyle.js: for nicer display, some CSS needs to be embedded into the page (e.g. they have some examples of doing TOC/page numbers that way). It might be good for spotting problems, though.
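
For anyone repeating these quick tests, here is a minimal sketch of how the viewer URL can be assembled. The viewer address comes from the comment above and the RESTBase path is the public `/api/rest_v1/page/html/` endpoint. The helper names are made up for illustration, and passing the document URL unescaped after `#x=` is an assumption.

```python
from urllib.parse import quote

# Viewer address quoted in the comment above.
VIEWER = "http://vivliostyle.github.io/vivliostyle.js/viewer/vivliostyle-viewer.html"


def restbase_html_url(domain: str, title: str) -> str:
    """URL of the RESTBase HTML rendering of a page (served with CORS headers)."""
    return f"https://{domain}/api/rest_v1/page/html/{quote(title, safe='')}"


def viewer_url(document_url: str) -> str:
    """Viewer URL that loads document_url through the #x= fragment parameter."""
    # Assumption: the URL is appended as-is; whether it needs escaping is untested.
    return f"{VIEWER}#x={document_url}"


if __name__ == "__main__":
    print(viewer_url(restbase_html_url("en.wikipedia.org", "Trigonometric_functions")))
```

Open the printed URL in Chrome and use "Save as PDF" to reproduce the setup described above.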

Results:

  • infobox styling disappears. The text is still visible and has a table-like layout but table borders and background colors disappear. (Many other things such as thumbnails also become very plain.)
  • the text does not flow around floated elements
  • weird page break after the disambiguation template in Trigonometric functions
  • fractional notation breaks in a weird way
  • no two-column layout for references
  • navboxes look extremely broken (e.g. end of Santiago)
  • galleries look very broken as well

Most of this looks like over-eager style normalization. Question is, is it a core feature that vivliostyle.js relies on for pagination, or just some default CSS stylesheet that can be stripped away?

infobox styling disappears. The text is still visible and has a table-like layout but table borders and background colors disappear. (Many other things such as thumbnails also become very plain.)

With the "background graphics" print option enabled, some table styles are preserved but it's not a huge improvement.

Vivliostyle seems to be licensed under the AGPLv3. You should check with Legal on whether that's acceptable and what restrictions it would place on us, both in general and in the particular case of PDF output.

@Tgr - Reviewed the PDFs and your observations are pretty similar to mine. A few more things:

General questions:

  • Can we add title to the page?
  • Can we display the table of contents?
  • Will hyperlinks work?
  • Can we have page numbers on single-page articles?
  • Page size seems strange (square)

Trig functions

  • In addition to fraction notation, formulas after line breaks ($sinA = ...$ on the bottom of page 6, for example) seem to be rendering in bold.
  • The bottom of the infobox seems to have disappeared.

سانتياغو

  • There seems to be a floating scrollbar on page 2
  • Image position doesn't seem related to text position, images are floating when they are not in the original article

@Tgr - how much of the above do you think we can fix with our css? Also @Nirzar - could you also review?

@Tgr - also, do we know what happened with Chinese?

I spent some hours trying to extract the internal logic from the vivliostyle app but did not get far. The UI and the print simulation are separate libraries, but the internals are undocumented and rather complex (plus it uses Knockout and the Google compiler, so debugging is not a lot of fun). It would take significantly more effort; asking the vivliostyle developers is probably a better approach.

  • Can we add title to the page?
  • Can we display the table of contents?
  • Can we have page numbers on single-page articles?

These should be doable, some demo vivliostyle documents do them.
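
As a rough illustration of what "embedding some CSS into the page" could look like, the sketch below injects a small print stylesheet into the article HTML before handing it to vivliostyle. The `@page` margin box with `counter(page)` and `target-counter()` are standard CSS paged-media features. The `.toc` selector, the A4 page size and the helper name are assumptions, and how completely vivliostyle honours these rules has not been verified here.

```python
# Assumed print rules: page numbers in the bottom margin and page references
# next to TOC links. ".toc" is a hypothetical class name.
PRINT_CSS = """
@page {
  size: A4;
  @bottom-center { content: counter(page); }
}
.toc a::after { content: " " target-counter(attr(href), page); }
"""


def embed_css(html: str, css: str) -> str:
    """Insert a <style> element just before </head>, or prepend it if there is no head."""
    style = f"<style>{css}</style>"
    head_end = html.lower().find("</head>")
    if head_end == -1:
        return style + html
    return html[:head_end] + style + html[head_end:]


if __name__ == "__main__":
    page = "<html><head><title>Berlin</title></head><body><h1>Berlin</h1></body></html>"
    print(embed_css(page, PRINT_CSS))
```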

  • Will hyperlinks work?

Probably if we rewrite the original document to have internal links in the form of "#foo" and not "./Page_name#foo". I'll give it a try.
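
A minimal sketch of that rewrite, assuming the links appear literally as `href="./Page_name#foo"` in the HTML. The function name is made up, and it does not handle percent-encoded or otherwise differently spelled page names.

```python
import re


def localize_links(html: str, page_name: str) -> str:
    """Turn links that point at a section of the same page into bare fragment links."""
    prefix = re.escape("./" + page_name)
    pattern = re.compile(r'href="' + prefix + r'#([^"]*)"')
    # Links to other pages are left untouched.
    return pattern.sub(r'href="#\1"', html)


if __name__ == "__main__":
    sample = '<a href="./Berlin#History">History</a> <a href="./Potsdam#History">Potsdam</a>'
    print(localize_links(sample, "Berlin"))
```

A regular expression is enough for a quick experiment; a real implementation would more likely walk the DOM.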

  • Page size seems strange (square)

There are configuration options for that (in the vivlio app, the blob of color on the upper right).

  • In addition to fraction notation, formulas after line breaks ($sinA = ...$ on the bottom of page 6, for example) seem to be rendering in bold.

That's a Chrome issue, I see the same if I just print the RESTBase page directly. No idea if that can be tweaked with CSS.

  • The bottom of the infobox seems to have disappeared.

Oh yeah, I have seen that elsewhere too. It happens sometimes when a table does not fit a single page.

  • There seems to be a floating scrollbar on page 2

And the text is clipped. This is also a generic Chrome thing. In the end probably a styling error / antifeature in the infobox: someone set that table cell to have limited height and the browser obeys it.

also, do we know what happened with Chinese?

No clue. The error is very generic (some kind of internal logic error in a size estimation algorithm), but I can't even find the non-minified version of that code.

how much of the above do you think we can fix with our css?

No idea really. I believe most of this is just the vivliostyle stylesheet aggressively overriding Wikipedia styles, but whether that's just something they do because it works well on average for non-print-optimized web pages, or they are disabling CSS features that would otherwise confuse their pagination logic... I can mess randomly with the stylesheet in a local instance, but maybe it's more effective to just ask the vivliostyle developers and/or file the issues above as bugs.

@Tgr - it sounds like our next step should be to reach out to them directly, in parallel to trying to change the stylesheet a bit and see what happens. @GWicke - do you know if we have a point of contact at vivliostyle?

@Tgr - also, do we know what happened with Chinese?

T167603 afaik

@GWicke - do you know if we have a point of contact at vivliostyle?

As mentioned in an earlier mail, http://vivliostyle.com/en/about/ lists Florian Rivoal and Johannes Wilms as evangelists. I haven't talked to them yet.

From what I can see in the Chrome inspector, the Wikipedia print styles aren't loaded at all in the basic vivlio viewer. It seems likely that this should be an easy thing to fix.

Change 362319 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/Collection@master] [WIP] Render TOC with vivliostyle

https://gerrit.wikimedia.org/r/362319

From what I can see in the Chrome inspector, the Wikipedia print styles aren't loaded at all in the basic vivlio viewer.

I think most of the breakage is due to site styles (Common.js etc) not working. Might be some kind of CORS issue although I saw no errors.

Anyway, I set up a local vivliostyle instance to use the local wiki and that worked pretty well. sample pdf
The one big problem that remains is large tables getting cut off. (Sometimes divs too, it might depend on some CSS property. Did not have time to experiment.)


Steps for testing:

  • check out https://gerrit.wikimedia.org/r/#/c/356991/ in vagrant, enable the offline role, provision
    • might run into T166953 in the process, just re-run npm install --no-bin-links manually in the affected service directories in that case.
  • check out https://gerrit.wikimedia.org/r/#/c/362319/ in the Collection extension, run composer update
    • I just hardcoded the vagrant port number in SpecialRenderBook line 70, so you'll have to change that. The proper way would be using Electron via VirtualRESTService, but the restbase vagrant role seems to be broken.
  • enable CORS for non-API requests (e.g. with P5653)
  • create some test pages with content identical to enwiki. (The patch fetches HTML from enwiki since getting decent HTML rendering in a vagrant box is near impossible, but other stuff like sections are fetched from the local API.) Also copy MediaWiki:Common.css from enwiki as templates will look very broken without it.
  • add a collection the usual way
  • visit Special:RenderBook.
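
The CORS step in the list above can be sanity-checked before loading anything in vivliostyle. The sketch below only inspects response headers that have already been fetched (for example with curl or the browser's network tab). The function name and the example origin are hypothetical.

```python
def allows_cross_origin(headers: dict[str, str], origin: str) -> bool:
    """Return True if the response headers let a page served from origin read the response."""
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    allowed = normalized.get("access-control-allow-origin")
    return allowed in ("*", origin)


if __name__ == "__main__":
    # Hypothetical headers from a non-API request to the local wiki.
    response_headers = {"Content-Type": "text/html", "Access-Control-Allow-Origin": "*"}
    print(allows_cross_origin(response_headers, "http://localhost:8080"))
```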

For convenience, here is an export file with Common.css and some test files:


and a script for setting up the collection: P5654

Also note that using vivliostyle with Electron did not work at all, it just says Render failed. (Printing the vivliostyle page from my browser worked fine.) Again, I did not have time to look into that. Since Vivliostyle Formatter (their commercial PDF rendering tool built on top of Vivliostyle Viewer) itself uses Chrome for printing, I imagine that's not insurmountable.

Locally rendered test PDF for the other articles in the task description: F8618676, part 2 (cut in two because vivliostyle chokes on the Australia article). Checking for the comments above:

  • Styling of infoboxes, navboxes and other complex templates looks fine, with some small differences (such as a missing border) which are likely unintentional differences between screen and print version of the local styles maintained by the wiki community.
  • Floats work, although many elements that are floated in the original article are not in the PDF (again probably a stylesheet issue).
  • The page break at the beginning of Trigonometric functions is still there. No obvious reason for it in the CSS; maybe caused by the next element having a clear property?
  • Fractional notations look fine.
  • References are displayed in a single column - that's actually done by Wikimedia print styles, not related to Vivliostyle.
  • Galleries still look broken.

Other remarks:

  • The alt text for the images is displayed like a caption. Seemed useful for the specific examples I looked at.
  • The zhwiki article works fine when I use the locally installed version of vivliostyle.
  • Interestingly, some tables are sliced up just fine (see bottom of page 97 of the second test file). Others still get cut off though.

Other remarks:

  • The zhwiki article works fine when I use the locally installed version of vivliostyle.

Are you sure

聖地亞哥(西班牙语:-{Santiago}-;或译聖雅各),也稱為圣地亚哥-德智利(西班牙语:-{Santiago de Chile}-;意為「智利的聖地亞哥」),是智利的首都和最大城市,也是聖地亞哥首都大區的首府。其座落於智利中部的中央谷地(英语:-{Chilean Central Valley}-),海拔約520公尺。「-{Santiago}-」在西班牙文中即「圣雅各」之意(拉丁語原文「-{Sanct-Iacobi}-」)。

Is it working for you? We really shouldn't show -{ and }- on vivliostyle either.

Filed some bugs:

We really shouldn't show -{ and }- on vivliostyle too.

That's a bug in Parsoid (T43716), very close to being fixed I believe.

@Tgr, a few more notes:

On the article in Arabic, it seems that the text is left-aligned rather than right-aligned. I'm unsure if this is due to our current print styles. Tables currently seem to be the greatest offender. Should we create a separate task for getting them to work?

On the article in Arabic, it seems that the text is left-aligned rather than right-aligned

The test in T168004#3361406 looks right. So either vivliostyle can't fully handle Unicode text direction on its own (wiki pages add some styling that overrides that and makes the test look OK, and the override gets lost during article concatenation), or concatenation breaks text direction in some other manner. In any case, it should be easily fixable on our side.
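
One possible shape of such a fix, purely as a sketch: wrap each article in an element that carries its own `lang` and `dir` attributes before concatenating, so the direction no longer depends on page-level styling. The function names and the short RTL language list are assumptions, and this is not how the Collection patch currently works.

```python
from html import escape

# Assumption: illustrative subset of right-to-left language codes, not exhaustive.
RTL_LANGS = {"ar", "fa", "he", "ur"}


def wrap_article(body_html: str, lang: str) -> str:
    """Wrap one article's HTML so that it carries its own language and text direction."""
    direction = "rtl" if lang in RTL_LANGS else "ltr"
    return f'<div lang="{escape(lang)}" dir="{direction}">{body_html}</div>'


def concatenate(articles: list[tuple[str, str]]) -> str:
    """Join (lang, body_html) pairs into a single document body."""
    return "\n".join(wrap_article(body, lang) for lang, body in articles)


if __name__ == "__main__":
    print(concatenate([("en", "<p>Santiago</p>"), ("ar", "<p>سانتياغو</p>")]))
```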

Tables currently seem to be the greatest offender.

IMO the greatest offenders are JS layout errors which cause part or all of the content to not appear at all (vivliostyle issues #368 and #369, and I'm sure we'll find more if we test more than a handful of articles). We'd have to rely on vivliostyle developers to get those fixed, so if we decide to use vivliostyle we might want to set up some sort of support arrangement with them.

The second greatest offender so far is vivliostyle's inability to recover from CSS errors (which are often placed into our CSS stylesheets intentionally, to target old browsers with incorrect handling of CSS). Again something that we would have to rely on vivliostyle developers to fix. (@Anomie just wrote a shiny new CSS validator for TemplateStyles; if we really want to, we could use that to remove invalid CSS, maybe as a new Resourceloader mode (similar to how debug=1 works). But it does not seem like a reasonable approach.)

The other issues we have seen so far do not seem like a dealbreaker to me and even if they are, CSS workarounds are probably possible (such as making sure no large containers are floated in print mode).

@Tgr - sounds good. Do you think you could reach out to vivliostyle with some of the main questions/issues?

Tgr removed Tgr as the assignee of this task. Jul 5 2017, 6:18 PM
Tgr subscribed.

@Tgr - sounds good. Do you think you could reach out to vivliostyle with some of the main questions/issues?

Done.

This went way beyond a spike but it's mostly done now - we have an idea of how Vivliostyle-generated articles look, and how reliable it is. The missing part is to use it together with Electron - that did not work for me but I only put in minimal effort and others use it that way (there is even an npm package) so it can't be too hard. I ran out of time so someone else will have to look into that though.

From what I can see in the Chrome inspector, the Wikipedia print styles aren't loaded at all in the basic vivlio viewer.

I think most of the breakage is due to site styles (Common.js etc) not working. Might be some kind of CORS issue although I saw no errors.

FWIW that turned out to be issue #200.

@bmansurov - based on all of the above, do you feel like you have clear steps for picking this up from the work @Tgr did? Mainly, investigating how many of the vivliostyle bugs we can fix from our side?

@ovasileva I'm not exactly sure because I'll have to get the same thing running locally first. This will give me a better picture. I don't see it as a problem, it's just I don't have the full picture in my mind yet. I guess we can always ping Gergo if I have further questions.

@bmansurov - sounds good. In terms of next sprint then, let's pull the spike in as it is, and you can change the description as you see fit once you know a bit more

Here is an alternative Berlin PDF that was generated from a local wiki. This one includes a table of contents but has multiple other problems, such as missing page numbers and a chopped infobox.

@bmansurov, @Tgr - this version seems more broken than the one @Tgr created last. Do we know what the differences were?

Chopped infobox is a limitation of vivliostyle. Probably can be worked around by tweaking the CSS of the infobox (such as making sure it does not float). Not sure about the page numbers, that should work. They are defined in the same CSS file where the TOC numbers are, and those work so a syntax error causing the vivliostyle CSS parser to bail out is unlikely. Maybe rendering did not finish yet? I see some of the TOC ?? marks are also not replaced, and that can take a surprisingly long time to happen. Or maybe it errored out, in which case there should be a JS error on the console.

Discussed options on working with vivliostyle to resolve current bugs. Blocked on an estimate of difficulty and time from the vivliostyle team.

ovasileva changed the task status from Open to Stalled. Jul 26 2017, 5:25 PM
ovasileva moved this task from 2017-18 Q1 to Triaged but Future on the Web-Team-Backlog board.
phuedx claimed this task.
phuedx subscribed.

AFAIK we're no longer considering vivliostyle.

Change 362319 abandoned by Gergő Tisza:
[WIP] Render TOC with vivliostyle

Reason:
We have decided not to use vivliostyle.

https://gerrit.wikimedia.org/r/362319

Restricted Application changed the subtype of this task from "Task" to "Spike". Nov 13 2021, 8:49 PM