Typesetting Engines: A Programmer's Perspective

blog.ppresume.com

146 points by P_qRs 4 days ago

> Indo-European languages: a language family native to the overwhelming majority of Europe, the Iranian plateau, and the northern Indian subcontinent. Widely spoken indo-european languages includes English, French, Portuguese, Russian, Dutch, and Spanish, etc.

> Indo-European languages typically use the Latin alphabet

After the first sentence, "Indo-European" seems to have transformed to just "European" in the author's mind. Hindi and Bengali, languages more widely spoken than half the languages in that list, seem to have been forgotten, along with their Devanagari and Bengali scripts.

(Over the course of the article, it seems like the author just wanted to say European languages, or languages using Latin script, and for some reason chose to use Indo-European instead, despite clearly stating the definition themself.)

xiaohanyu 4 days ago

Thanks for pointing this out.
Yes you are right, I am not a linguist so I have little knowledge of "Indo" languages.
Originally I adopted the word of "Germanic languages" then I found Spanish is not a Germanic language, hence I then adopted "Indo-European" languages.
This needs a fix for sure.
- messe 3 days ago
  
  > Originally I adopted the word of "Germanic languages" then I found Spanish is not a Germanic language
  Also worth noting that out of those you listed, Russian is also not a Germanic language (it's Slavic), and does not use the Latin alphabet.
cafard 4 days ago

Not just "European", Western European. A whole lot of people read and write with the Cyrillic alphabet.
- Tainnor 4 days ago
  
  And then there's Greek.
  - bafe 3 days ago
    
    And Armenian, which is also Indo-European and uses its own alphabet
  - vkazanov 3 days ago
    
    And the Greek alphabet is sort of a base for both major groups of European alphabets, Latin and Cyrillic.
    Can we just say Greek-based alphabets?
    
    messe 3 days ago
    
    Let's just go further back, and say Egyptian hieroglyph derived.
    
    vkazanov 3 days ago
    
    Surprisingly, I wanted to correct you but the Phoenician alphabet is thought to be derived from Egyptian as well.
    So Egyptian it is.
- xiaohanyu 4 days ago
  
  roger that, thanks!
chrismorgan 4 days ago

My understandings, as one very familiar with Indic scripts, very familiar with Unicode in general, but not a CJK user, so please correct me if I’ve blundered:
• Indic scripts need the renderer to support complex text shaping, or else the text will generally be illegible, as though you were drawing your letters wrong, stacking some vowels on top of each other, and other nonsense things like that. As an example, if you’re not familiar with Indic scripts, see the code points used to write my name in Telugu, and how they contribute to the rendering: https://temp.chrismorgan.info/%E0%B0%95%E0%B1%8D%E0%B0%B0%E0.... It’s basically “letter ka, delete the vowel, letter ra, delete the vowel, add vowel i, letter sa, delete the vowel”, but the “kri” will normally be joined together into a conjunct, with the vowel sign drawn on the first consonant, and the second consonant being drawn in a completely different way from normal, which may even affect layout by font—the r conjunct can be a semicircle below, as in that font, but it can also be a curve beginning on the left, shifting the k to the right. (Me, I like the curve style for no particular reason, but the semicircle seems more popular these days. If this concept seems weird to you, reflect that English has allographs too <https://en.wikipedia.org/wiki/Allograph>, though mostly not particularly affecting layout.)
• But as regards line breaking, Indic scripts are much the same as English.
• CJK shaping/rendering can have a bit of complexity because of Han unification <https://en.wikipedia.org/wiki/Han_unification>, and definitely has a lot more nuanced stuff like mixing horizontal and vertical writing modes, and what to do when you mix scripts (which happens much more than with Indic scripts), especially digits, and especially when combining vertical and horizontal. But if your engine doesn’t support any of this, your document should still at least be fully intelligible—just uglier.
• CJK line breaking is awful: where most languages have settled on using spaces to separate words, most CJK languages mostly don’t (Korean does, I believe), and so you pretty much need to know the language to avoid breaking in the middle of words. So you end up with things a bit like hyphenation dictionaries to try to do a good-enough job of it. Again, if your engine doesn’t support this, your document should still be intelligible—just uglier.
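
To make the code point sequence in the first bullet concrete, here is a minimal Python sketch. It assumes the six code points are the ones spelled out character by character in the colour-split example further down this thread; `describe` is a hypothetical helper name, and only the standard `unicodedata` module is used.

```python
import unicodedata

# Assumption: these are the six code points of the Telugu name, taken from
# the per-character colouring example elsewhere in the thread.
name = "\u0c15\u0c4d\u0c30\u0c3f\u0c38\u0c4d"


def describe(text: str) -> list[tuple[str, str]]:
    """Return (U+XXXX, Unicode character name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Print one line per code point, in logical (typed) order.
for code_point, label in describe(name):
    print(code_point, label)
```

The list is in logical order, which is what a shaping engine receives; how those six code points turn into fewer, differently arranged glyphs is decided by the font and the shaper, not by the text itself.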
- nicoburns 4 days ago
  
  That graphic of the Indic glyph is very interesting! Definitely explains why shaping is so complex for those scripts!
  Regarding CJK line-breaking, my understanding was that it was only Thai and closely related languages that required dictionary-based line breaking, and that Chinese/Japanese had simpler rules mostly concerning punctuation. But I'm not certain about that.
  - doabell 4 days ago
    
    Yes, for Chinese & Japanese, not breaking words is nice, but not always practical. Maybe if you’re writing a speech, so as not to mispronounce the word in the 5% of cases when that happens. The CSS line-break property pretty much sums up the actual rules. Some apps do ship a dictionary to allow for double-click selection of words. They don’t always get it right, though.
  - BeFlatXIII 3 days ago
    
    > That graphic of the Indic glyph is very interesting! Definitely explains why shaping is so complex for those scripts!
    So _this_ must be why the Affinity suite doesn't properly render Devanagari, yet Inkscape can.
- hsfzxjy 3 days ago
  
  > CJK line breaking is awful
  It's not true for Chinese. Chinese allows line breaks after any character.
  - chrismorgan 3 days ago
    
    My impression (again, open to correction) was that, although that's true, there are many places where breaking is not preferable, like how you can hyphenate in English but should prefer not to. Many in Japanese, basically needing a dictionary, and fewer in Chinese but still some.

rikroots 4 days ago

The most painful issues I encountered when building out a text layout engine for my JS 2D canvas library were:

- Vertical text - in particular when it comes to how CJK punctuation differs in horizontal and vertical environments (not yet solved)

- Staying on CJK, making sure the punctuation marks that follow a character don't break and remain with their preceding characters at all times. (I expect the same holds for opening quotes etc but haven't experimented).

- Highly ligatured fonts - Devanagari, Arabic, etc - there's no solution to styling individual characters within a word that I could find.

- Talking of styling ... underlining text is a nightmare - especially if you want to get the little gap between hanging characters and the underline that HTML/CSS browsers do out of the box

- Formatting Thai fonts ... is another World of Hurt[1]

[1] - https://w3c.github.io/sealreq/thai/
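
The CJK punctuation point in the list above (marks staying with the character before them, and presumably opening quotes staying with the character after) can be sketched roughly as follows. This is a simplified illustration, not how any particular library does it: the two punctuation sets are small hand-picked assumptions, `break_opportunities` and `wrap` are hypothetical names, and width is counted in characters rather than measured glyph advances.

```python
# Assumption: tiny illustrative sets. Real rule sets are much larger.
NO_BREAK_BEFORE = set("、。，．！？：；）」』】〉》")  # must stay with the previous character
NO_BREAK_AFTER = set("（「『【〈《")  # must stay with the next character


def break_opportunities(text: str) -> list[int]:
    """Indices i where a line may break between text[i - 1] and text[i]."""
    return [
        i
        for i in range(1, len(text))
        if text[i] not in NO_BREAK_BEFORE and text[i - 1] not in NO_BREAK_AFTER
    ]


def wrap(text: str, width: int) -> list[str]:
    """Greedy wrap to at most `width` characters per line where the rules allow."""
    allowed = set(break_opportunities(text))
    lines: list[str] = []
    start = 0
    while len(text) - start > width:
        # Latest permitted break that keeps the line within the width.
        cut = next((i for i in range(start + width, start, -1) if i in allowed), None)
        if cut is None:
            # Nothing fits: let the line overflow up to the next permitted break.
            cut = next(
                (i for i in range(start + width + 1, len(text)) if i in allowed),
                len(text),
            )
        lines.append(text[start:cut])
        start = cut
    if start < len(text):
        lines.append(text[start:])
    return lines


print(wrap("你好，世界。", 2))
```

With these sets, a line is ended one character early rather than letting a comma or full stop start the next line.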

chearon 4 days ago

Paste the following into https://chearon.github.io/dropflow/ to see that it _is_ possible to style individual Arabic characters in canvas:
<div style="font-size: 10em;">ع<span style="color: blue;">ر</span>ب<span style="color: red;">ي</span></div>
That uses harfbuzzjs to do shaping and, for the segments that it has to, it paints paths instead of using fillText. There is an even better method which Mozilla's pdfjs uses: for all the glyphs that you want to draw, build a font (easy with HarfBuzz) that maps sequential characters to those glyphs. Then use fillText with that font and the character that corresponds to the glyph that you want for each glyph. That's nearly as fast as fillText on the whole string.
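
The bookkeeping behind that second method can be sketched in Python. This is only an illustration under stated assumptions, not pdfjs's or dropflow's actual code: the `ShapedGlyph` type, the function names, and the choice of the Private Use Area as the pool of sequential characters are all hypothetical, and building the synthetic font itself is left out.

```python
from typing import NamedTuple

PUA_START = 0xE000  # first code point of the BMP Private Use Area
PUA_END = 0xF8FF    # last code point of the BMP Private Use Area


class ShapedGlyph(NamedTuple):
    """One glyph as a shaper might report it (hypothetical shape of the data)."""
    glyph_id: int  # glyph index in the original font
    x: float       # pen position where the glyph should be painted
    y: float


def assign_characters(run: list[ShapedGlyph]) -> dict[int, str]:
    """Give each distinct glyph ID the next unused sequential character."""
    mapping: dict[int, str] = {}
    for glyph in run:
        if glyph.glyph_id not in mapping:
            code_point = PUA_START + len(mapping)
            if code_point > PUA_END:
                raise ValueError("too many distinct glyphs for one synthetic font")
            mapping[glyph.glyph_id] = chr(code_point)
    return mapping


def draw_calls(run: list[ShapedGlyph], mapping: dict[int, str]) -> list[tuple[str, float, float]]:
    """One (character, x, y) triple per glyph, in paint order."""
    return [(mapping[glyph.glyph_id], glyph.x, glyph.y) for glyph in run]


run = [ShapedGlyph(42, 0.0, 0.0), ShapedGlyph(7, 10.0, 0.0), ShapedGlyph(42, 18.0, 0.0)]
mapping = assign_characters(run)
calls = draw_calls(run, mapping)
```

Each triple would correspond to one fillText call made with the synthetic font selected, so per-glyph colour or other styling can be set between calls.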
The points you make are really important. I rant about how even Google Sheets doesn't do rich text correctly because of fillText's simplicity here: [1]. But I think many of your points could be solved by using HarfBuzz. I dream of having shapeText and fillGlyphs methods on the canvas as an alternate to HarfBuzz because it would be less wasteful. Leave the high-level APIs up to client-side libraries like dropflow and scrawl.
Google has proposed a placeElement method [2] that allows you to render HTML and CSS into a canvas, but that destroys what's so great about canvas, which is that it's crazy fast. DOM is very heavy-weight.
[1] https://github.com/chearon/dropflow#harfbuzz [2] https://github.com/WICG/canvas-place-element
- chrismorgan 3 days ago
  Sadly that doesn’t work with Indic text. Take for example, my name:
  <div style=font-size:10em><span style=color:#e01b24>క</span><span style=color:#5e5c64>్</span><span style=color:#e66100>ర</span><span style=color:#e5a50a>ి</span><span style=color:#26a269>స</span><span style=color:#1a5fb4>్</span></div>
  Ideally it’d render about the same as my manual splitting/colouring: https://temp.chrismorgan.info/%E0%B0%95%E0%B1%8D%E0%B0%B0%E0....
  Now sometimes you can colour parts differently: if you stick with the inherent vowel on a conjunct, here making it LETTER KA, SIGN VIRAMA, LETTER RA, ditching the VOWEL SIGN I, then essentially the K from KA and the A from RA will be coloured LETTER KA, and the R from LETTER RA will be coloured SIGN VIRAMA. I haven’t decided yet if that’s an improvement! But this split colouring I only see in Dropflow—I’m not experiencing it in Firefox or Chromium, both of which do split Arabic colouring.
  - chearon 3 days ago
    
    That's because Noto Sans Telugu (the font dropflow automatically downloaded based on the text; doesn't seem like a great pick but not wrong?) is returning ligatures. Not all fonts will be able to support styling individual characters. I get the same results in the dropflow playground as I get in Firefox and Chrome [1]. Maybe you were using different fonts in those browsers?
    You might be able to turn OpenType features off in those browsers to make it look like your manual coloring, I'm not sure.
    > if you stick with the inherent vowel on a conjunct, here making it LETTER KA, SIGN VIRAMA, LETTER RA, ditching the VOWEL SIGN I, then essentially the K from KA and the A from RA will be coloured LETTER KA, and the R from LETTER RA will be coloured SIGN VIRAMA. I haven’t decided yet if that’s an improvement!
    Maybe because it changes the shaping results? I don't know enough about the writing system to understand this yet :)
    [1] https://jsfiddle.net/fz15xu20/
- rikroots 3 days ago
  
  > Google has proposed a placeElement method that allows you to render HTML and CSS into a canvas
  I've just read it. Oh, dear ... what an awful proposal!
  My initial thoughts, reacting to the README at https://github.com/WICG/canvas-place-element
  > There’s a strong need for better text support on Canvas. [...] This includes not only visual features but also the possibility of supporting the same level of user interaction as the rest of the web
  Agreed, but ... adding HTML/CSS to a raster (or WebGL etc) image is not the way to do it. I much prefer your idea of incorporating HarfBuzz-like functionality into the canvas/text APIs - especially given that HarfBuzz is included as part of most browsers' code base.
  I've played with mixing HTML/CSS with canvas in my canvas library. The results are ... interesting[1][4], but (probably) not the ideal solution. Making it easy for developers to build canvas interactions with HTML/events anywhere on the page is much more productive and useful[2].
  > There is currently no guarantee that the canvas fallback content currently used for accessibility always matches the rendered content, and such fallback content can be hard to generate.
  My thinking is that directly reflecting the canvas text back into the DOM is often not useful for people using screen readers. They don't need to hear every number on the chart axis when instead they could be presented with just the measure and range of the axis. If the HTML element rapidly/repeatedly updates its content, it's going to be a very unpleasant experience for the user. I've experimented with this sort of thing in [2], but need feedback from real screenreader users to understand if the solution meets their needs.
  > Access to live interactive forms, links, editable content with the same quality as the web. This will help to close the app gap with Flash.
  Flash is dead. Please leave its bones in the crypt.
  > A limited set of CSS shaders, such as filter effects, are already available, but there is a desire to use general WebGL shaders with HTML
  I only work with 2D canvas, but I can understand the desire here. A different approach might be to convince browser devs to work on improving SVG (and CSS) filters to support WebGL shaders, which can then be used by the canvas? Though Safari still doesn't support using SVG filters in the canvas so maybe convince them first?
  Playing with filter effects is one of the joys I get from working on my canvas library[3] - but that's got nothing to do with text layout ... except when applying the filter to text, of course![4]
  [1] - Use stacked DOM artefact corners as pivot points https://scrawl-v8.rikweb.org.uk/demo/dom-015.html
  [2] - London crime charts https://scrawl-v8.rikweb.org.uk/demo/modules-001.html
  [3] - A gallery of compound filter effects https://scrawl-v8.rikweb.org.uk/demo/filters-103.html
  [4] - Editable header text colorizer and animation effect snippets https://scrawl-v8.rikweb.org.uk/demo/snippets-006.html
  - chearon 3 days ago
    
    Interesting take on representing charts in an accessible way. I feel like, as web developers, we were fed a myth that lots of markup and attributes automatically makes your content accessible. But it takes more thinking than that.
    I've taken a similar approach to layering canvases with normal HTML (typically the HTML is on top). I don't have a problem logically representing what's painted on the canvas and doing my own hit detection either. Shaders and text shaping in canvas sound a lot more attractive to me than placeElement, but I guess we'll see. I should get around to campaigning for ctx.shapeText and ctx.fillGlyphs but I don't know how much folks care about it.
invalidname 4 days ago

No bidi or Arabic script complexities? Calculating the line break in an RTL situation is just terrible...

liendolucas 3 days ago

Surprised that groff is not mentioned. I've recently used it to typeset my CV and oh boy... It feels like something really really arcane. Despite fighting with it at the beginning I'm quite happy with the result and also like that I can now version the code (no more LibreOffice for that). The key difference between typesetting and word processing is that typesetting is like programming a document: you basically have access to the common functionality offered by programming languages, which of course gives you a lot of power and flexibility. My main complaint about groff is its documentation (the man pages are very terse and hard to follow for individual commands), and there are almost no resources (tutorials, recipes, guides) out there.

xiaohanyu 3 days ago

I know groff; at least the bible of C programming, the K&R book, was typeset with groff, if I am not wrong.
For me, groff/LaTeX/SILE/Typst all belong to the same category, i.e., the author writes in some markup language, a processor processes it, and you get an output. I chose LaTeX and Typst as the classic ones in my post:
- LaTeX is the classic, old-school typesetting engine
- Typst is clearly more modern, with many advanced designs like incremental compilation, wasm and web app, instant preview, and better error messages.
As for the others, groff/SILE, to be honest I don't have time to dive into each of these.
----
Do you see any advantages of groff over LaTeX?
- liendolucas 3 days ago
  
  I haven't used LaTeX in ages to be honest. I guess I could give it a shot and compare pros/cons between them. If I'm not wrong, the book "The Go Programming Language" by Kernighan and Donovan was also typeset with groff (I think there's actually an email sent by someone to Kernighan mentioning how beautifully the book was typeset, and he explains why he didn't choose LaTeX and leaned towards groff). The main reason I used it was that it is a tool that has been in unix systems for ages; I was simply curious about it, and rewriting my CV with it was an excuse to give it a shot.
giraffe_lady 3 days ago

You might be the only person to write groff by hand in a generation.

thangalin 4 days ago

ConTeXt[1] is monolithic typesetting software that I've integrated into my Markdown editor[2]. I find ConTeXt allows for a complete separation of content and presentation. This makes it possible to write a novel in Markdown and produce various styles: a formatted PDF and a manuscript PDF. See the themes output in the screenshots[3]. Too bad the authors didn't evaluate LuaTeX or ConTeXt.

[1]: https://wiki.contextgarden.net/Installation

[2]: https://keenwrite.com/

[3]: https://keenwrite.com/screenshots.html

xiaohanyu 4 days ago

Just for keenwrite, from the screenshot: https://keenwrite.com/images/screenshots/05.png, it seems that keenwrite doesn't implement the Knuth-Plass line-breaking algorithm?

__mharrison__ 4 days ago

I've written and published over a dozen books. (Two published with big tech publishers, the rest self-published.)

With my most recent book, I've moved my PDF generation to Typst. LaTeX, you served me well, but I'm more than happy to never use you again. Typst is better (or decent enough) in every dimension.

xiaohanyu 4 days ago

Just curious, what do you love most about Typst over LaTeX?
I guess: 1) the incremental compilation speed, 2) the modern user experience (better error messages, better syntax, etc.)?
- __mharrison__ 4 days ago
  
  I feel like LaTeX is very hard to really learn. I wrote it for years without really understanding it. (Still don't).
  I asked what was the best book for learning LaTeX. The response... "There is no book. Sit next to someone writing their dissertation."
  Typst on the other hand, feels modern, is readable, and is fast. (4 seconds vs 2 minutes for some of my books.) The developers are responsive.
  My only complaint is that some of my code broke during the latest release. I'll not complain too much because it is a nascent project and still making quick progress.
  After I realized that Typst had the features I required for my books, I immediately moved to it.
  Good riddance LaTeX. You served me well, but I felt like there was never a better option... Until now.
endgame 4 days ago

Typst looked promising, but the very first thing I wanted to do with it - generate some slides with code snippets, and use highlights to call out specific features of the code - is not possible. The core layout engine seems to merge `styled(...)` spans in code blocks far too aggressively, making it impossible for codly (the code highlighting package I tried) to pick out precise ranges to highlight.
I went back to beamer.
- Onawa 4 days ago
  
  Look at Quarto. Write in markdown and export to web, print, and presentations (including straight to PowerPoint or reveal.js for interactive web-based slides).
  All from the same content using includes, variables, flags, etc. Show interactive plots directly in your presentations, tons of other features.
  Almost every project I create now whether it's documentation, presentation, report, website, or anything else project-related can fit within `quarto create project`.
- chaxor 4 days ago
  
  This is pretty hard to believe considering how good the ecosystem in Typst is so far. There are quite a few packages for making slides in Typst including [Polylux](https://github.com/gszauer/polylux), [Touying](https://github.com/Wyntau/touying), [minideck](https://typst.app/universe/package/minideck), [slydst](https://typst.app/universe/package/slydst), [minimal-presentation](https://typst.app/universe/package/minimal-presentation) and many more.
  - endgame 4 days ago
    
    The problem is not with the slides package, it's with the content I want to put on the slides:
    https://github.com/Dherse/codly/issues/35 and in particular, this comment from the author of codly where he sounds constrained by Typst: https://github.com/Dherse/codly/issues/35#issuecomment-24667...
    
    Aaron2222 4 days ago
    
    My impressions with Typst have been very positive, though I did notice the same issue with Codly. Typst is still quite a new tool, so my expectation would be that this kind of issue will be fixed at some point.
- __mharrison__ 4 days ago
  
  Did you file an issue? The devs are quite responsive and now is the time to act while it is making quick progress.
  - Aaron2222 3 days ago
    
    They did[0] (as per their earlier reply[1]), and now it's fixed.
    [0]: https://github.com/Dherse/codly/issues/35
    [1]: https://news.ycombinator.com/item?id=42103510
amichail 4 days ago

Have you tried TeXmacs or its fork Mogan?

phonon 4 days ago

This overlooks CSS Paged Media based options like paged.js, weasyprint, etc. (You can find the full list here..some open source, some commercial)[0]

[0]https://www.print-css.rocks/tools

xiaohanyu 4 days ago

Author here.
I mentioned https://polytype.dev/ at the end of the post, which has paged.js included.
It is not that hard to simulate pagination with JavaScript; the deal breaker for me is still line breaking and the nuances of mixed-language typesetting.
- phonon 4 days ago
  
  That's pretty indirect....you might want to look more closely... https://drafts.csswg.org/css-page/ covers quite a lot...
  You also didn't mention the new (Chrome only) CSS text-wrap: pretty
  https://developer.chrome.com/blog/css-text-wrap-pretty
  https://docs.google.com/document/d/1jJFD8nAUuiUX6ArFZQqQo8yT...
  https://chromestatus.com/feature/5145771917180928

lejalv 4 days ago

GNU TeXmacs (https://youtu.be/H46ON2FB30U) is missing.

sundarurfriend 4 days ago

I firmly believe that just its name set TeXmacs back by a lot. It's a pretty great choice, was even more so a decade or two ago (when typst and the ilk weren't available). But every single time I've tried to introduce it to someone, it's always some variant of "Oh, I don't use Emacs, so a plugin for that isn't useful to me". It's a really unintuitive and unfortunate naming choice that I wish they'd at least changed at some point along the way.
- xiaohanyu 4 days ago
  
  Any idea about TeXmacs support for non-latin languages?
  - GiovanniP 3 days ago
    
    There is some support, but I do not know the details. You might try and ask in the TeXmacs forum, at http://forum.texmacs.cn/; a few of the developers read it and answer questions and they might have the information you are looking for.

favorited 2 days ago

Interesting that justified text was such a focus point. I've been told by accessibility consultants to avoid justification alignment in the products I work on because it can be more challenging for some folks to read (IIRC people with dyslexia in particular).

dpflug 4 days ago

Another: https://www.sisudoc.org/

fithisux 4 days ago

sile is missing

AlbertoGP 4 days ago

It is mentioned in passing though, with a link to its website:
> It is widely adopted by various typesetting engines like TeX, SILE[1] and Typst, etc.
> [1] https://sile-typesetter.org/
- fithisux 4 days ago
  
  I stand corrected then.