Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019...

5
Presentation given at the TUG 2019 Conference, Palo Alto file version: August 25, 2019 0:02 1 Taming UTF-8 in pdfT E X Frank Mittelbach Abstract To understand the concepts in pdfL A T E X for processing UTF-8 encoded files it is helpful to first take a look at the models used by the T E X engine and earlier attempts made by L A T E X on top of T E X. The talk provides a short historical review of that area and explains how it is possible in a T E X system that only understands 8-bit input to nevertheless interpret and process UTF-8 files successfully; what the obstacles are and how they can be overcome and what restrictions will remain if one doesn’t switch to a Unicode-aware engine such as LuaT E X or X E T E X. It will finish with an overview about the improvements with respect to UTF-8 handling that will be activated in L A T E X within 2019 and explains how they can already be tested today. The slides have been retrospectively constructed from the mindmap used during the presentation. Frank Mittelbach Mainz, Germany https://www.latex-project.org Taming UTF-8 in pdfTeX Frank Mittelbach TUG 2019, Palo Alto A short history lesson LaTeX2e solution A short history lession part 2 (UTF-8) Restrictions New - upcoming Slide #1 Taming UTF-8 in pdfT E X

Transcript of Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019...

Page 1: Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019 Conference, Palo Alto le version: August 25, 2019 0:02 1 Taming UTF-8 in pdfTEX Frank

Presentation given at the TUG 2019 Conference, Palo Alto file version: August 25, 2019 0:02 1

Taming UTF-8 in pdfTEX

Frank Mittelbach

Abstract

To understand the concepts in pdfLATEX for processing UTF-8 encoded files it is helpful to first take a lookat the models used by the TEX engine and earlier attempts made by LATEX on top of TEX. The talk providesa short historical review of that area and explains

• how it is possible in a TEX system that only understands 8-bit input to nevertheless interpret andprocess UTF-8 files successfully;

• what the obstacles are and how they can be overcome and

• what restrictions will remain if one doesn’t switch to a Unicode-aware engine such as LuaTEX or X ETEX.

It will finish with an overview about the improvements with respect to UTF-8 handling that will be activatedin LATEX within 2019 and explains how they can already be tested today.

The slides have been retrospectively constructed from the mindmap used during the presentation.

� Frank MittelbachMainz, Germanyhttps://www.latex-project.org

Taming UTF-8 in pdfTeX

Frank MittelbachTUG 2019, Palo Alto

A short history lesson

LaTeX2e solution

A short history lession part 2 (UTF-8)

Restrictions

New - upcoming

Slide #1

Taming UTF-8 in pdfTEX

Page 2: Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019 Conference, Palo Alto le version: August 25, 2019 0:02 1 Taming UTF-8 in pdfTEX Frank

2 file version: August 25, 2019 0:02 Presentation given at the TUG 2019 Conference, Palo Alto

A short history lesson

TeX79

7bit

Input „key code“ = Font slot

TeX82

8bitInput „key code“ = Font slot

8 bit code pages differ country by country

Font slots 129-255 not really used

TeX 3\language

Cork font encoding!

Problems 1995

Slide #2

Problems 1995

The German word „Größe“

A „pound“ symbol in italics

Slide #3

Frank Mittelbach

Page 3: Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019 Conference, Palo Alto le version: August 25, 2019 0:02 1 Taming UTF-8 in pdfTEX Frank

Presentation given at the TUG 2019 Conference, Palo Alto file version: August 25, 2019 0:02 3

maps to translated by

LaTeX2e solution

inputenc package LICR (LaTeX Internal Character Representation) fontenc package

LICR surives the roundtrip through .aux .toc, etc., since only ASCII chars are used and commands are protected from expansion

no dependencies on code pages

no dependencies on font slots

Font encodings are (in theory) well-defined

Fonts with the same encoding map exactly the same characters

or notno tofu!

Slide #4

A short history lession part 2 (UTF-8)

Encoding

1 byte = ascii 0xxxxxxx

2 bytes 110xxxxx 10xxxxxx

3 bytes 1110xxxx 10xxxxxx 10xxxxxx

...

Approach (in pdfTeX)

Features

Slide #5

Taming UTF-8 in pdfTEX

Page 4: Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019 Conference, Palo Alto le version: August 25, 2019 0:02 1 Taming UTF-8 in pdfTEX Frank

4 file version: August 25, 2019 0:02 Presentation given at the TUG 2019 Conference, Palo Alto

translated by

maps to

Approach (in pdfTeX)

Program reads only bytes Start byte is made „active“

This reads all necessary further bytes

Determines the Unicode slot

LICR (LaTeX Internal Character Representation) fontenc

we know that alreadytranslated by

maps to

Features

...

UTF8 characters are only supported if the glyph exists in the loaded fonts

But those again without tofu!

Slide #6

Restrictions

Each UTF-8 document needs \usepackage[utf8]{inputenc}

Multi-Byte UTF-8 can’t be used as part of command names

no \Straße

and also no \label{Überblick}

Multi-Byte UTF-8 has problems in \typeout etc

In file names only restricted usage possible --- if at all

Problems with \input, \include or graphic files

Example:

Slide #7

Frank Mittelbach

Page 5: Presentation given at the TUG 2019 ... - latex-project.org · Presentation given at the TUG 2019 Conference, Palo Alto le version: August 25, 2019 0:02 1 Taming UTF-8 in pdfTEX Frank

Presentation given at the TUG 2019 Conference, Palo Alto file version: August 25, 2019 0:02 5

Example:

Input\input{fürchterlich}\input{Straße}

OT1

! LaTeX Error: File `f\unhbox \voidb@x \bgroup \let \unhbox \voidb@x \setbox @tempboxa \hbox {u\global \mathchardef \accent@spacefactor\spacefactor }\accent 127 u\egroup \spacefactor \accent@spacefactor rchterlich.tex' not found.

! LaTeX Error: File `Stra\OT1\ss e.tex' not found.

T1

! LaTeX Error: File `fürchterlich.tex' not found.(only an error because the file didn’t exist)

! LaTeX Error: File `Stra\T1\ss e.tex' not found.

Slide #8

New - upcoming

April 2018

UTF-8 is the new default input encoding

Each UTF-8 document needs \usepackage[utf8]{inputenc}

Combining chars, e.g., „a“ + „U+0301“ = á are not supported

This is a pdfTeX restriction which can’t be realistically overcome

Fall 2019

Multi-Byte UTF-8 works now in ...

\label, \cite, \typeout, etc

in file names

character not in loaded fonts are allowed too!

but still not in command names

This is a pdfTeX restriction which can’t be realistically overcome

Space in file names are allowed without quoting the name

Available now for testing

LaTeX-dev format

Slide #9

Taming UTF-8 in pdfTEX