What Unicode characters does pdfLaTeX support with a minimal preamble?

Question

Consider a LaTeX document with the minimal configuration necessary to support UTF-8 characters — as I understand it, such a document would look like this:
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
Text goes here.
\end{document}

In such a document, what Unicode characters does pdfLaTeX recognize & support?  Is it just the characters belonging to the T1 encoding (which apparently looks like this)?  Is it a function of the current font?  Is it a built-in ad hoc list that tends to expand with new LaTeX/pdfTeX/pdfLaTeX releases?
This answer tells me that the characters are defined in t1enc.dfu (and/or utf8enc.dfu?), but I'm interested in a more intensional, not extensional, answer.

Ulrike Fischer · Accepted Answer

A current LaTeX format will input omsenc.dfu, ot1enc.dfu, t1enc.dfu and ts1enc.dfu (this is new in current LaTeX compared to the answer you linked to).
You can find all four files in tex/latex/base and check which Unicode input they support. As the names imply, their support range is related to the output encodings, but there is no strict 1-1 relationship. t1enc.dfu, for example, also contains \DeclareUnicodeCharacter{00A0}{\nobreakspace}.
With a current LaTeX you do not need to load inputenc, as utf8 is the default anyway. So you get the same support with this document:
\documentclass{article}
\usepackage[T1]{fontenc}

\begin{document}
Text goes here.
\end{document}
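As a rough illustration of what that default setup accepts, here is a small sketch. The sample words and symbols are my own choice, not an exhaustive list, and the last line assumes a LaTeX release from 2020 or later, where the TS1 symbols are available without loading textcomp.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}

\begin{document}
% Accented Latin letters, mapped through t1enc.dfu to T1 glyphs.
Grüße, café, naïve, Łódź.

% Symbols reached through ts1enc.dfu (assumption: kernel from 2020 or later).
§ 5 costs 10 €.
\end{document}
```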

David Carlisle · Answer

Your question is rather undefined as "minimal preamble" can be interpreted to mean "the minimal required to support the Unicode Characters needed" which is somewhat circular.
The example preamble posted produces the following error if I add Cyrillic text:
! Package inputenc Error: Unicode character П (U+041F)
(inputenc)                not set up for use with LaTeX.

Cyrillic code points are not set up by default. But independent of the input encoding, they would not typeset anyway, because the T1 font encoding is specified, and that only covers the Latin alphabet.
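For reference, a document along these lines should reproduce the error. It is just the preamble from the question with a Cyrillic word added to the body. The exact wording of the message may differ between LaTeX releases.

```latex
% The preamble from the question, with Cyrillic text added to the body.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\begin{document}
Text goes here. Привет % expected to stop with the "not set up" error at П
\end{document}
```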
You do not need inputenc in current LaTeX, as UTF-8 is the default. If you specify a font encoding such as X2 that includes Cyrillic, suitable Unicode mappings will be loaded from x2enc.dfu, which is in the base LaTeX distribution.
So this runs without error:
\documentclass{article}
\usepackage[T1,X2]{fontenc}
\begin{document}
{\fontencoding{T1}\selectfont Text goes here}. Привет
\end{document}

The file /usr/local/texlive/2020/texmf-dist/tex/latex/base/utf8enc.dfu (use kpsewhich utf8enc.dfu to find the file on your local system) lists all the characters declared in the encoding .dfu files in the base distribution, but contributed packages may add more.
grep '[.]dfu' `kpsewhich --all ls-R`

will list all the ones available. As well as the core Latin, Greek and Cyrillic encodings, I see armglyphs.dfu, pmboxdrawenc.dfu and otf-hangul.dfu, for example.
Basically, the restriction is not on the interpretation of UTF-8. pdfLaTeX's inputenc code understands the full UTF-8 encoding, so you can specify any Unicode number. But a font in pdfLaTeX can only have 256 characters, so most Unicode characters cannot be defined until you specify a font to cover the required character set.
If you have a font that covers a Unicode range, the matching inputenc mapping probably already exists (and will be input automatically for any font encoding declared in the preamble) or can easily be added.
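A minimal sketch of adding a mapping by hand, assuming a current LaTeX where \DeclareUnicodeCharacter is available without loading inputenc. The choice of U+2264 and its mapping to \leq are only an illustration; as far as I can tell, the base .dfu files do not declare this character.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
% Illustrative mapping (my assumption, not part of the base distribution):
% send U+2264 (≤) to the math symbol \leq, usable in text and in math mode.
\DeclareUnicodeCharacter{2264}{\ensuremath{\leq}}
\begin{document}
In text: a ≤ b. In math: $a ≤ b$.
\end{document}
```

The declaration only tells LaTeX which command to run for that code point. The glyph itself still has to exist in one of the loaded 256-character fonts, here the math symbol font.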
