.TH UTFPATGEN 1 "30 May 2026" "utfpatgen 1.0"
.\"=====================================================================
.if t .ds TX \fRT\\h'-0.10m'\\v'0.17v'E\\v'-0.17v'\\h'-0.06m'X\fP
.if n .ds TX TeX
.ie t .ds OX \fIT\h'-0.17m'\v'+0.21m'E\v'-0.21m'\h'-0.04m'X\fP
.el .ds OX TeX
.\"=====================================================================
.SH NAME
utfpatgen \- generate patterns for TeX hyphenation
.SH SYNOPSIS
.B utfpatgen
.I dictionary_file pattern_file patout_file translate_file
.\"=====================================================================
.SH DESCRIPTION
.I UTFpatgen
is an extension to
.BR patgen (1)
for generating patterns from large input alphabets, with an extended
hyphenation level range and native dynamic memory management.
.PP
The program reads a
.I dictionary_file
containing a list of hyphenated words and a
.I pattern_file
containing previously-generated patterns (if any) for a particular
language (not a complete \*(TX source file; see below), and produces the
.I patout_file
with (previously- plus newly-generated) hyphenation patterns for that
language.
.PP
The
.I translate_file
defines language-specific values for the parameters
.IR left_hyphen_min " and " right_hyphen_min
used by \*(TX's hyphenation algorithm and the external representation
of the lower and upper case version(s) of all `letters' of that language.
.PP
Further details of the pattern generation process such as hyphenation
levels and pattern lengths are requested interactively from the user's
terminal.
Optionally,
.I UTFpatgen
creates a new dictionary file
.BI pattmp. n
showing the good and bad hyphens found by the generated patterns, where
.I n
is the highest hyphenation level.
.PP
All filenames must be complete; no adding of default extensions or path
searching is done.
.\"=====================================================================
.SH INPUT FORMATS
.TP \w'@@'u+2n
.B Letters
.I UTFpatgen
is able to process any UTF-8 encoded character or, more generally,
any encoding that is prefix-free (no letter is a prefix of another) and
does not use the `0xFF' byte, which has a special meaning in
.IR UTFpatgen ,
described next.
.TP \w'@@'u+2n
.B Levels and weights
Non-character parts of the text, such as hyphenation levels or weights,
should be represented as a 2-byte sequence `0xFF '.
If a file uses the
.BR patgen (1)
encoding (ASCII numerals), we recommend using
.BR sed (1)
for conversion.
.TP \w'@@'u+2n
.B File formats
The formats and conventions required in the four files
.RI ( dictionary_file ,
.IR pattern_file ,
.IR patout_file ,
.IR translate_file )
are identical to those in
.BR patgen (1),
with the sole exception of the level and weight encoding described above.
.\"=====================================================================
.SH "SEE ALSO"
Frank Liang,
.IR "Word hy-phen-a-tion by com-puter" ,
STAN-CS-83-977,
Stanford University Ph.D. thesis, 1983, http://tug.org/docs/liang.
.PP
Donald E. Knuth,
.IR "The \*(OXbook" ,
Addison-Wesley, Appendix H.
.TP
https://ctan.org/pkg/patgen
The original patgen program, by Frank Liang, with system updates by
Peter Breitenlohner.
.TP
https://ctan.org/pkg/hyph-utf8
Collected hyphenation patterns for many languages in many formats.
.TP
https://ctan.org/tex-archive/language/
General CTAN directory for patterns and support for many other languages.
.TP
https://tug.org/TUGboat/Contents/listkeyword.html#CatTAGMultilingualDocumentProcessing
\fITUGboat\fP articles on hyphenation and other aspects of
language-specific document processing.
.\"=====================================================================
.SH AUTHORS
Ondřej Metelka
.br
Released under the MIT license.
.br
https://ctan.org/pkg/utfpatgen