8 Internationalization

This chapter describes the internationalization features of Tru64 UNIX. The first section provides a brief internationalization overview (Section 8.1), after which the following topics are discussed:

Supported languages (Section 8.2)

Using the localedef utility to create locales (Section 8.3)

Converting text from one codeset to another (Section 8.4)

Unicode locales and dense code locales for WLS localization (Section 8.5)

Support for the Unicode Standard Version 3.1 and ISO 10646 standards (Section 8.6)

The Configure International Software options (Section 8.7)

Support for the euro currency symbol (Section 8.8)

The dxim input server (Section 8.9)

The internationalized Curses library (Section 8.10)

Additional internationalization utilities and features supported by the operating system (Section 8.11)

8.1 Overview

The term "internationalization" is formally defined by The Open Group as a

"provision within a computer program of the capability of making itself adaptable to the requirements of different native languages, local customs, and coded character sets."

In practice, this means that internationalized programs can run in any supported locale without modification. A locale is a software environment that correctly handles the cultural conventions of a particular geographic area, such as China or France, and a language as it is used in that area. For example, when a user selects a Chinese locale, commands, system messages, and keystrokes can all be in Chinese characters and displayed in a way appropriate for Chinese.

Tru64 UNIX is an internationalized operating system that not only allows users to interact with existing applications in their native language, but also supports a full set of application interfaces, referred to as the Worldwide Portability Interfaces (WPI), to enable software developers to write internationalized applications. The original code for these interfaces came from the Open Software Foundation (OSF) and has been enhanced.

The internationalization support in the operating system conforms to The Open Group's CAE specifications for system interfaces and headers (XSH Issue 5), curses (XCURSES Issue 4.2), and commands and utilities (XCU Issue 5). These specifications align with current POSIX and ISO C standards. This conformance ensures that commands, utilities, and libraries have been internationalized, and their corresponding message catalogs have been included in the base system.

Tru64 UNIX conforms to the Chinese Character Input Standard, GB18030-2000, which went into effect on September 1, 2001.

In addition, the operating system supports the X Input Method (XIM) and X Output Method (XOM) to facilitate input of local language characters, text drawing, measurement, and interclient communication. These functions are implemented according to the X11R6.3 specification and include some problem corrections specified by X11R6.4.

The operating system also supports a 32-bit wchar_t data type, which in turn enables support for a wide array of codesets, including the one defined by the ISO 10646 standard.

For more information about internationalization on the Tru64 UNIX operating system, see the following:

Writing Software for the International Market (information for programmers)

Using International Software

The Tru64 UNIX worldwide language support page:
http://www.tru64unix.compaq.com/unix/i18n.htm

8.2 Supported Languages

Most locales are included in Worldwide Language Support (WLS) subsets that are optionally installed. Some, as indicated in Table 8-1, are part of the mandatory base operating system.

Locales whose names end in .UTF-8 use file code and internal process code (wchar_t encoding) defined in the ISO 10646 and Unicode standards. Other, non-UTF-8 Unicode locales use traditional UNIX and proprietary codesets for the file code while using UTF-32 as the internal process code. A subset of these Unicode locales has a @ucs4 modifier; however, they are the same as the locales without the @ucs4 modifier.

The universal.UTF-8 locale is also available (for use by applications rather than end users). It supports the complete set of characters in the universal character set (UCS). See Unicode(5) for more information about encoding formats.

UTF-8 and Latin-9 (ISO 8859-15) locales support the euro currency symbol.

For the most up-to-date list of supported languages and locales, refer to the l10n_intro(5) reference page.

Table 8-1 lists the languages supported by the operating system and their corresponding locales.

Table 8-1: Languages and Locales

Language	Locale Name
Catalan	`ca_ES.ISO8859-1` ^{[Footnote 2]} `ca_ES.ISO8859-15` `ca_ES.UTF-8`
Chinese, Simplified (PRC)	`zh_CN.UTF-8` `zh_CN.dechanzi` `zh_CN.dechanzi@ucs4` `zh_CN.dechanzi@pinyin` `zh_CN.dechanzi@pinyin@ucs4` `zh_CN.dechanzi@radical` `zh_CN.dechanzi@radical@ucs4` `zh_CN.dechanzi@stroke` `zh_CN.dechanzi@stroke@ucs4` `zh_CN.GBK` `zh_CN.GB18030`
Chinese, Traditional (Hong Kong)	`zh_HK.big5` `zh_HK.dechanyu` `zh_HK.dechanyu@ucs4` `zh_HK.dechanzi` `zh_HK.dechanzi@ucs4` `zh_HK.eucTW` `zh_HK.eucTW@ucs4` `zh_HK.UTF-8`
Chinese, Traditional (Taiwan)	`zh_TW.big5` `zh_TW.big5@chuyin` `zh_TW.big5@radical` `zh_TW.big5@stroke` `zh_TW.dechanyu` `zh_TW.dechanyu@ucs4` `zh_TW.dechanyu@chuyin` `zh_TW.dechanyu@chuyin@ucs4` `zh_TW.dechanyu@radical` `zh_TW.dechanyu@radical@ucs4` `zh_TW.dechanyu@stroke` `zh_TW.dechanyu@stroke@ucs4` `zh_TW.eucTW` `zh_TW.eucTW@ucs4` `zh_TW.eucTW@chuyin` `zh_TW.eucTW@chuyin@ucs4` `zh_TW.eucTW@radical` `zh_TW.eucTW@radical@ucs4` `zh_TW.eucTW@stroke` `zh_TW.eucTW@stroke@ucs4` `zh_TW.UTF-8`
Czech	`cs_CZ.ISO8859-2` `cs_CZ.ISO8859-2@ucs4`
Danish	`da_DK.ISO8859-1` ^{[Footnote 2]} `da_DK.ISO8859-15` `da_DK.UTF-8`
Dutch	`nl_NL.ISO8859-1` ^{[Footnote 2]} `nl_NL.ISO8859-15` `nl_NL.UTF-8`
Dutch, Belgian	`nl_BE.ISO8859-1` ^{[Footnote 2]} `nl_BE.ISO8859-15` `nl_BE.UTF-8`
English, U.S. (ASCII)	`C` (POSIX) ^{[Footnote 2]}
English, U.S.	`en_US.ISO8859-1` ^{[Footnote 2]} `en_US.ISO8859-15` `en_US.cp850` `en_US.UTF-8` `en_US.UTF-8@euro` ^{[Footnote 3]}
English, U.K.	`en_GB.ISO8859-1` ^{[Footnote 2]} `en_GB.ISO8859-15` `en_GB.UTF-8`
European	`en_EU.UTF-8@euro` ^{[Footnote 4]}
Finnish	`fi_FI.ISO8859-1` ^{[Footnote 2]} `fi_FI.ISO8859-15` `fi_FI.UTF-8`
French	`fr_FR.ISO8859-1` ^{[Footnote 2]} `fr_FR.ISO8859-15` `fr_FR.UTF-8`
French, Belgian	`fr_BE.ISO8859-1` ^{[Footnote 2]} `fr_BE.ISO8859-15` `fr_BE.UTF-8`
French, Canadian	`fr_CA.ISO8859-1` ^{[Footnote 2]} `fr_CA.ISO8859-15` `fr_CA.UTF-8`
French, Swiss	`fr_CH.ISO8859-1` ^{[Footnote 2]} `fr_CH.ISO8859-15` `fr_CH.UTF-8`
German	`de_DE.ISO8859-1` ^{[Footnote 2]} `de_DE.ISO8859-15` `de_DE.UTF-8`
German, Swiss	`de_CH.ISO8859-1` ^{[Footnote 2]} `de_CH.ISO8859-15` `de_CH.UTF-8`
Greek	`el_GR.ISO8859-7` `el_GR.ISO8859-7@ucs4` `el_GR.UTF-8`
Hebrew	`he_IL.ISO8859-8` `he_IL.ISO8859-8@ucs4`
Hungarian	`hu_HU.ISO8859-2` `hu_HU.ISO8859-2@ucs4`
Icelandic	`is_IS.ISO8859-1` ^{[Footnote 2]} `is_IS.ISO8859-15`
Italian	`it_IT.ISO8859-1` ^{[Footnote 2]} `it_IT.ISO8859-15` `it_IT.UTF-8`
Japanese	`ja_JP.eucJP` `ja_JP.SJIS` `ja_JP.SJIS@ucs4` `ja_JP.deckanji` `ja_JP.deckanji@ucs4` `ja_JP.sdeckanji` `ja_JP.UTF-8`
Korean	`ko_KR.deckorean` `ko_KR.deckorean@ucs4` `ko_KR.eucKR` `ko_KR.KSC5601` `ko_KR.UTF-8`
Lithuanian	`lt_LT.ISO8859-4` `lt_LT.ISO8859-4@ucs4`
Norwegian	`no_NO.ISO8859-1` ^{[Footnote 2]} `no_NO.ISO8859-15` `no_NO.UTF-8`
Polish	`pl_PL.ISO8859-2` `pl_PL.ISO8859-2@ucs4`
Portuguese	`pt_PT.ISO8859-1` ^{[Footnote 2]} `pt_PT.ISO8859-15` `pt_PT.UTF-8`
Russian	`ru_RU.ISO8859-5` `ru_RU.ISO8859-5@ucs4`
Slovak	`sk_SK.ISO8859-2` `sk_SK.ISO8859-2@ucs4`
Slovene	`sl_SI.ISO8859-2` `sl_SI.ISO8859-2@ucs4`
Spanish	`es_ES.ISO8859-1` ^{[Footnote 2]} `es_ES.ISO8859-15` `es_ES.UTF-8`
Swedish	`sv_SE.ISO8859-1` ^{[Footnote 2]} `sv_SE.ISO8859-15` `sv_SE.UTF-8`
Thai	`th_TH.TACTIS`
Turkish	`tr_TR.ISO8859-9` `tr_TR.ISO8859-9@ucs4`

Note that you can switch languages or character sets as necessary and can even operate multiple processes in different languages or codesets in the same system at the same time.
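The locale names in Table 8-1 follow a visible pattern: a language, an optional territory, a codeset, and zero or more @ modifiers (for example, zh_CN.dechanzi@pinyin@ucs4 carries two). The following Python sketch splits a name into those parts. It is an illustration only; the function and type names are hypothetical and are not part of the operating system.

```python
from typing import NamedTuple, Optional, Tuple


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifiers: Tuple[str, ...]


def parse_locale_name(name: str) -> LocaleName:
    """Split a name such as 'zh_CN.dechanzi@pinyin@ucs4' into its parts."""
    # Everything after the first '@' is a modifier; there may be several.
    base, *modifiers = name.split("@")
    # The codeset follows the first '.', if any.
    lang_terr, _, codeset = base.partition(".")
    # The territory follows the first '_', if any.
    language, _, territory = lang_terr.partition("_")
    return LocaleName(language, territory or None, codeset or None, tuple(modifiers))


if __name__ == "__main__":
    for name in ("C", "pl_PL.ISO8859-2@ucs4", "zh_CN.dechanzi@pinyin@ucs4"):
        print(name, "->", parse_locale_name(name))
```

The parser makes no claim about which names are installed; it only reflects the naming pattern shown in the table.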

For more information on a particular coded character set, such as ISO8859-9, see the reference page with the same name. For more information about UCS-4 and UTF-8 encoding, see Unicode(5). For more information about PC code pages, see code_page(5).

8.3 Locale Creation

The localedef utility allows programmers to create their own locales: it compiles locale source files and lets the programmer assign a unique name to each new locale.

For more information on creating locales, see Writing Software for the International Market.

8.4 Codeset Conversion

The operating system includes the iconv utility and the iconv_open(), iconv(), and iconv_close() functions, which convert text from one codeset to another, thereby assisting programmers in the writing of international applications. For use with these interfaces, the operating system includes a large set of codeset converters.

The en_US.UTF-8 X locale database file contains font definitions that include all the various fonts used with the operating system. Thus, applications running under the en_US.UTF-8 locale can display all the font characters installed with Worldwide Language Support (WLS). Applications running under the Asian locales display all of the WLS installed fonts, except for ISO8859-2, -4, -5, -7, -8, -9, and TACTIS.

In addition to conversion between different codesets for the same language, these converters support conversion between different Unicode formats, such as UCS-2, UCS-4, and UTF-8. There are also codeset converters that handle the most commonly used PC code-page formats.

Codeset conversion is also used by the printing subsystem and by utilities such as man to allow processing of files in different languages and encoding formats. Additionally, codeset conversion is implemented in mail utilities for mail interchange with systems using different codesets and in the X Window System Toolkit for text input, drawing, and interclient communication. For more information on codeset conversion, see the iconv_intro(5) reference page. See the Unicode(5) and code_page(5) reference pages for a discussion of converters for Unicode encoding formats and PC code-page formats, respectively.
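To make the idea of codeset conversion concrete, the following Python sketch decodes bytes from one codeset and re-encodes them in another. It uses Python's built-in codecs, not the converters shipped with the operating system, and the codec names (iso8859_2, cp850, utf-16-be) are Python's own, so treat it as an analogy to what iconv does rather than a description of it. Python has no codec named UCS-2; big-endian UTF-16 stands in for it here.

```python
def convert_codeset(data: bytes, from_codeset: str, to_codeset: str) -> bytes:
    """Decode bytes from one codeset and re-encode them in another."""
    return data.decode(from_codeset).encode(to_codeset)


if __name__ == "__main__":
    # The Polish city name "Łódź" in ISO 8859-2: one byte per character.
    latin2 = bytes([0xA3, 0xF3, 0x64, 0xBC])

    # Same language, different codesets.
    utf8 = convert_codeset(latin2, "iso8859_2", "utf-8")
    print(utf8.hex())

    # Between Unicode encoding formats.
    utf16 = convert_codeset(utf8, "utf-8", "utf-16-be")
    print(utf16.hex())

    # From a PC code page to ISO Latin-1: 0x82 is e-acute in code page 850.
    print(convert_codeset(b"\x82", "cp850", "iso8859_1").hex())
```

A conversion fails with an exception when the target codeset has no representation for a character, which is one reason a converter must exist for each codeset pair that an application needs.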

8.5 Unicode and Dense Code Locales

When you install Worldwide Language Support, Tru64 UNIX provides localization support with two types of locales: Unicode locales and dense code locales.

Unicode locales conform to Unicode and ISO/IEC 10646 standards and use UTF-32 as the wide character encoding. Under UTF-32 wide character encoding, wchar_t values represent the same characters regardless of the locale and, because Unicode standards prevail, implementation is consistent across platforms.

Dense code locales use dense code for wide character encoding to minimize table size (that is, codepoints are assigned consecutively with no empty positions).
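The difference in table size can be sketched in Python. The example below assigns consecutive integers to every character that a single-byte codeset defines and compares the result with the range that the same characters span as Unicode code points. The assignment order used here (byte order) is an assumption made for illustration; this chapter does not specify how the operating system actually assigns dense codes.

```python
def dense_codes(codeset: str) -> dict:
    """Assign consecutive integers to each character a single-byte codeset defines."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode(codeset))
        except UnicodeDecodeError:
            continue  # this byte value is undefined in the codeset
    # Consecutive values, no empty positions.
    return {ch: index for index, ch in enumerate(chars)}


if __name__ == "__main__":
    table = dense_codes("iso8859_2")
    dense_size = max(table.values()) + 1
    unicode_span = max(ord(ch) for ch in table) + 1
    print("entries in a dense table:", dense_size)
    print("entries in a table indexed by Unicode code point:", unicode_span)
```

For a codeset such as ISO 8859-2, whose characters are scattered across several Unicode blocks, a table indexed by code point has many unused slots, whereas the dense table has none.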

In addition to UTF-8 locales, which use ISO 10646 (Unicode) as both the internal and external representation of characters, the dense code and Unicode locales provide functionally equivalent versions of many locales.

The dense code locales are those with names that end in a codeset other than UTF-8 (for example, ISO8859-1, eucJP, GB18030). The non-UTF-8 Unicode locales are those that include @ucs4 at the end of the locale name. A sample pair of dense and Unicode locales is pl_PL.ISO8859-2 and pl_PL.ISO8859-2@ucs4.

In general, the same charmaps and locale source can be used for dense code and Unicode locales. However, characters that are not defined in the LC_COLLATE section of the locale source may sort differently in the two types of locales.

For Latin-1 locales (ISO 8859-1), the dense code and Unicode locales are identical because Latin-1 characters are the same as the first 256 characters in Unicode. The operating system also supports three UCS transformation formats (UTFs): UTF-8, UTF-16, and UTF-32, all of which are defined in the Unicode standard. See Unicode(5) for a full description of Unicode, UCS-4, and the transformation formats.
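Both statements in the preceding paragraph can be checked with a short Python sketch. It relies on Python's built-in latin-1, utf-8, utf-16-be, and utf-32-be codecs (the big-endian variants are chosen here only to avoid a byte order mark); the function names are illustrative.

```python
def latin1_matches_unicode() -> bool:
    """Check that every Latin-1 byte value equals its Unicode code point."""
    return all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))


def utf_forms(ch: str) -> dict:
    """Return the UTF-8, UTF-16, and UTF-32 byte sequences for one character, as hex."""
    return {
        "UTF-8": ch.encode("utf-8").hex(),
        "UTF-16": ch.encode("utf-16-be").hex(),
        "UTF-32": ch.encode("utf-32-be").hex(),
    }


if __name__ == "__main__":
    print(latin1_matches_unicode())
    print(utf_forms("\u20ac"))       # euro sign, U+20AC
    print(utf_forms("\U00020000"))   # a character outside the 16-bit range
```

The second character needs two 16-bit units in UTF-16 but still fits in a single 32-bit unit in UTF-32, which is why a 32-bit wide-character type can hold any ISO 10646 character directly.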

To switch between Unicode and dense code locales, the system administrator, as root, uses i18nconfig to change the systemwide default or manually changes the symbolic link /usr/i18n/lib/nls/dloc from ./ucsloc to ./loc.
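The manual alternative amounts to retargeting one symbolic link. The following Python sketch shows that operation on a scratch directory that merely mimics the layout named above; it does not touch /usr/i18n/lib/nls, and the helper name retarget_symlink is hypothetical. On a live system the change requires root, and i18nconfig remains the documented route.

```python
import os
import tempfile


def retarget_symlink(link_path: str, new_target: str) -> None:
    """Point link_path at new_target, replacing any existing link in one rename."""
    tmp_link = link_path + ".tmp"
    os.symlink(new_target, tmp_link)
    # Renaming over the old link replaces the link itself, not what it points to.
    os.replace(tmp_link, link_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as nls_dir:
        # Stand-ins for the two locale directories.
        os.mkdir(os.path.join(nls_dir, "loc"))
        os.mkdir(os.path.join(nls_dir, "ucsloc"))
        dloc = os.path.join(nls_dir, "dloc")
        os.symlink("./ucsloc", dloc)

        retarget_symlink(dloc, "./loc")
        print(os.readlink(dloc))
```

Creating the new link under a temporary name and renaming it avoids a moment in which the link does not exist.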

8.6 Unicode Support

Tru64 UNIX supports the Unicode Standard Version 3.1 and ISO 10646 standards through a set of UCS-4 and UTF-8 based locales. Codeset conversion among the UCS-4 (UTF-32), UCS-2 (UTF-16), and UTF-8 formats is provided for all supported codesets. Conversion is also provided between Unicode and a number of single-byte PC code pages, and from those PC code pages to the ISO Latin codeset. For more information on the Unicode locales, see Unicode(5).

8.7 Configure International Software Utility

The Configure International Software utility allows system administrators to manage country support subsets, Asian terminal drivers, installed font files, the local language settings and input method, user accounts, and the Japanese Input Method (Wnn). Configuration of these WLS options establishes an operating system environment for writing and using internationalized applications. These options also allow system administrators and users to display keyboard mappings.

The Configure International Software utility is a menu-oriented function available from the SysMan Menu under the Software option. You must be root, or have the appropriate system administrator privileges, to use the Configure International Software utility to do the following:

View and delete installed support for selected countries. Nonroot users can only view current country support.

Configure support options, including Asian terminal driver support, Thai language support, pseudo terminal drivers with static or dynamic linking, the number of UNIX Terminal Extension (UTX) devices, and the rebuilding of the kernel after changes. Nonroot users cannot perform this task.

View and delete installed fonts. Nonroot users can only view installed fonts.

View installed keyboard map files and sort the display. Nonroot users can also view and sort installed keyboard map files.

View installed locales (consisting of installed languages, country support, and codesets), sort the display, change the system default locale, switch between dense code and Unicode locales, and select a locale input method. Nonroot users can only view and sort the display of installed locales.

Configure user, root, and system accounts for WLS support. Nonroot users can configure only the account from which they started the Configure International Software utility.

Configure Wnn, a character-cell input method for Japanese. Nonroot users can only view current Wnn settings.

8.8 Support for the Euro Currency Symbol

Tru64 UNIX supports the euro currency symbol. Locales that use the UTF-8 or Latin-9 (ISO 8859-15) codesets support the euro character, while locales with a @euro suffix define the local currency sign to be the euro character.

The locale en_EU.UTF-8@euro is an English locale that provides support for the euro symbol, with a comma as the decimal separator and a period as the thousands separator. Printer support for the euro character is enabled by a generic PostScript print filter (wwpsof).

Keyboard entry of the euro character is supported by key sequences defined in keymaps and through use of the Compose key. Also, codeset converters convert file data between the various encoding formats that support the euro character. See the euro(5) and wwpsof(8) reference pages for more information.
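The following Python sketch illustrates two points from this section: how the euro character is encoded in the codesets mentioned above, and what the numeric conventions described for en_EU.UTF-8@euro look like. It uses Python's own codecs and plain string formatting rather than the operating system's locale database, and the function names are illustrative. Placement of the currency symbol is deliberately left out because this chapter does not describe it.

```python
from typing import Optional


def euro_bytes(codeset: str) -> Optional[bytes]:
    """Return the euro sign encoded in the given codeset, or None if it has none."""
    try:
        return "\u20ac".encode(codeset)
    except UnicodeEncodeError:
        return None


def format_number_en_eu(value: float) -> str:
    """Format with a period as thousands separator and a comma as decimal separator."""
    text = f"{value:,.2f}"  # comma grouping, period decimal
    # Swap the two separators in a single pass.
    return text.translate(str.maketrans(",.", ".,"))


if __name__ == "__main__":
    for codeset in ("iso8859_15", "utf-8", "iso8859_1"):
        print(codeset, euro_bytes(codeset))
    print(format_number_en_eu(1234567.891))
```

Latin-1 (ISO 8859-1) has no position for the euro character, which is why the Latin-9 and UTF-8 locales are the ones that support it.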

8.9 The dxim Input Server

The multilingual input server dxim gives you the means to use and manage input methods for Korean, as well as traditional and simplified Chinese.

The dxim input server menu has two functional parts: Customizing Input Method Classes and Methods and Customizing Input Method Window.

Customizing Input Method Classes and Methods allows you to do the following:
- Select a class of input methods that is appropriate to the locale of the client application. For an application internationalized for the Chinese language, you select and activate one or more of the following classes: traditional Chinese, simplified Chinese, or Phrase.
- Select and activate one or more input methods within a class. With the exception of the Phrase input method, the traditional and simplified Chinese classes under dxim support the same set of input methods as dxhanziim and dxhanyuim. The Phrase input method is a separate class under dxim and uses a different database than that used by the operating system Phrase Utility.
- Establish an input method class as the default.
- Establish an input method as the default for its class.
- Customize the simplified Chinese 5-Shape and Intelligent ABC input method classes.
- Customize error bell volume and set the input method invocation key.

Customizing Input Method Window allows you to do the following:
- Increase or decrease the root input window font size.
- Set the root input window foreground and background color.
- Set the root input window line spacing.

The dxim input server can support multiple clients working under different locales. When a client application connects to dxim, the input server determines the client's locale and, if compatible, uses the default input method. If the client locale is not compatible with the default input method, dxim searches for an active input method that is compatible. The input server uses the first compatible input method it finds.
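The selection rule described in the preceding paragraph can be sketched as follows. This is a model of the described behavior, not dxim's implementation: the InputMethod record, the method names, and the use of a set of locale names as the compatibility test are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class InputMethod:
    name: str              # illustrative label, not a real dxim method name
    locales: Set[str]      # locales this method is assumed to be compatible with
    active: bool = True


def choose_input_method(
    client_locale: str, default: InputMethod, methods: List[InputMethod]
) -> Optional[InputMethod]:
    """Use the default if compatible; otherwise the first compatible active method."""
    if client_locale in default.locales:
        return default
    for method in methods:
        if method.active and client_locale in method.locales:
            return method
    return None  # no compatible input method is active


if __name__ == "__main__":
    simplified = InputMethod("simplified-chinese-method", {"zh_CN.dechanzi", "zh_CN.UTF-8"})
    korean_off = InputMethod("korean-method-1", {"ko_KR.eucKR"}, active=False)
    korean_on = InputMethod("korean-method-2", {"ko_KR.eucKR"})
    methods = [simplified, korean_off, korean_on]

    print(choose_input_method("zh_CN.UTF-8", simplified, methods).name)
    print(choose_input_method("ko_KR.eucKR", simplified, methods).name)
```

Because the first compatible method wins, the order in which active methods are searched matters when several of them can serve the same locale.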

For additional information on the dxim input server, see dxim(1X) and the dxim online help.

8.10 Internationalized Curses Library

The operating system supplies an internationalized Curses library in conformance with X/Open Curses, Issue 4 Version 2. This library provides functions for processing characters that span one or multiple bytes. These characters may be in either wide-character (wchar_t) or complex-character (cchar_t) formats. The complex-character format provides for a single logical character made up of multiple wide characters. Some of the components of the complex character may be nonspacing characters.

For information on the syntax and effect of Curses interfaces, see curses(3). For a description of the enhancements provided by the internationalized Curses routines, and their relationship to previous Curses routines, see Writing Software for the International Market.
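The idea behind the complex-character format, one logical character built from a spacing character plus any nonspacing characters attached to it, can be illustrated in Python. The sketch below is an analogy only: it groups the code points of a Python string using the standard unicodedata module and does not use the Curses library or the cchar_t type. The function name is illustrative.

```python
import unicodedata


def split_complex_characters(text: str) -> list:
    """Group each character with the nonspacing characters that follow it."""
    clusters = []
    for ch in text:
        # Categories Mn and Me are the nonspacing and enclosing marks.
        is_nonspacing = unicodedata.category(ch) in ("Mn", "Me")
        if clusters and is_nonspacing:
            clusters[-1] += ch      # attach to the preceding base character
        else:
            clusters.append(ch)     # start a new logical character
    return clusters


if __name__ == "__main__":
    # 'e' followed by a combining acute accent, then a plain 'a'.
    print(split_complex_characters("e\u0301a"))
    # A Thai consonant followed by a nonspacing vowel sign.
    print(split_complex_characters("\u0e01\u0e34"))
```

Each resulting group occupies one display position even though it holds several wide characters, which is the situation the complex-character interfaces are designed to handle.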

8.11 Additional Internationalization Features

Tru64 UNIX supports the following internationalization utilities and features:

Base tty terminal driver subsystem
This subsystem includes additional BSD line disciplines and STREAMS terminal driver modules for processing data in Chinese, Japanese, Korean, and Thai. For example, the enhanced terminal subsystem supports the following capabilities for these languages:
- Japanese Kana-Kanji conversion input method
- Character-based line processing in cooked mode
- Input line history and editing (BSD line discipline only)
- Software on-demand-loading for user-defined characters
- Conversion between terminal code and application code

The asort utility
This utility, an extension of the sort command, allows characters of ideographic languages, such as Chinese and Japanese, to be sorted according to multiple collation sequences. For more information on the asort utility, see asort(1).

Multilingual Emacs editor (MULE) for Asian languages
Mule is a multilingual enhancement to GNU Emacs. It provides a facility to display, input, and edit multilingual characters in addition to all GNU Emacs facilities. See mule(1) for more information.

User-defined characters in Chinese, Japanese, and Korean
Users can create and define character fonts and their attributes, including bitmap fonts, with the cedit and cgen utilities. Font-rendering facilities are available so that X clients can use UDC databases through the X server or font server to obtain bitmap fonts for user-defined characters.
For more information on user-defined characters, see Writing Software for the International Market, cedit(1), and cgen(1).

Printing plain text and PostScript files for various languages
Tru64 UNIX provides outline fonts for high quality printing on PostScript printers. In addition to print filters for a variety of local-language printers, generic internationalized print filters are available for use with a variety of printers. One of these filters, wwpsof, supports printing of local-language files on PostScript printers that do not include the required fonts.
For more information on internationalized printing features, see the i18n_printing(5), pcfof(8), and wwpsof(8) reference pages.

Mail and 8-Bit Character Support
By default, the operating system provides support for 8-bit character encoding in mailx, dtmail, MH, and comsat. See mailx(1), dtmail(1), mh(1), and comsat(8) for more information on these mail utilities.

The file command
This command recognizes UCS-2 and UCS-4 encoding in any locale setting. For other encoding formats, the command recognizes file data encoding if it is valid for the current locale setting. This command also has a jfile alias that, in any locale, can recognize DEC Kanji, Japanese EUC, Shift JIS, and 7-bit JIS encoding.

Internationalization for graphical applications
Motif Version 1.2.3 takes advantage of many of the internationalization features of X11R6 and the C library to support locales. Motif Version 1.2.3 also supports the use of alternate input methods, which allows input of non-ISO Latin-1 keystrokes, and delivers an extensively rewritten XmText widget, which supports multibyte and wide-character format and on-the-spot input style.
Motif supports multibyte and wide-character encoding through the use of the internationalized X Library functions and C Library functions. In addition, the compound string routines include the X11R6 XFontSet component to allow for the creation of localized strings.
The User Interface Language (UIL) supports the creation of localized UID files through the UIL compiler's -s compile-time option, which causes the compiler to construct localized strings.
Alternate input methods can be specified by a resource on the VendorShell widget. Widgets that are parented by a Shell class widget can take advantage of this resource and register themselves to a specific method for input.
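As an illustration of what sorting by multiple collation sequences means for the asort utility listed earlier in this section, the following Python sketch orders the same four Chinese characters in two ways. The small key table is hand-made for this example and is not the collation data that asort or the @pinyin and @stroke locales use; the function name is hypothetical.

```python
# Hand-made sample keys for four common characters (illustrative only).
SAMPLE_KEYS = {
    "一": {"pinyin": "yi", "stroke": 1},
    "人": {"pinyin": "ren", "stroke": 2},
    "大": {"pinyin": "da", "stroke": 3},
    "中": {"pinyin": "zhong", "stroke": 4},
}


def sort_by(chars, sequence: str) -> list:
    """Sort characters using the named collation sequence from SAMPLE_KEYS."""
    return sorted(chars, key=lambda ch: SAMPLE_KEYS[ch][sequence])


if __name__ == "__main__":
    chars = "中大人一"
    print("by stroke count:", "".join(sort_by(chars, "stroke")))
    print("by pinyin:      ", "".join(sort_by(chars, "pinyin")))
```

The same input yields a different order under each sequence, which is why a single fixed ordering is not sufficient for ideographic text.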