8    Internationalization

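This chapter describes the internationalization features of Tru64 UNIX. The first section provides a brief overview of internationalization (Section 8.1). The remaining sections discuss the following topics: supported languages (Section 8.2), locale creation (Section 8.3), codeset conversion (Section 8.4), Unicode and dense code locales (Section 8.5), Unicode support (Section 8.6), the Configure International Software utility (Section 8.7), support for the euro currency symbol (Section 8.8), the dxim input server (Section 8.9), the internationalized Curses library (Section 8.10), and additional internationalization features (Section 8.11).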

8.1    Overview

The term "internationalization" is formally defined by The Open Group as a

"provision within a computer program of the capability of making itself adaptable to the requirements of different native languages, local customs, and coded character sets"

This essentially means that internationalized programs can run in any supported locale without modification. A locale is a software environment that correctly handles the cultural conventions of a particular geographic area, such as China or France, and a language as it is used in that area. For example, when a user selects a Chinese locale, all commands, system messages, and keystrokes can be in Chinese characters and displayed in a way appropriate for Chinese.
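As an illustration of how a program adopts a locale at run time, the following is a minimal sketch that uses the standard C setlocale() function. The helper names select_locale() and current_locale() are hypothetical and were chosen for this example only; which locale names succeed depends on the locales installed on the system.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: try to switch every locale category of the
 * process to the named locale. Returns 1 on success, or 0 if the
 * system does not accept that locale name. */
int select_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}

/* Hypothetical helper: return the name of the locale currently in
 * effect for LC_ALL. Passing NULL queries without changing anything. */
const char *current_locale(void)
{
    return setlocale(LC_ALL, NULL);
}
```

A program would typically call select_locale("") once at startup so that the locale is taken from the user's environment, and fall back to the C locale if that call fails.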

Tru64 UNIX is an internationalized operating system that not only allows users to interact with existing applications in their native language, but also supports a full set of application interfaces, referred to as the Worldwide Portability Interfaces (WPI), to enable software developers to write internationalized applications. The original code for these interfaces came from the Open Software Foundation (OSF) and has been enhanced.

The internationalization support in the operating system conforms to The Open Group's CAE specifications for system interfaces and headers (XSH Issue 5), curses (XCURSES Issue 4.2), and commands and utilities (XCU Issue 5). These specifications align with current POSIX and ISO C standards. This conformance ensures that commands, utilities, and libraries have been internationalized, and their corresponding message catalogs have been included in the base system.

Tru64 UNIX conforms to the Chinese Character Input Standard, GB18030-2000, which went into effect on September 1, 2001.

In addition, the operating system supports the X Input Method (XIM) and X Output Method (XOM) to facilitate input of local language characters, text drawing, measurement, and interclient communication. These functions are implemented according to the X11R6.3 specification and include some problem corrections specified by X11R6.4.

Note that the operating system also supports a 32-bit wchar_t datatype, which in turn enables support for a wide array of codesets, including the one defined by the ISO 10646 standard.

See the following information about internationalization on the Tru64 UNIX operating system:

8.2    Supported Languages

Most locales are included in Worldwide Language Support (WLS) subsets that are optionally installed. Some, as indicated in Table 8-1, are part of the mandatory base operating system.

Locales whose names end in .UTF-8 use file code and internal process code (wchar_t encoding) defined in the ISO 10646 and Unicode standards. Other, non-UTF-8 Unicode locales use traditional UNIX and proprietary codesets for the file code while using UTF-32 as the internal process code. A subset of these Unicode locales has a @ucs4 modifier; however, they are the same as the locales without the @ucs4 modifier.

The universal.UTF-8 locale is also available (for use by applications rather than end users). It supports the complete set of characters in the universal character set (UCS). See Unicode(5) for more information about encoding formats.

UTF-8 and Latin-9 (ISO 8859-15) locales support the euro currency symbol.

For the most up-to-date list of supported languages and locales, refer to the l10n_intro(5) reference page.

Table 8-1 lists the languages supported by the operating system and their corresponding locales.

Table 8-1:  Languages and Locales

Language                          Locale Name
--------                          -----------
Catalan                           ca_ES.ISO8859-1 [Footnote 2]
                                  ca_ES.ISO8859-15
                                  ca_ES.UTF-8

Chinese, Simplified (PRC)         zh_CN.UTF-8
                                  zh_CN.dechanzi
                                  zh_CN.dechanzi@ucs4
                                  zh_CN.dechanzi@pinyin
                                  zh_CN.dechanzi@pinyin@ucs4
                                  zh_CN.dechanzi@radical
                                  zh_CN.dechanzi@radical@ucs4
                                  zh_CN.dechanzi@stroke
                                  zh_CN.dechanzi@stroke@ucs4
                                  zh_CN.GBK
                                  zh_CN.GB18030

Chinese, Traditional (Hong Kong)  zh_HK.big5
                                  zh_HK.dechanyu
                                  zh_HK.dechanyu@ucs4
                                  zh_HK.dechanzi
                                  zh_HK.dechanzi@ucs4
                                  zh_HK.eucTW
                                  zh_HK.eucTW@ucs4
                                  zh_HK.UTF-8

Chinese, Traditional (Taiwan)     zh_TW.big5
                                  zh_TW.big5@chuyin
                                  zh_TW.big5@radical
                                  zh_TW.big5@stroke
                                  zh_TW.dechanyu
                                  zh_TW.dechanyu@ucs4
                                  zh_TW.dechanyu@chuyin
                                  zh_TW.dechanyu@chuyin@ucs4
                                  zh_TW.dechanyu@radical
                                  zh_TW.dechanyu@radical@ucs4
                                  zh_TW.dechanyu@stroke
                                  zh_TW.dechanyu@stroke@ucs4
                                  zh_TW.eucTW
                                  zh_TW.eucTW@ucs4
                                  zh_TW.eucTW@chuyin
                                  zh_TW.eucTW@chuyin@ucs4
                                  zh_TW.eucTW@radical
                                  zh_TW.eucTW@radical@ucs4
                                  zh_TW.eucTW@stroke
                                  zh_TW.eucTW@stroke@ucs4
                                  zh_TW.UTF-8

Czech                             cs_CZ.ISO8859-2
                                  cs_CZ.ISO8859-2@ucs4

Danish                            da_DK.ISO8859-1 [Footnote 2]
                                  da_DK.ISO8859-15
                                  da_DK.UTF-8

Dutch                             nl_NL.ISO8859-1 [Footnote 2]
                                  nl_NL.ISO8859-15
                                  nl_NL.UTF-8

Dutch, Belgian                    nl_BE.ISO8859-1 [Footnote 2]
                                  nl_BE.ISO8859-15
                                  nl_BE.UTF-8

English, U.S. (ASCII)             C (POSIX) [Footnote 2]

English, U.S.                     en_US.ISO8859-1 [Footnote 2]
                                  en_US.ISO8859-15
                                  en_US.cp850
                                  en_US.UTF-8
                                  en_US.UTF-8@euro [Footnote 3]

English, U.K.                     en_GB.ISO8859-1 [Footnote 2]
                                  en_GB.ISO8859-15
                                  en_GB.UTF-8

European                          en_EU.UTF-8@euro [Footnote 4]

Finnish                           fi_FI.ISO8859-1 [Footnote 2]
                                  fi_FI.ISO8859-15
                                  fi_FI.UTF-8

French                            fr_FR.ISO8859-1 [Footnote 2]
                                  fr_FR.ISO8859-15
                                  fr_FR.UTF-8

French, Belgian                   fr_BE.ISO8859-1 [Footnote 2]
                                  fr_BE.ISO8859-15
                                  fr_BE.UTF-8

French, Canadian                  fr_CA.ISO8859-1 [Footnote 2]
                                  fr_CA.ISO8859-15
                                  fr_CA.UTF-8

French, Swiss                     fr_CH.ISO8859-1 [Footnote 2]
                                  fr_CH.ISO8859-15
                                  fr_CH.UTF-8

German                            de_DE.ISO8859-1 [Footnote 2]
                                  de_DE.ISO8859-15
                                  de_DE.UTF-8

German, Swiss                     de_CH.ISO8859-1 [Footnote 2]
                                  de_CH.ISO8859-15
                                  de_CH.UTF-8

Greek                             el_GR.ISO8859-7
                                  el_GR.ISO8859-7@ucs4
                                  el_GR.UTF-8

Hebrew                            he_IL.ISO8859-8
                                  he_IL.ISO8859-8@ucs4

Hungarian                         hu_HU.ISO8859-2
                                  hu_HU.ISO8859-2@ucs4

Icelandic                         is_IS.ISO8859-1 [Footnote 2]
                                  is_IS.ISO8859-15

Italian                           it_IT.ISO8859-1 [Footnote 2]
                                  it_IT.ISO8859-15
                                  it_IT.UTF-8

Japanese                          ja_JP.eucJP
                                  ja_JP.SJIS
                                  ja_JP.SJIS@ucs4
                                  ja_JP.deckanji
                                  ja_JP.deckanji@ucs4
                                  ja_JP.sdeckanji
                                  ja_JP.UTF-8

Korean                            ko_KR.deckorean
                                  ko_KR.deckorean@ucs4
                                  ko_KR.eucKR
                                  ko_KR.KSC5601
                                  ko_KR.UTF-8

Lithuanian                        lt_LT.ISO8859-4
                                  lt_LT.ISO8859-4@ucs4

Norwegian                         no_NO.ISO8859-1 [Footnote 2]
                                  no_NO.ISO8859-15
                                  no_NO.UTF-8

Polish                            pl_PL.ISO8859-2
                                  pl_PL.ISO8859-2@ucs4

Portuguese                        pt_PT.ISO8859-1 [Footnote 2]
                                  pt_PT.ISO8859-15
                                  pt_PT.UTF-8

Russian                           ru_RU.ISO8859-5
                                  ru_RU.ISO8859-5@ucs4

Slovak                            sk_SK.ISO8859-2
                                  sk_SK.ISO8859-2@ucs4

Slovene                           sl_SI.ISO8859-2
                                  sl_SI.ISO8859-2@ucs4

Spanish                           es_ES.ISO8859-1 [Footnote 2]
                                  es_ES.ISO8859-15
                                  es_ES.UTF-8

Swedish                           sv_SE.ISO8859-1 [Footnote 2]
                                  sv_SE.ISO8859-15
                                  sv_SE.UTF-8

Thai                              th_TH.TACTIS

Turkish                           tr_TR.ISO8859-9
                                  tr_TR.ISO8859-9@ucs4

Note that you can switch languages or character sets as necessary and can even operate multiple processes in different languages or codesets in the same system at the same time.

For more information on a particular coded character set, such as ISO8859-9, see the reference page with the same name. For more information about UCS-4 and UTF-8 encoding, see Unicode(5). For more information about PC code pages, see code_page(5).

8.3    Locale Creation

The localedef utility allows programmers to create their own locales, compile the locale source files, and assign a unique name to the new locale.

For more information on creating locales, see Writing Software for the International Market.

8.4    Codeset Conversion

The operating system includes the iconv utility and the iconv_open(), iconv(), and iconv_close() functions, which convert text from one codeset to another, thereby assisting programmers in the writing of international applications. For use with these interfaces, the operating system includes a large set of codeset converters.
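The three functions are normally used together: open a conversion descriptor, convert, then close. The following is a minimal sketch of that sequence. The wrapper name convert_text() and its buffer-handling policy are assumptions made for this example, and the codeset names a given system accepts vary (the names used by the Tru64 UNIX converters are listed in iconv_intro(5)).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper: convert the NUL-terminated string `in` from
 * codeset `from` to codeset `to`. Writes at most outsz-1 bytes plus a
 * terminating NUL into `out`. Returns the number of bytes written, or
 * -1 on any error (unknown codeset, invalid input, output too small). */
long convert_text(const char *from, const char *to,
                  const char *in, char *out, size_t outsz)
{
    iconv_t cd;
    char *inp = (char *)in;      /* iconv() takes a non-const pointer */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsz == 0)
        return -1;
    outleft = outsz - 1;         /* reserve room for the terminator */

    cd = iconv_open(to, from);   /* note the order: target, then source */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift sequence for stateful codesets. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outp = '\0';
    return (long)(outp - out);
}
```

For example, converting the four-byte Latin-1 string "caf\xE9" to UTF-8 with this wrapper yields five bytes, because the accented letter becomes a two-byte sequence.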

The en_US.UTF-8 X locale database file contains font definitions that include all the various fonts used with the operating system. Thus, applications running under the en_US.UTF-8 locale can display all the font characters installed with Worldwide Language Support (WLS). Applications running under the Asian locales display all of the WLS installed fonts, except for ISO8859-2, -4, -5, -7, -8, -9, and TACTIS.

In addition to conversion between different codesets for the same language, these converters support conversion between different Unicode formats, such as UCS-2, UCS-4, and UTF-8. There are also codeset converters that handle the most commonly used PC code-page formats.

Codeset conversion is also used by the printing subsystem and utilities, such as man, to allow processing of files in different languages and encoding formats. Additionally, codeset conversion is implemented in mail utilities for mail interchange with systems using different codesets and in the X Window System Toolkit for text input, drawing, and interclient communication. For more information on codeset conversion, see the iconv_intro(5) reference page. See the Unicode(5) and code_page(5) reference pages for a discussion of converters for Unicode encoding formats and PC code-page formats, respectively.

8.5    Unicode and Dense Code Locales

When you install Worldwide Language Support, Tru64 UNIX provides localization support with two types of locales: Unicode locales and dense code locales.

Unicode locales conform to Unicode and ISO/IEC 10646 standards and use UTF-32 as the wide character encoding. Under UTF-32 wide character encoding, wchar_t values represent the same characters regardless of the locale and, because Unicode standards prevail, implementation is consistent across platforms.
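To see what the wide character encoding looks like from a program, the following minimal sketch decodes a multibyte string into wchar_t values with the standard mbstowcs() function. The helper name decode_wide() is hypothetical. Under a locale that uses UTF-32 as the wide character encoding, the stored values are the Unicode code points of the characters; the values produced under a dense code locale may differ for characters outside the basic character set.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: decode a multibyte string into wide characters
 * using the current locale. Stores at most `max` wide characters in
 * `out`; the result is NUL-terminated only if fewer than `max` were
 * stored. Returns the number of wide characters stored (excluding any
 * terminator), or -1 if the input is not valid in the current locale. */
long decode_wide(const char *mbs, wchar_t *out, size_t max)
{
    size_t n = mbstowcs(out, mbs, max);
    return n == (size_t)-1 ? -1 : (long)n;
}
```

A caller would first select a locale with setlocale() and could then print each element of the output array in hexadecimal to inspect the process code.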

Dense code locales use dense code for wide character encoding to minimize table size (that is, codepoints are assigned consecutively with no empty positions).

In addition to UTF-8 locales, which use ISO 10646 (Unicode) as both the internal and external representation of characters, the dense code and Unicode locales provide functionally equivalent versions of many locales.

The dense code locales are those with names that end in a code set other than UTF-8 (for example, ISO8859-1, eucJP, GB18030). The non-UTF-8 Unicode locales are those that include @ucs4 at the end of the locale name. A sample pair of dense and Unicode locales is pl_PL.ISO8859-2 and pl_PL.ISO8859-2@ucs4.
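The naming convention described above can be expressed as a small classifier. This is only a sketch of the convention as stated in this section, not a system interface: the enum, the function names, and the decision to treat names that carry other modifiers (such as @stroke) as dense code are assumptions made for this example.

```c
#include <string.h>

/* Hypothetical classification of a locale name by its spelling alone. */
enum locale_kind { KIND_UTF8, KIND_UCS4, KIND_DENSE, KIND_UNKNOWN };

/* Return nonzero if string s ends with the given suffix. */
static int ends_with(const char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    return ls >= lx && strcmp(s + ls - lx, suffix) == 0;
}

enum locale_kind classify_locale(const char *name)
{
    const char *dot = strchr(name, '.');

    if (dot == NULL)
        return KIND_UNKNOWN;            /* for example "C": no codeset part */
    if (ends_with(name, "@ucs4"))
        return KIND_UCS4;               /* non-UTF-8 Unicode locale */
    if (strncmp(dot + 1, "UTF-8", 5) == 0)
        return KIND_UTF8;               /* also matches UTF-8@euro */
    return KIND_DENSE;                  /* any other codeset name */
}
```

With this sketch, pl_PL.ISO8859-2 is classified as dense code and pl_PL.ISO8859-2@ucs4 as Unicode, matching the sample pair above.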

In general, the same charmaps and locale source can be used for dense code and Unicode locales. However, characters that are not defined in the LC_COLLATE section of the locale source may sort differently in the two types of locales.

For Latin-1 locales (ISO 8859-1), the dense code and Unicode locales are identical because Latin-1 characters are the same as the first 256 characters in Unicode. The operating system also supports three UCS transformation formats (UTFs): UTF-8, UTF-16, and UTF-32, all of which are defined in the Unicode standard. See Unicode(5) for a full description of Unicode, UCS-4, and the transformation formats.

To switch between Unicode and dense code locales, the system administrator, as root, uses i18nconfig to change the systemwide default or manually changes the symbolic link /usr/i18n/lib/nls/dloc from ./ucsloc to ./loc.

8.6    Unicode Support

Tru64 UNIX supports the Unicode Standard Version 3.1 and ISO 10646 standards through a set of UCS-4 and UTF-8 based locales. Codeset conversion capability among UCS-4 (UTF-32), UCS-2 (UTF-16), and UTF-8 formats is provided for all supported codesets. Conversion support between Unicode and a number of single-byte PC code pages and from those PC code pages to the ISO Latin codeset is provided. For more information on the Unicode locales, see Unicode(5).
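To make the relationship between the UCS-4 and UTF-8 formats concrete, the following sketch encodes a single code point by hand, following the UTF-8 bit layout defined in the Unicode standard. The function name ucs4_to_utf8() is hypothetical, and this is not how the system converters are implemented; applications would normally rely on the iconv interfaces instead.

```c
#include <stddef.h>

/* Hypothetical helper: encode one UCS-4 (UTF-32) code point as UTF-8.
 * Writes 1 to 4 bytes into out, which must hold at least 4 bytes, and
 * returns the byte count. Returns 0 if cp is a surrogate or lies
 * beyond U+10FFFF, because such values are not valid characters. */
size_t ucs4_to_utf8(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                    /* 7 bits: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                 /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, the euro sign (U+20AC) encodes to the three bytes E2 82 AC.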

8.7    Configure International Software Utility

The Configure International Software utility allows system administrators to manage country support subsets, Asian terminal drivers, installed font files, the local language settings and input method, user accounts, and the Japanese Input Method (Wnn). Configuration of these WLS options establishes an operating system environment for writing and using internationalized applications. These options also allow system administrators and users to display keyboard mappings.

The Configure International Software utility is a menu-oriented function available from the SysMan Menu under the Software option. You must be root, or have the appropriate system administrator privileges, to use the Configure International Software utility to do the following:

8.8    Support for the Euro Currency Symbol

Tru64 UNIX supports the euro currency symbol. Locales that use the UTF-8 or Latin-9 (ISO 8859-15) codesets support the euro character, while locales with a @euro suffix define the local currency sign to be the euro character.

The locale en_EU.UTF-8@euro is an English locale that provides support for the euro symbol and uses a comma as the decimal separator and a period as the thousands separator. Printer support for the euro character is enabled by a generic PostScript print filter (wwpsof).

Keyboard entry of the euro character is supported by key sequences defined in keymaps and through use of the Compose key. Also, codeset converters convert file data between the various encoding formats that support the euro character. See the euro(5) and wwpsof(8) reference pages for more information.
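An application can inspect the monetary conventions that a locale defines through the standard localeconv() function. The following is a minimal sketch; the helper name describe_monetary() and its output format are assumptions made for this example, and the values reported depend entirely on the locale definition installed on the system.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: switch to the named locale and describe its
 * monetary formatting in `out`. Returns the length of the description,
 * or -1 if the locale could not be selected. */
int describe_monetary(const char *locale_name, char *out, size_t outsz)
{
    struct lconv *lc;

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;

    lc = localeconv();   /* monetary and numeric rules of the locale */
    return snprintf(out, outsz,
                    "currency='%s' decimal='%s' thousands='%s'",
                    lc->currency_symbol,
                    lc->mon_decimal_point,
                    lc->mon_thousands_sep);
}
```

In the C locale all three fields are empty strings. Under a locale with the @euro suffix, the currency field would be expected to contain the euro character in that locale's codeset.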

8.9    The dxim Input Server

The multilingual input server dxim gives you the means to use and manage input methods for Korean, as well as traditional and simplified Chinese.

The dxim input server menu has two functional parts: Customizing Input Method Classes and Methods and Customizing Input Method Window.

The dxim input server can support multiple clients working under different locales. When a client application connects to dxim, the input server determines the client's locale and, if compatible, uses the default input method. If the client locale is not compatible with the default input method, dxim searches for an active input method that is compatible. The input server uses the first compatible input method it finds.
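The selection rule described above can be sketched as follows. This is an illustration of the stated behavior only, not dxim source code: the structure, the function names, and in particular the compatibility test (a simple match on the language and territory prefix of the locale name) are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one input method known to the server. */
struct input_method {
    const char *name;
    const char *locale_prefix;  /* assumed rule: locales it can serve */
    int active;                 /* nonzero if currently active */
};

/* Assumed compatibility test: the client locale starts with the prefix. */
static int is_compatible(const struct input_method *im,
                         const char *client_locale)
{
    return strncmp(client_locale, im->locale_prefix,
                   strlen(im->locale_prefix)) == 0;
}

/* Use the default method if it is compatible with the client locale;
 * otherwise return the first active compatible method, or NULL if
 * none is found. */
const struct input_method *pick_input_method(
    const struct input_method *methods, size_t count,
    size_t default_index, const char *client_locale)
{
    size_t i;

    if (default_index < count &&
        is_compatible(&methods[default_index], client_locale))
        return &methods[default_index];

    for (i = 0; i < count; i++)
        if (methods[i].active && is_compatible(&methods[i], client_locale))
            return &methods[i];

    return NULL;
}
```

Because the search stops at the first match, the order in which active input methods are listed determines which one a client receives when the default is not compatible.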

For additional information on the dxim input server, see dxim(1X) and the dxim online help.

8.10    Internationalized Curses Library

The operating system supplies an internationalized Curses library in conformance with X/Open Curses, Issue 4 Version 2. This library provides functions for processing characters that span one or multiple bytes. These characters may be in either wide-character (wchar_t) or complex-character (cchar_t) formats. The complex-character format provides for a single logical character made up of multiple wide characters. Some of the components of the complex character may be nonspacing characters.

For information on the syntax and effect of Curses interfaces, see curses(3). For a description of the enhancements provided by the internationalized Curses routines, and their relationship to previous Curses routines, see Writing Software for the International Market.

8.11    Additional Internationalization Features

Tru64 UNIX supports the following internationalization utilities and features: