The Compaq Tru64 UNIX operating system fully supports the following Korean codesets by including locales and codeset conversion support:
It also provides codeset conversion support for the following codesets:
The ASCII, KSC5636-1993 (KS Roman), and KSC5601-1992 character sets (excluding the additional Hangul characters defined in Annex 3 of the standard) are combined to form the DEC Korean codeset, which is denoted as deckorean.
DEC Korean uses a two-byte data representation for symbols and ideographic characters defined in KSC5601-1992. To differentiate KSC5601-1992 characters from ASCII, the most significant bit (MSB) of both bytes of a KSC5601 character is always set.
The first byte of a two-byte code determines its row number, while the second determines its column number. The following formula illustrates the code of a two-byte KSC5601 character in relation to its row and column numbers:
1st byte = A0 (hex) + row number
2nd byte = A0 (hex) + column number
For example, if a character is at the first column of the 36th row, its encoded value is calculated as follows:
1st byte = A0 (hex) + 36 (decimal) = C4 (hex)
2nd byte = A0 (hex) + 01 (decimal) = A1 (hex)
In this case, the character code is C4A1.
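The row-and-column formula can be checked with a few lines of Python. This is a minimal sketch: the function name `ksc5601_code` is made up for illustration, and the 1 to 94 bounds are an assumption based on the usual 94 x 94 layout of KSC5601, which the text above does not state.

```python
def ksc5601_code(row: int, column: int) -> int:
    """Return the two-byte DEC Korean code for a KSC5601 row/column pair."""
    # Assumption: rows and columns are numbered 1..94 (94 x 94 grid).
    if not (1 <= row <= 94 and 1 <= column <= 94):
        raise ValueError("row and column must be in 1..94")
    first = 0xA0 + row       # 1st byte = A0 (hex) + row number
    second = 0xA0 + column   # 2nd byte = A0 (hex) + column number
    # Both bytes are >= 0xA1, so the MSB of each byte is set.
    return (first << 8) | second


# Character at the first column of the 36th row.
print(f"{ksc5601_code(36, 1):04X}")
```

For row 36, column 1 the sketch yields C4A1, matching the worked example.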
Figure 2-2 illustrates the division of a two-byte code space and the position of KSC5601-1992 characters.
Extended UNIX Code (EUC) is an encoding methodology that allows concurrent use of up to four code sets in a data stream. Korean EUC uses that method to combine ASCII and KSC5601. Korean EUC is currently identical to DEC Korean, and is denoted as eucKR.
Microsoft developed Unified Hangul Code (UHC), also known as "Extended Wansung", for its Windows 95 operating system. It is an optional character set of Win95K. Microsoft calls this Code Page 949.
UHC provides full compatibility with KSC5601-1992 EUC encoding, but adds encoding ranges to hold further precombined Hangul characters (more precisely, the 8,822 that are needed to fully support the Johab character set). The following tables provide the encoding ranges for UHC encoding:
| Two-Byte Standard Characters | Encoding Ranges |
|---|---|
| First byte range | 0x81-0xFE |
| Second byte ranges | 0x41-0x5A, 0x61-0x7A, 0x81-0xFE |

| One-Byte Characters | Encoding Range |
|---|---|
| ASCII | 0x21-0x7E |
Note that the encoding range 0xA1A1 through 0xFEFE is identical, in terms of character-to-code allocation, to KSC5601-1992 in EUC encoding.
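The byte ranges above can be turned into a small classifier that tells whether a two-byte UHC code falls in the KSC5601-compatible region or in the added ranges. This is only a sketch: the name `classify_uhc` is hypothetical, the trail-byte range 0x81-0xFE is included on the strength of the note about 0xA1A1 through 0xFEFE and should be treated as an assumption, and a real Code Page 949 converter applies finer-grained rules than these ranges.

```python
def classify_uhc(first: int, second: int) -> str:
    """Classify a two-byte UHC code as 'ksc5601', 'extension', or 'invalid'."""
    # First byte range: 0x81-0xFE.
    if not 0x81 <= first <= 0xFE:
        return "invalid"
    # Second byte ranges: 0x41-0x5A, 0x61-0x7A, and (assumed) 0x81-0xFE.
    trail_ok = (
        0x41 <= second <= 0x5A
        or 0x61 <= second <= 0x7A
        or 0x81 <= second <= 0xFE
    )
    if not trail_ok:
        return "invalid"
    # 0xA1A1-0xFEFE is allocated exactly as KSC5601-1992 in EUC encoding.
    if first >= 0xA1 and second >= 0xA1:
        return "ksc5601"
    return "extension"


print(classify_uhc(0xC4, 0xA1))
print(classify_uhc(0x81, 0x41))
```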
The ISO-2022-KR codeset consists of the following character sets:
The text is assumed to start in ASCII. ASCII and Korean characters are distinguished by use of the shift functions. For example, the SO code indicates that the bytes that follow are Korean characters as defined in KSC5601. The SI code returns to ASCII.
Therefore, the escape sequence, shift function and character set used in a text are as follows:
| Control Sequence | Character Set |
|---|---|
| SO | KSC5601-1992 |
| SI | ASCII |
| ESC $ ) C | Appears once in the beginning of a line before any appearance of SO characters |
Currently, the ISO-2022-KR codeset can be used in codeset conversion.
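The shift mechanism can be illustrated with a short Python sketch that walks an ISO-2022-KR byte stream and produces the equivalent Korean EUC bytes. The function name is hypothetical. The sketch assumes, following RFC 1557 rather than anything stated above, that the bytes between SO and SI are the 7-bit forms of the KSC5601 bytes, so setting the MSB of each one gives the EUC form. It does no error checking.

```python
SO, SI = 0x0E, 0x0F
DESIGNATOR = b"\x1b$)C"  # ESC $ ) C


def iso2022kr_to_euc(data: bytes) -> bytes:
    """Convert ISO-2022-KR bytes to Korean EUC bytes (illustrative only)."""
    out = bytearray()
    korean = False  # the text is assumed to start in ASCII
    i = 0
    while i < len(data):
        if data.startswith(DESIGNATOR, i):
            i += len(DESIGNATOR)  # skip the escape sequence
            continue
        byte = data[i]
        if byte == SO:
            korean = True         # following bytes are KSC5601
        elif byte == SI:
            korean = False        # back to ASCII
        elif korean and 0x21 <= byte <= 0x7E:
            out.append(byte | 0x80)  # set the MSB to get the EUC byte
        else:
            out.append(byte)
        i += 1
    return bytes(out)


sample = b"\x1b$)C" + b"\x0e" + b"GQ" + b"\x0f" + b"A"
print(iso2022kr_to_euc(sample).hex())
```

In the sample, the two bytes 0x47 0x51 inside the SO/SI pair become 0xC7 0xD1, and the trailing ASCII "A" passes through unchanged.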
UCS is a standard character encoding for the universal character set specified in the Unicode and ISO/IEC 10646 standards. UCS has two forms: UCS-2 (16-bit, or 2-octet, units) and UCS-4 (32-bit, or 4-octet, units). Unicode uses the UCS-2 form, which is commonly used on personal computers. ISO/IEC 10646 allows either UCS-2 or UCS-4 encoding. UCS-4 encoding is in use on systems that can support the larger data unit size.
The current version of the Compaq Tru64 UNIX operating system supports both UCS-2 and UCS-4 encoding. UCS-4 is available in some Korean locales, and can be used in codeset conversion. For information about codeset conversion, see Section 2.7. For information about locales, see Chapter 3, Locales.
The Unicode and ISO/IEC 10646 standards define transformation formats for the universal character set. The following UCS transformation formats (UTFs) exist mainly to transform UCS values into sequences of bytes that byte-oriented protocols can handle:
The current version of the Compaq Tru64 UNIX operating system supports UTF-8 and UTF-16 but not UTF-7. UTF-8 can be used in codeset conversion and in the UTF-8 locales. For information about codeset conversion, see Section 2.7. For information about locale variants, see Chapter 3, Locales.
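To make the difference between the UCS forms and UTF-8 concrete, the following Python sketch prints the byte sequences for one Hangul syllable. It is an illustration, not a description of the Tru64 converters: Python's `utf_16_be` and `utf_32_be` codecs are used here as stand-ins for big-endian UCS-2 and UCS-4, which is an assumption that holds for characters in the Basic Multilingual Plane such as this one.

```python
ch = "\ud55c"  # U+D55C, a precombined Hangul syllable

forms = {
    "UCS-2 (big-endian)": ch.encode("utf_16_be"),  # one 16-bit unit
    "UCS-4 (big-endian)": ch.encode("utf_32_be"),  # one 32-bit unit
    "UTF-8": ch.encode("utf_8"),                   # three bytes for this character
}

for name, data in forms.items():
    print(f"{name}: {data.hex()}")
```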
The iconv utility provided by Compaq Tru64 UNIX converts characters from one codeset to another and writes the results to standard output. The Korean codeset converters provided are shown in Table 2-1.
| | DEC Korean | Korean EUC | ISO-2022-KR | KSC5601/cp949 | UCS-2/UTF-16 | UCS-4 | UTF-8 |
|---|---|---|---|---|---|---|---|
| DEC Korean | - | Y | N | Y | Y | Y | Y |
| Korean EUC | Y | - | Y | N | N | N | N |
| ISO-2022-KR | N | Y | - | Y | N | N | N |
| KSC5601/cp949 | Y | N | Y | - | Y | Y | Y |
| UCS-2/UTF-16 | Y | N | N | Y | - | Y | Y |
| UCS-4 | Y | N | N | Y | Y | - | Y |
| UTF-8 | Y | N | N | Y | Y | Y | - |
For example, you can enter the following command to convert a DEC Korean file to a Korean UTF-8 file:
% iconv -f deckorean -t UTF-8 <file>
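On a system without the Tru64 converters, a comparable conversion can be approximated in Python. This sketch assumes that Python's `euc_kr` codec is an acceptable stand-in for deckorean, on the grounds that Korean EUC is described above as currently identical to DEC Korean; the function name is illustrative and is not part of any Tru64 interface.

```python
def deckorean_to_utf8(data: bytes) -> bytes:
    """Transcode DEC Korean bytes to UTF-8 bytes."""
    # Assumption: Python's "euc_kr" codec stands in for deckorean.
    return data.decode("euc_kr").encode("utf-8")


# Two KSC5601 characters followed by an ASCII character.
print(deckorean_to_utf8(b"\xc7\xd1\xb1\xdb!").hex())
```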
Table 2-2 shows the codesets and the strings you use as parameters to the iconv utility.
| Codeset | Parameter String |
|---|---|
| DEC Korean | deckorean |
| Korean EUC | eucKR |
| ISO-2022-KR | ISO-2022-KR, iso-2022-kr |
| Unified Hangul | KSC5601, cp949 |
| Universal Codeset | UCS-2, UCS-4 |
| Universal Transfer Format | UTF-8 |
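Tables 2-1 and 2-2 can be combined into a small helper that checks whether a converter exists before building an iconv command line. This is a sketch under stated assumptions: the helper name `iconv_command` and the dictionary keys are invented for illustration, one parameter string is picked where Table 2-2 lists several, and the pairs are stored unordered because every Y in Table 2-1 has a matching Y in the opposite direction.

```python
from itertools import combinations

# One parameter string per codeset, taken from Table 2-2.
PARAM = {
    "DEC Korean": "deckorean",
    "Korean EUC": "eucKR",
    "ISO-2022-KR": "ISO-2022-KR",
    "KSC5601/cp949": "cp949",
    "UCS-2/UTF-16": "UCS-2",
    "UCS-4": "UCS-4",
    "UTF-8": "UTF-8",
}

# Codesets that Table 2-1 marks as mutually convertible.
_GROUP = ["DEC Korean", "KSC5601/cp949", "UCS-2/UTF-16", "UCS-4", "UTF-8"]

# Every pair marked Y in Table 2-1, stored as unordered pairs.
SUPPORTED = {frozenset(pair) for pair in combinations(_GROUP, 2)} | {
    frozenset(("DEC Korean", "Korean EUC")),
    frozenset(("Korean EUC", "ISO-2022-KR")),
    frozenset(("ISO-2022-KR", "KSC5601/cp949")),
}


def iconv_command(src: str, dst: str, path: str) -> list[str]:
    """Build an iconv argument list, or raise if Table 2-1 has no converter."""
    if frozenset((src, dst)) not in SUPPORTED:
        raise ValueError(f"no converter listed for {src} -> {dst}")
    return ["iconv", "-f", PARAM[src], "-t", PARAM[dst], path]


print(" ".join(iconv_command("DEC Korean", "UTF-8", "file.txt")))
```

Keeping the table as data rather than as nested conditionals makes it easy to compare the helper against Table 2-1 if the set of converters changes.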
The Compaq Tru64 UNIX operating system provides a mechanism by which you can configure your system to run applications with peripherals, such as terminals and printers, that support different codesets. You can specify the codesets for the applications, terminals, and printers independently, as shown in Table 2-3.
| Application Code | Terminal Code | Printer Code |
|---|---|---|
| DEC Korean | DEC Korean | DEC Korean |
| Korean EUC | Korean EUC | Korean EUC |
For details about setting up terminal code and printer code, see Writing Software for the International Market.