PostgreSQL: Documentation: 9.4: Character Set Support

Date: 2015-07-03 07:29 | Source: compiled from the web | Author: KKWL
22.3. Character Set Support

The character set support in PostgreSQL allows you to store text in a variety of character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multiple-byte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. All supported character sets can be used transparently by clients, but a few are not supported for use within the server (that is, as a server-side encoding). The default character set is selected while initializing your PostgreSQL database cluster using initdb. It can be overridden when you create a database, so you can have multiple databases each with a different character set.

An important restriction, however, is that each database's character set must be compatible with the database's LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings. For C or POSIX locale, any character set is allowed, but for other locales there is only one character set that will work correctly. (On Windows, however, UTF-8 encoding can be used with any locale.)
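The compatibility rule above can be modeled as a small check. This is a minimal sketch that covers only the rule as stated in this paragraph; the function name `encoding_allowed`, the `on_windows` flag, and the partial `CODESET_TO_PG` table are assumptions made for illustration, and the server's actual check may handle further cases.

```python
# Partial, illustrative mapping from a locale's codeset (hyphens and
# underscores removed, upper-cased) to a PostgreSQL encoding name.
# This table is an assumption of the sketch, not PostgreSQL's own list.
CODESET_TO_PG = {
    "UTF8": "UTF8",
    "ISO88591": "LATIN1",
    "ISO885915": "LATIN9",
    "EUCJP": "EUC_JP",
    "KOI8R": "KOI8R",
}


def encoding_allowed(locale_name, pg_encoding, on_windows=False):
    """Model the rule stated above; hypothetical helper, not a server API."""
    # The C and POSIX locales accept any character set.
    if locale_name in ("C", "POSIX"):
        return True
    # On Windows, UTF-8 can be combined with any locale.
    if on_windows and pg_encoding == "UTF8":
        return True
    # Locale names look like language_TERRITORY.codeset@modifier.
    codeset = locale_name.partition(".")[2].partition("@")[0]
    key = codeset.replace("-", "").replace("_", "").upper()
    # Any other locale works with exactly one character set.
    return CODESET_TO_PG.get(key) == pg_encoding


print(encoding_allowed("C", "EUC_JP"))
print(encoding_allowed("ja_JP.eucJP", "EUC_JP"))
print(encoding_allowed("de_DE.ISO-8859-1", "UTF8"))
```

A locale whose codeset is missing from the table is reported as incompatible, which is a limitation of the sketch rather than a statement about PostgreSQL.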

22.3.1. Supported Character Sets

Table 22-1 shows the character sets available for use in PostgreSQL.

Table 22-1. PostgreSQL Character Sets

Name | Description | Language | Server? | Bytes/Char | Aliases
--- | --- | --- | --- | --- | ---
BIG5 | Big Five | Traditional Chinese | No | 1-2 | WIN950, Windows950
EUC_CN | Extended UNIX Code-CN | Simplified Chinese | Yes | 1-3 |
EUC_JP | Extended UNIX Code-JP | Japanese | Yes | 1-3 |
EUC_JIS_2004 | Extended UNIX Code-JP, JIS X 0213 | Japanese | Yes | 1-3 |
EUC_KR | Extended UNIX Code-KR | Korean | Yes | 1-3 |
EUC_TW | Extended UNIX Code-TW | Traditional Chinese, Taiwanese | Yes | 1-3 |
GB18030 | National Standard | Chinese | No | 1-4 |
GBK | Extended National Standard | Simplified Chinese | No | 1-2 | WIN936, Windows936
ISO_8859_5 | ISO 8859-5, ECMA 113 | Latin/Cyrillic | Yes | 1 |
ISO_8859_6 | ISO 8859-6, ECMA 114 | Latin/Arabic | Yes | 1 |
ISO_8859_7 | ISO 8859-7, ECMA 118 | Latin/Greek | Yes | 1 |
ISO_8859_8 | ISO 8859-8, ECMA 121 | Latin/Hebrew | Yes | 1 |
JOHAB | JOHAB | Korean (Hangul) | No | 1-3 |
KOI8R | KOI8-R | Cyrillic (Russian) | Yes | 1 | KOI8
KOI8U | KOI8-U | Cyrillic (Ukrainian) | Yes | 1 |
LATIN1 | ISO 8859-1, ECMA 94 | Western European | Yes | 1 | ISO88591
LATIN2 | ISO 8859-2, ECMA 94 | Central European | Yes | 1 | ISO88592
LATIN3 | ISO 8859-3, ECMA 94 | South European | Yes | 1 | ISO88593
LATIN4 | ISO 8859-4, ECMA 94 | North European | Yes | 1 | ISO88594
LATIN5 | ISO 8859-9, ECMA 128 | Turkish | Yes | 1 | ISO88599
LATIN6 | ISO 8859-10, ECMA 144 | Nordic | Yes | 1 | ISO885910
LATIN7 | ISO 8859-13 | Baltic | Yes | 1 | ISO885913
LATIN8 | ISO 8859-14 | Celtic | Yes | 1 | ISO885914
LATIN9 | ISO 8859-15 | LATIN1 with Euro and accents | Yes | 1 | ISO885915
LATIN10 | ISO 8859-16, ASRO SR 14111 | Romanian | Yes | 1 | ISO885916
MULE_INTERNAL | Mule internal code | Multilingual Emacs | Yes | 1-4 |
SJIS | Shift JIS | Japanese | No | 1-2 | Mskanji, ShiftJIS, WIN932, Windows932
SHIFT_JIS_2004 | Shift JIS, JIS X 0213 | Japanese | No | 1-2 |
SQL_ASCII | unspecified (see text) | any | Yes | 1 |
UHC | Unified Hangul Code | Korean | No | 1-2 | WIN949, Windows949
UTF8 | Unicode, 8-bit | all | Yes | 1-4 | Unicode
WIN866 | Windows CP866 | Cyrillic | Yes | 1 | ALT
WIN874 | Windows CP874 | Thai | Yes | 1 |
WIN1250 | Windows CP1250 | Central European | Yes | 1 |
WIN1251 | Windows CP1251 | Cyrillic | Yes | 1 | WIN
WIN1252 | Windows CP1252 | Western European | Yes | 1 |
WIN1253 | Windows CP1253 | Greek | Yes | 1 |
WIN1254 | Windows CP1254 | Turkish | Yes | 1 |
WIN1255 | Windows CP1255 | Hebrew | Yes | 1 |
WIN1256 | Windows CP1256 | Arabic | Yes | 1 |
WIN1257 | Windows CP1257 | Baltic | Yes | 1 |
WIN1258 | Windows CP1258 | Vietnamese | Yes | 1 | ABC, TCVN, TCVN5712, VSCII
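The Bytes/Char column of Table 22-1 can be explored outside the server with Python's standard codecs. This is only an illustrative sketch: the pairing of PostgreSQL encoding names with Python codec names in `PG_TO_PYTHON` is an assumption, the helper `byte_length` is hypothetical, and Python's codecs are not guaranteed to match PostgreSQL's conversion tables for every character.

```python
# PostgreSQL encoding name -> Python codec name. The pairing is an
# assumption of this sketch; the codecs themselves ship with CPython.
PG_TO_PYTHON = {
    "LATIN1": "latin-1",
    "KOI8R": "koi8_r",
    "EUC_JP": "euc_jp",
    "SJIS": "shift_jis",
    "UTF8": "utf-8",
}


def byte_length(char, pg_encoding):
    """Return how many bytes one character occupies in the encoding,
    or None if the character cannot be represented in it."""
    try:
        return len(char.encode(PG_TO_PYTHON[pg_encoding]))
    except UnicodeEncodeError:
        return None


# Latin capital A, Latin small e with acute, and the CJK ideograph U+65E5.
for ch in ("A", "\u00e9", "\u65e5"):
    sizes = {name: byte_length(ch, name) for name in PG_TO_PYTHON}
    print(f"U+{ord(ch):04X}", sizes)
```

Single-byte character sets such as LATIN1 either store a character in one byte or cannot represent it at all, while the multiple-byte sets use a variable number of bytes per character.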

Not all client APIs support all the listed character sets. For example, the PostgreSQL JDBC driver does not support MULE_INTERNAL, LATIN6, LATIN8, and LATIN10.

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting because PostgreSQL will be unable to help you by converting or validating non-ASCII characters.
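To see why an undeclared encoding leaves the server unable to help, consider the following sketch. It does not call PostgreSQL; it only mimics the byte-level view described above using Python's standard codecs, and both helper names (`classify_sql_ascii`, `is_valid_in`) are hypothetical.

```python
def classify_sql_ascii(data):
    """Split bytes the way the paragraph describes: values 0-127 are
    read as ASCII, values 128-255 are carried along uninterpreted."""
    return [chr(b) if b < 128 else f"<0x{b:02X}>" for b in data]


def is_valid_in(data, codec):
    """Check whether the bytes are well-formed in a declared encoding."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# The same word ending in e-acute, as a LATIN1 client and as a UTF-8
# client would send it.
latin1_bytes = b"caf\xe9"
utf8_bytes = b"caf\xc3\xa9"

# Without a declared encoding, the high byte is just an opaque value.
print(classify_sql_ascii(latin1_bytes))

# With a declared encoding, malformed input can be detected.
print(is_valid_in(latin1_bytes, "utf-8"))

# Reading UTF-8 bytes as LATIN1 succeeds but yields different characters,
# so mixed client encodings are silently stored as incompatible data.
print(ascii(utf8_bytes.decode("latin-1")))
```

Under a declared encoding the invalid sequence can be rejected or converted; under SQL_ASCII both byte strings would be accepted as they are, and nothing records which encoding each one used.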

22.3.2. Setting the Character Set

initdb defines the default character set (encoding) for a PostgreSQL cluster. For example,

initdb -E EUC_JP