digoal
2016-12-05
PostgreSQL , conversion , pg_conversion , 拼音 , 编码转换 , convert , convert_to , convert_from
国内的应用,在文本排序上基本都是按照拼音来进行排序的。
在不同的字符集中,汉字的编码可能不一样,比如UTF8和GBK,其中GBK是按拼音的顺序进行编码的,而UTF8则不是。
所以如果你的数据库使用了UTF8编码,对中文字段进行排序时,可能得到的并不是按拼音排序的结果。
在PostgreSQL中,中文按拼音排序的编码包括GB18030, EUC_CN, GBK, BIG5, EUC_TW 等。
为了得到拼音排序,可以使用编码转换后的值来排序,索引也可以使用编码转换的表达式索引。
PostgreSQL支持的编码如下
https://www.postgresql.org/docs/9.6/static/multibyte.html
PostgreSQL Character Sets
Name| Description| Language| Server?| Bytes/Char| Aliases
---|---|---|---|---|---
BIG5| Big Five| Traditional Chinese| No| 1-2| WIN950, Windows950
EUC_CN| Extended UNIX Code-CN| Simplified Chinese| Yes| 1-3| -
EUC_JP| Extended UNIX Code-JP| Japanese| Yes| 1-3| -
EUC_JIS_2004| Extended UNIX Code-JP, JIS X 0213| Japanese| Yes| 1-3| -
EUC_KR| Extended UNIX Code-KR| Korean| Yes| 1-3| -
EUC_TW| Extended UNIX Code-TW| Traditional Chinese, Taiwanese| Yes| 1-3| -
GB18030| National Standard| Chinese| No| 1-4| -
GBK| Extended National Standard| Simplified Chinese| No| 1-2| WIN936, Windows936
ISO_8859_5| ISO 8859-5, ECMA 113| Latin/Cyrillic| Yes| 1| -
ISO_8859_6| ISO 8859-6, ECMA 114| Latin/Arabic| Yes| 1| -
ISO_8859_7| ISO 8859-7, ECMA 118| Latin/Greek| Yes| 1| -
ISO_8859_8| ISO 8859-8, ECMA 121| Latin/Hebrew| Yes| 1| -
JOHAB| JOHAB| Korean (Hangul)| No| 1-3| -
KOI8R| KOI8-R| Cyrillic (Russian)| Yes| 1| KOI8
KOI8U| KOI8-U| Cyrillic (Ukrainian)| Yes| 1| -
LATIN1| ISO 8859-1, ECMA 94| Western European| Yes| 1| ISO88591
LATIN2| ISO 8859-2, ECMA 94| Central European| Yes| 1| ISO88592
LATIN3| ISO 8859-3, ECMA 94| South European| Yes| 1| ISO88593
LATIN4| ISO 8859-4, ECMA 94| North European| Yes| 1| ISO88594
LATIN5| ISO 8859-9, ECMA 128| Turkish| Yes| 1| ISO88599
LATIN6| ISO 8859-10, ECMA 144| Nordic| Yes| 1| ISO885910
LATIN7| ISO 8859-13| Baltic| Yes| 1| ISO885913
LATIN8| ISO 8859-14| Celtic| Yes| 1| ISO885914
LATIN9| ISO 8859-15| LATIN1 with Euro and accents| Yes| 1| ISO885915
LATIN10| ISO 8859-16, ASRO SR 14111| Romanian| Yes| 1| ISO885916
MULE_INTERNAL| Mule internal code| Multilingual Emacs| Yes| 1-4|
SJIS| Shift JIS| Japanese| No| 1-2| Mskanji, ShiftJIS, WIN932, Windows932
SHIFT_JIS_2004| Shift JIS, JIS X 0213| Japanese| No| 1-2| -
SQL_ASCII| unspecified (see text)| any| Yes| 1| -
UHC| Unified Hangul Code| Korean| No| 1-2| WIN949, Windows949
UTF8| Unicode, 8-bit| all| Yes| 1-4| Unicode
WIN866| Windows CP866| Cyrillic| Yes| 1| ALT
WIN874| Windows CP874| Thai| Yes| 1| -
WIN1250| Windows CP1250| Central European| Yes| 1| -
WIN1251| Windows CP1251| Cyrillic| Yes| 1| WIN
WIN1252| Windows CP1252| Western European| Yes| 1| -
WIN1253| Windows CP1253| Greek| Yes| 1| -
WIN1254| Windows CP1254| Turkish| Yes| 1| -
WIN1255| Windows CP1255| Hebrew| Yes| 1| -
WIN1256| Windows CP1256| Arabic| Yes| 1| -
WIN1257| Windows CP1257| Baltic| Yes| 1| -
WIN1258| Windows CP1258| Vietnamese| Yes| 1| ABC, TCVN, TCVN5712, VSCII
与中文编码排序相关的包括 GB18030, EUC_CN, GBK, BIG5, EUC_TW
简体常用的包括GBK, EUC_CN。
在PostgreSQL中,如果要将字符从一个编码转换为另一个编码,需要告诉数据库(create conversion)怎么转换(使用什么C函数),PG内置了一些转换的C函数和转换方法。
https://www.postgresql.org/docs/9.6/static/catalog-pg-conversion.html
pg_conversion
Name| Type| References| Description
---|---|---|---
oid| oid| - | Row identifier (hidden attribute; must be explicitly selected)
conname| name| - | Conversion name (unique within a namespace)
connamespace| oid| pg_namespace.oid| The OID of the namespace that contains this conversion
conowner| oid| pg_authid.oid| Owner of the conversion
conforencoding| int4| - | Source encoding ID
contoencoding| int4| - | Destination encoding ID
conproc| regproc| pg_proc.oid| Conversion procedure
condefault| bool| - | True if this is the default conversion
查看内置的转换方法
可以看到utf8转换为中文编码的都支持了
postgres=> select * from pg_conversion where conname ~* 'gbk|gb18|euc_cn|euc_tw|big5' order by 1;
conname | connamespace | conowner | conforencoding | contoencoding | conproc | condefault
-----------------+--------------+----------+----------------+---------------+-----------------+------------
big5_to_euc_tw | 11 | 10 | 36 | 4 | big5_to_euc_tw | t
big5_to_mic | 11 | 10 | 36 | 7 | big5_to_mic | t
big5_to_utf8 | 11 | 10 | 36 | 6 | big5_to_utf8 | t
euc_cn_to_mic | 11 | 10 | 2 | 7 | euc_cn_to_mic | t
euc_cn_to_utf8 | 11 | 10 | 2 | 6 | euc_cn_to_utf8 | t
euc_tw_to_big5 | 11 | 10 | 4 | 36 | euc_tw_to_big5 | t
euc_tw_to_mic | 11 | 10 | 4 | 7 | euc_tw_to_mic | t
euc_tw_to_utf8 | 11 | 10 | 4 | 6 | euc_tw_to_utf8 | t
gb18030_to_utf8 | 11 | 10 | 39 | 6 | gb18030_to_utf8 | t
gbk_to_utf8 | 11 | 10 | 37 | 6 | gbk_to_utf8 | t
mic_to_big5 | 11 | 10 | 7 | 36 | mic_to_big5 | t
mic_to_euc_cn | 11 | 10 | 7 | 2 | mic_to_euc_cn | t
mic_to_euc_tw | 11 | 10 | 7 | 4 | mic_to_euc_tw | t
utf8_to_big5 | 11 | 10 | 6 | 36 | utf8_to_big5 | t
utf8_to_euc_cn | 11 | 10 | 6 | 2 | utf8_to_euc_cn | t
utf8_to_euc_tw | 11 | 10 | 6 | 4 | utf8_to_euc_tw | t
utf8_to_gb18030 | 11 | 10 | 6 | 39 | utf8_to_gb18030 | t
utf8_to_gbk | 11 | 10 | 6 | 37 | utf8_to_gbk | t
(18 rows)
注意数据库版本
PostgreSQL 8.x(如Greenplum), 将字符串从原编码转换为指定编码的字符串返回。
可能存在显示的问题。
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+---------------+------------------+---------------------+--------
pg_catalog | convert | text | text, name | normal
pg_catalog | convert | text | text, name, name | normal
PostgreSQL 9.x, 将源编码字符串的字节流转换为指定编码的字符串的字节流返回。
避免了显示的问题。
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+--------------+------------------+---------------------+--------
pg_catalog | convert | bytea | bytea, name, name | normal
pg_catalog | convert_from | text | bytea, name | normal
pg_catalog | convert_to | bytea | text, name | normal
如果8.x需要避免显示问题,返回字节流,可以这样使用,推荐使用。
byteain(textout(convert(字符,'源编码','目标编码')))
当前数据库编码为UTF-8,中文排序未按拼音排序。
```
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+---------+-------+-----------------------
db0 | postgres | UTF8 | C | C |
postgres | postgres | UTF8 | C | C |
template0 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | C | C | =c/postgres +
| | | | | postgres=CTc/postgres
(4 rows)
postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by id;
id
刘少奇
刘德华
(2 rows)
```
按拼音排序方法(目标可以改成EUC_CN)
```
8.x
postgres=> select * from (values ('刘德华'), ('刘少奇')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK')));
id
刘德华
刘少奇
(2 rows)
9.x
postgres=# select * from (values ('刘德华'), ('刘少奇')) t(id) order by convert(id::bytea,'UTF-8','GBK');
id
刘德华
刘少奇
(2 rows)
```
中文有一些多音字,比如重庆(chongqing), 但是编码时它可能是按zhong编码的,所以看这个例子。
```
postgres=> select * from (values ('中山'), ('重庆')) t(id) order by byteain(textout(convert(id,'UTF-8','GBK')));
id
中山
重庆
(2 rows)
```
表达式索引即可,使用immutable function.
```
8.x
postgres=> \df+ convert
List of functions
Schema | Name | Result data type | Argument data types | Type | Data access | Volatility | Owner | Language | Source code | Description
------------+---------+------------------+---------------------+--------+-------------+------------+-----------+----------+-------------+---------------------------------------------------------
pg_catalog | convert | text | text, name | normal | no sql | stable | xxx | internal | pg_convert | convert string with specified destination encoding name
pg_catalog | convert | text | text, name, name | normal | no sql | stable | xxx | internal | pg_convert2 | convert string with specified encoding names
(2 rows)
9.x
postgres=# \df+ convert
List of functions
Schema | Name | Result data type | Argument data types | Type | Volatility | Parallel | Owner | Security | Access privileges | Language | Source code | Description
------------+---------+------------------+---------------------+--------+------------+----------+----------+----------+-------------------+----------+-------------+----------------------------------------------
pg_catalog | convert | bytea | bytea, name, name | normal | stable | safe | postgres | invoker | | internal | pg_convert | convert string with specified encoding names
(1 row)
```
src/backend/utils/mb/mbutils.c
```
/
* Convert string between two arbitrary encodings.
*
* BYTEA convert(BYTEA string, NAME src_encoding_name, NAME dest_encoding_name)
/
Datum
pg_convert(PG_FUNCTION_ARGS)
{
bytea string = PG_GETARG_BYTEA_PP(0);
char src_encoding_name = NameStr(PG_GETARG_NAME(1));
int src_encoding = pg_char_to_encoding(src_encoding_name);
char dest_encoding_name = NameStr(PG_GETARG_NAME(2));
int dest_encoding = pg_char_to_encoding(dest_encoding_name);
const char src_str;
char dest_str;
bytea retval;
int len;
if (src_encoding <0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid source encoding name \"%s\"",
src_encoding_name)));
if (dest_encoding <0)
ereport(ERROR,
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("invalid destination encoding name \"%s\"",
dest_encoding_name)));
/* make sure that source string is valid */
len = VARSIZE_ANY_EXHDR(string);
src_str = VARDATA_ANY(string);
pg_verify_mbstr_len(src_encoding, src_str, len, false);
/* perform conversion */
dest_str = (char *) pg_do_encoding_conversion((unsigned char *) src_str,
len,
src_encoding,
dest_encoding);
/* update len if conversion actually happened */
if (dest_str != src_str)
len = strlen(dest_str);
/*
* build bytea data type structure.
*/
retval = (bytea *) palloc(len + VARHDRSZ);
SET_VARSIZE(retval, len + VARHDRSZ);
memcpy(VARDATA(retval), dest_str, len);
if (dest_str != src_str)
pfree(dest_str);
/* free memory if allocated by the toaster */
PG_FREE_IF_COPY(string, 0);
PG_RETURN_BYTEA_P(retval);
}
```
使用binary存储格式排序,只能通过编码来修正排序顺序。
除此之外,我们还可以在不改编码的情况下,使用字段或者order by的collate语法来修正排序顺序。
例子
```
select * from pg_collation ;
设置列级collate
create table a (c1 text collate "zh_CN.utf8");
修改列collate,会导致rewrite table
alter table a alter c1 type text COLLATE "zh_CN.utf8";
设置排序级collate
test=# select * from a order by c1 collate "C";
c1
刘少奇
刘德华
(2 rows)
test=# select * from a order by c1 collate "zh_CN.utf8";
c1
刘德华
刘少奇
(2 rows)
设置operator collate
test=# select * from a where c1 > '刘少奇' collate "C";
c1
刘德华
(1 row)
test=# select * from a where c1 > '刘少奇' collate "zh_CN.utf8";
c1
(0 rows)
设置库级collate
postgres=# create database test with template template0 encoding 'UTF8' lc_collate 'zh_CN.utf8';
postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# create table a (c1 text);
CREATE TABLE
test=# insert into a values ('刘德华'),('刘少奇');
INSERT 0 2
test=# select * from a order by c1;
c1
刘德华
刘少奇
(2 rows)
注意索引和创建索引时的collate必须一致,才能使用该索引
postgres=# create table a(c1 text);
CREATE TABLE
postgres=# create index idxa on a(c1 collate "zh_CN.utf8");
CREATE INDEX
postgres=# explain select * from a order by c1;
QUERY PLAN
Sort (cost=10000000094.38..10000000097.78 rows=1360 zh_CN.utf8";
QUERY PLAN
Index Only Scan using idxa on a (cost=0.15..31.55 rows=1360 s github - 公益是一辈子的事.