PHP trim source code analysis

This article is also published at https://github.com/zhangyachen/zhangyachen.github.io/issues/9

The core code is as follows:

/* {{{ php_trim()
 * mode 1 : trim left
 * mode 2 : trim right
 * mode 3 : trim left and right
 * what indicates which chars are to be trimmed. NULL->default (' \t\n\r\v\0')
 */
PHPAPI char *php_trim(char *c, int len, char *what, int what_len, zval *return_value, int mode TSRMLS_DC)
{
    register int i;
    int trimmed = 0;
    char mask[256];

    if (what) {
        php_charmask((unsigned char*)what, what_len, mask TSRMLS_CC);
    } else {
        php_charmask((unsigned char*)" \n\r\t\v\0", 6, mask TSRMLS_CC);
    }
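    // trim from the left
    if (mode & 1) {
        for (i = 0; i < len; i++) {
            if (mask[(unsigned char)c[i]]) {
                trimmed++;
            } else {
                break;
            }
        }
        len -= trimmed;
        c += trimmed;
    }
    // trim from the right
    if (mode & 2) {
        for (i = len - 1; i >= 0; i--) {
            if (mask[(unsigned char)c[i]]) {
                len--;
            } else {
                break;
            }
        }
    }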

    if (return_value) {
        // return the len bytes starting at the position c now points to
        RETVAL_STRINGL(c, len, 1);
    } else {
        return estrndup(c, len);
    }
    return "";
}

As the code shows, php_trim calls php_charmask internally:

/* {{{ php_charmask
 * Fills a 256-byte bytemask with input. You can specify a range like 'a..z',
 * it needs to be incrementing.
 * Returns: FAILURE/SUCCESS whether the input was correct (i.e. no range errors)
 */
static inline int php_charmask(unsigned char *input, int len, char *mask TSRMLS_DC)
{
    unsigned char *end;
    unsigned char c;
    int result = SUCCESS;

    memset(mask, 0, 256);      // zero the 256-entry lookup table
    for (end = input+len; input < end; input++) {
        c = *input;
        if ((input+3 < end) && input[1] == '.' && input[2] == '.'
                && input[3] >= c) {
            memset(mask+c, 1, input[3] - c + 1);
            input+=3;
        } else if ((input+1 < end) && input[0] == '.' && input[1] == '.') {
            if (end-len >= input) { /* there was no 'left' char */
                php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range, no character to the left of '..'");
                result = FAILURE;
                continue;
            }
            if (input+2 >= end) { /* there is no 'right' char */
                php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range, no character to the right of '..'");
                result = FAILURE;
                continue;
            }
            if (input[-1] > input[2]) { /* wrong order */
                php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range, '..'-range needs to be incrementing");
                result = FAILURE;
                continue;
            }
            /* FIXME: better error (a..b..c is the only left possibility?) */
            php_error_docref(NULL TSRMLS_CC, E_WARNING, "Invalid '..'-range");
            result = FAILURE;
            continue;
        } else {
            // set the entry for this byte value to 1
            mask[c]=1;
        }
    }
    return result;
}
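
To look at the range syntax in isolation, here is a minimal standalone sketch of the mask-building step in plain C. The name build_mask is hypothetical and not part of PHP. It follows the successful path of php_charmask only: as a simplifying assumption, anything that is not a well-formed ascending range is treated as literal bytes, and no warning is raised.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not PHP's API: fill a 256-byte mask from a
 * character list, expanding "a..z" style ranges. Unlike php_charmask,
 * a malformed ".." is simply treated as literal bytes. */
static void build_mask(const unsigned char *input, size_t len, char mask[256])
{
    const unsigned char *end = input + len;

    memset(mask, 0, 256);
    for (; input < end; input++) {
        unsigned char c = *input;

        if (end - input > 3 && input[1] == '.' && input[2] == '.'
                && input[3] >= c) {
            /* "c..d": mark every byte value from c through d */
            memset(mask + c, 1, (size_t)(input[3] - c) + 1);
            input += 3;
        } else {
            /* single byte */
            mask[c] = 1;
        }
    }
}
```

With the input "a..c!" this marks the entries for 'a', 'b', 'c' and '!', and leaves '.' unmarked because the two dots are consumed as range syntax.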

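From this, the logic of trim is:
1 Declare a 256-entry lookup table (a plain char array indexed by byte value, used like a hash table).
2 For each byte in character_mask, set the table entry at that byte's value (0-255) to 1; a range such as 'a..z' sets every entry in the range.
3 Scan str from the start, byte by byte: if the byte's entry in the table is set, move the start forward and reduce the length by 1; stop at the first byte whose entry is not set.
4 Scan str from the end, byte by byte, with the same logic as step 3.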
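
The four steps above can be sketched as a small standalone C function. This is an illustration under stated assumptions rather than PHP's implementation: trim_bytes is a hypothetical name, it does not support '..' ranges, and it returns a pointer and length into the original buffer instead of allocating a copy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer into s at the first kept byte
 * and stores the kept length in *out_len. Nothing is copied. */
static const char *trim_bytes(const char *s, size_t len,
                              const char *what, size_t what_len,
                              size_t *out_len)
{
    char mask[256];
    size_t i;

    /* Steps 1-2: build the lookup table, one entry per byte value */
    memset(mask, 0, sizeof mask);
    for (i = 0; i < what_len; i++) {
        mask[(unsigned char)what[i]] = 1;
    }

    /* Step 3: skip leading bytes whose entry is set */
    while (len > 0 && mask[(unsigned char)*s]) {
        s++;
        len--;
    }

    /* Step 4: drop trailing bytes whose entry is set */
    while (len > 0 && mask[(unsigned char)s[len - 1]]) {
        len--;
    }

    *out_len = len;
    return s;
}
```

For example, trimming "xxhixyx" with the mask "xy" leaves the two bytes "hi". Note that every comparison is on single bytes; the function has no notion of multi-byte characters.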

Case study:
trim("广东省","省") produces garbled output

First, get the hexadecimal representation of "广东省" (Guangdong Province):


$str = "广东省";

printf("str[%s] hex_str[%s]\n", $str, get_hex_str($str));
$str = trim($str, "省");

printf("str[%s] hex_str[%s]\n", $str, get_hex_str($str));
Output:

str[广东省] hex_str[\xE5\xB9\xBF\xE4\xB8\x9C\xE7\x9C\x81]

str[广�] hex_str[\xE5\xB9\xBF\xE4\xB8]
In UTF-8, each of these Chinese characters takes three bytes: "东" is encoded as e4 b8 9c, and "省" as e7 9c 81.
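
The PHP snippet relies on get_hex_str, whose definition the article does not show. As a rough stand-in, the C sketch below produces the same \xHH notation; hex_str is a hypothetical name, and the caller is assumed to supply an output buffer of at least 4*len + 1 bytes.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the article's get_hex_str(): writes each byte
 * of s as \xHH into out, which must hold at least 4*len + 1 bytes. */
static void hex_str(const char *s, size_t len, char *out)
{
    size_t i;

    for (i = 0; i < len; i++) {
        /* each byte becomes exactly four characters: '\', 'x', two hex digits */
        sprintf(out + 4 * i, "\\x%02X", (unsigned int)(unsigned char)s[i]);
    }
    out[4 * len] = '\0';
}
```

Passing the three bytes of "省" yields the text \xE7\x9C\x81, matching the tail of the first output line above.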

trim("广东省", "省"); does not treat the Chinese characters we see as units; it works byte by byte.

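This is equivalent to removing, from the start and the end of e5 b9 bf e4 b8 9c e7 9c 81, every byte that appears in e7 9c 81. The third byte of "东" (9c) is in that set, so it is cut off as well, which produces the output above.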
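
The byte count can be checked with a short C sketch. The helper trailing_trimmed is hypothetical: it only counts how many trailing bytes a right trim would remove, using memchr in place of the 256-entry table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of trailing bytes of s that a right trim
 * would remove when every byte of what counts as trimmable. */
static size_t trailing_trimmed(const char *s, size_t len,
                               const char *what, size_t what_len)
{
    size_t removed = 0;

    while (removed < len &&
           memchr(what, (unsigned char)s[len - 1 - removed], what_len) != NULL) {
        removed++;
    }
    return removed;
}
```

Called on the nine bytes e5 b9 bf e4 b8 9c e7 9c 81 with the mask bytes e7 9c 81, it returns 4, not 3: the three bytes of "省" plus the final 9c of "东". The remaining e4 b8 is an incomplete UTF-8 sequence.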
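To remove specific characters from a Chinese string, str_replace is recommended instead. Note that str_replace removes every occurrence, not only those at the ends.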
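
To show why matching the whole byte sequence avoids the problem, here is a minimal C sketch of the same idea as str_replace($needle, "", $str). The name remove_all is hypothetical, both inputs are assumed to be NUL-terminated, and dst is assumed to hold at least strlen(src) + 1 bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, skipping every occurrence of
 * needle. dst must hold at least strlen(src) + 1 bytes. */
static void remove_all(const char *src, const char *needle, char *dst)
{
    size_t nlen = strlen(needle);
    const char *hit;

    if (nlen == 0) {            /* nothing to remove */
        strcpy(dst, src);
        return;
    }
    while ((hit = strstr(src, needle)) != NULL) {
        size_t keep = (size_t)(hit - src);

        memcpy(dst, src, keep); /* copy the part before the match */
        dst += keep;
        src = hit + nlen;       /* skip the matched sequence as a whole */
    }
    strcpy(dst, src);           /* copy the remainder, including the NUL */
}
```

Removing e7 9c 81 from e5 b9 bf e4 b8 9c e7 9c 81 this way leaves e5 b9 bf e4 b8 9c, that is "广东" intact: the lone 9c inside "东" is not followed by 81 and preceded by e7, so it never matches the full three-byte sequence.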