
feat: count chars instead of bytes in str_width

While `str_width` doesn’t attempt to handle all Unicode characters, it can handle some characters by counting `char`s instead of bytes.

This allows it to handle characters like “æøå” (Danish), “äöü” (German), and other 1-column characters with a precomposed representation in Unicode. These characters take up multiple bytes in the UTF-8 encoding, but they still decode to a single `char`. Wikipedia has a handy list:

https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode
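As a rough illustration of the difference, the sketch below compares the byte length of a string with its `char` count. The helper name `char_width` is hypothetical and is not taken from clap; the characters are written as escapes so that the precomposed form is guaranteed.

```rust
/// Hypothetical helper: estimate width as the number of `char`s
/// (Unicode scalar values) rather than the number of UTF-8 bytes.
fn char_width(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // "æøå" in precomposed form: each character is 2 bytes in UTF-8.
    let precomposed = "\u{e6}\u{f8}\u{e5}";
    println!(
        "precomposed: {} bytes, {} chars",
        precomposed.len(),
        char_width(precomposed)
    );

    // "å" in decomposed form: 'a' followed by U+030A COMBINING RING ABOVE.
    // It occupies one column but decodes to two `char`s.
    let decomposed = "a\u{30a}";
    println!(
        "decomposed: {} bytes, {} chars",
        decomposed.len(),
        char_width(decomposed)
    );
}
```

The decomposed case shows why the approach is limited to precomposed characters: a combining mark is counted as its own `char` even though it takes no extra column.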

Emojis are also encoded as multiple bytes in UTF-8. With this change, they will also be counted as having a width of 1 column, whereas they’re often displayed using 2 columns.

The existing code can thus be said to take a conservative approach since it will only ever over-estimate the width of a string.
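For an emoji, the two approaches err in opposite directions. A minimal sketch, using the hypothetical helpers `byte_width` and `char_width` (neither name comes from clap):

```rust
/// Hypothetical helper: width estimated as the UTF-8 byte length.
fn byte_width(s: &str) -> usize {
    s.len()
}

/// Hypothetical helper: width estimated as the number of `char`s.
fn char_width(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    // U+1F600 GRINNING FACE, often rendered in 2 terminal columns.
    let emoji = "\u{1F600}";
    // Byte counting over-estimates the width.
    println!("bytes: {}", byte_width(emoji));
    // Char counting under-estimates it.
    println!("chars: {}", char_width(emoji));
}
```

An over-estimate only makes wrapped lines shorter than necessary, whereas an under-estimate can let a line run past the intended width.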

This question comes from the open-source project clap-rs/clap.

Okay, I've played around with this and it is doable to pick a cut-off as suggested above. This is now implemented in textwrap as the `textwrap::core::display_width` function. With such a simple function, Latin-1 plus emojis seem to be handled quite well. I am less sure about the coverage for East Asian languages like Chinese or Japanese (but they were also not the target of this hack).
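The cut-off idea can be sketched roughly as follows. This is not textwrap's actual code: the function names and the cut-off value U+1100 are assumptions chosen for illustration, and the real `display_width` may differ in its details.

```rust
/// Assumed cut-off for this sketch: characters below it count as
/// 1 column, characters at or above it count as 2 columns.
const DOUBLE_WIDTH_CUTOFF: char = '\u{1100}';

/// Hypothetical per-character width based only on the cut-off.
fn char_columns(c: char) -> usize {
    if c < DOUBLE_WIDTH_CUTOFF {
        1
    } else {
        2
    }
}

/// Hypothetical approximation of the displayed width of a string.
fn approx_display_width(s: &str) -> usize {
    s.chars().map(char_columns).sum()
}

fn main() {
    // Latin-1 letters, an emoji, and three CJK ideographs.
    for s in ["\u{e6}\u{f8}\u{e5}", "\u{1F600}", "\u{65e5}\u{672c}\u{8a9e}"] {
        println!("{:?} -> {} columns", s, approx_display_width(s));
    }
}
```

A single cut-off like this treats every character above it as double-width and every character below it as single-width, so zero-width combining marks and narrow characters in higher ranges are still mis-measured.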

I will make a new release of textwrap soon (1-2 weeks perhaps) and, if there is interest, I could update this PR to use the `display_width` function. This would remove the direct dependency on unicode-width from clap and, so to speak, export this responsibility to textwrap.

