Author: 1098502132_027279 | Source: Internet | 2023-06-25 01:17
While this change doesn’t attempt to handle all Unicode characters, it can handle some characters by counting `char`s instead of bytes.
This allows it to handle characters like “æøå” (Danish), “äöü” (German), and other 1-column characters with a precomposed representation in Unicode. These characters take up multiple bytes in the UTF-8 encoding, but they still decode to a single `char`. Wikipedia of course has a handy list:
https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode
Emojis are also encoded as multiple bytes in UTF-8. With this change, they will also be counted as having a width of 1 column, whereas they’re often displayed using 2 columns.
The existing code can thus be said to take a conservative approach since it will only ever over-estimate the width of a string.
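To make the difference concrete, here is a minimal sketch comparing the two ways of estimating width. The function names `width_in_bytes` and `width_in_chars` are made up for illustration and are not taken from clap or textwrap.

```rust
/// Width estimate that counts UTF-8 bytes.
/// This over-estimates the width of any non-ASCII text.
fn width_in_bytes(s: &str) -> usize {
    s.len()
}

/// Width estimate that counts `char`s (Unicode scalar values).
/// Precomposed characters such as U+00E6 (æ) count as one column,
/// but a base letter followed by a combining mark counts as two.
fn width_in_chars(s: &str) -> usize {
    s.chars().count()
}
```

For example, `"\u{e6}\u{f8}\u{e5}"` (æøå) is 6 bytes but 3 `char`s, and the emoji `"\u{1F600}"` is 4 bytes but 1 `char`, even though terminals often draw it 2 columns wide. A decomposed "ä" written as `"a\u{308}"` is 2 `char`s, which is why the approach only works for precomposed characters.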
This question comes from the open-source project: clap-rs/clap
Okay, I've played around with this and it is doable to pick a cut-off as suggested above. This is now implemented in textwrap as the `textwrap::core::display_width` function. With such a simple function, Latin-1 plus emojis seem to be handled quite well. I am less sure about the coverage for East Asian languages like Chinese or Japanese (but they were also not the target of this hack).
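The cut-off idea can be sketched roughly as follows. This is an illustration only, not textwrap's actual code: the names `DOUBLE_WIDTH_CUTOFF`, `char_width`, and `approx_display_width` and the cut-off value U+1100 are assumptions chosen for the example.

```rust
/// Hypothetical cut-off: every character below this code point is
/// assumed to occupy one column, everything at or above it two.
const DOUBLE_WIDTH_CUTOFF: char = '\u{1100}';

/// Approximate column width of a single character.
fn char_width(ch: char) -> usize {
    if ch < DOUBLE_WIDTH_CUTOFF {
        1
    } else {
        2
    }
}

/// Approximate display width of a string: the sum of the
/// per-character estimates.
fn approx_display_width(s: &str) -> usize {
    s.chars().map(char_width).sum()
}
```

With this sketch, `"Caf\u{e9}"` (Café) comes out as 4 columns and `"\u{1F600}"` as 2. Anything above the cut-off is treated as wide, so CJK text such as `"日本"` is counted as 4 columns, but narrow characters above the cut-off would be over-estimated and combining marks below it are still counted as 1 column each.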
I will make a new release of textwrap soon (1-2 weeks perhaps) and if there is interest, then I could update this PR to use the `textwrap::core::display_width` function. This would remove the direct dependency on unicode-width from clap and, so to speak, export this responsibility to textwrap.