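I want to read an .html file as raw text and replace every instance of a substring containing Unicode characters with another substring. Suppose the file mm03.html contains only one line of text:

«test»

I want to read mm03.html, parse its raw text as a string, and then call replace so that «test» in the output is replaced by TEST.

On my first attempt, I wrote the following code:

# -*- coding: utf-8 -*-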
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()
...expecting it to print the original line first and then the replaced line. Instead, it printed the original line twice.
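One possible reason the replace is a silent no-op can be sketched by comparing bytes. The sketch below assumes (it is not confirmed by the question) that mm03.html is stored in a single-byte encoding such as cp1252, which would be consistent with the byte 0xab reported in the error messages, while the script itself is saved as UTF-8. The variable names are illustrative only.

```python
# Sketch: compare the bytes of the search string with the bytes in the file.
# ASSUMPTION: the file uses a single-byte encoding such as cp1252, where the
# left guillemet is the single byte 0xAB; the script source is UTF-8.

needle_text = u"\u00abtest\u00bb"           # the search string as Unicode text

needle_utf8 = needle_text.encode("utf-8")   # bytes a UTF-8 source file stores for the literal
file_bytes = needle_text.encode("cp1252")   # bytes the file would hold under the assumption

print(repr(needle_utf8))
print(repr(file_bytes))

# A byte-level replace searches for the UTF-8 sequence, which never occurs
# in the file data, so the data comes back unchanged.
result = file_bytes.replace(needle_utf8, b"TEST")
print(result == file_bytes)
```

If the two byte sequences differ like this, the search string never matches and no error is raised, which would match the "same line printed twice" symptom.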
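Okay. So it's probably a Unicode decoding problem, right? Maybe, but when I modified the code according to the Unicode-related advice I found on this site, problems of various shades persisted. Moreover, explicitly defining htmlBase as

htmlBase = """«test»"""

...leads me to believe that there is something about reading html files in Python that I don't know. I tried opening mm03.html in 'w' mode, but that didn't seem to work, and it clobbers the original file. It wouldn't make much sense for a string read from a read-only file to be read-only itself, but I could be wrong.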
Below are several script/output pairs I have worked through.

Adding the unquoted character 'u' before the string to be replaced:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()
Output:

½test╗
Traceback (most recent call last):
  File "test2.py", line 6, in <module>
    htmlFill = htmlFill.replace(u"«test»","TEST")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
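The 'ascii' codec in this traceback is plausibly the implicit conversion Python 2 performs when a byte string is combined with a unicode argument: the bytes are first decoded with the default ASCII codec. The following is a minimal sketch of that step made explicit; `raw` is a stand-in for the data read from the file, not the actual file contents.

```python
# Sketch of the implicit step: decode the byte string with the ASCII codec.
# ASSUMPTION: raw stands in for the bytes read from mm03.html.
raw = b"\xabtest\xbb"

try:
    raw.decode("ascii")   # what Python 2 attempts behind the scenes
    outcome = "decoded"
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded
    outcome = "failed at byte 0x%02x" % bytearray(raw)[exc.start]

print(outcome)
```

Any byte above 0x7f fails this step, so the error appears before the replacement is even attempted.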
Applying .decode('utf-8') to the string returned from .read():

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()
Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
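The "invalid start byte" message suggests the file is not UTF-8 at all. As a rough check, and assuming the file is actually cp1252 (or Latin-1, which agrees with it for these two characters), the sketch below shows that 0xab has the bit pattern of a UTF-8 continuation byte and that the same bytes decode cleanly under the assumed encoding. `raw` is again a stand-in for the file data.

```python
# ASSUMPTION: raw stands in for the bytes stored in the file.
raw = b"\xabtest\xbb"

# In UTF-8, bytes of the form 10xxxxxx are continuation bytes and can
# never begin a character.
first = bytearray(raw)[0]
is_continuation = (first & 0b11000000) == 0b10000000
print(is_continuation)

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ASSUMED encoding: cp1252 maps 0xAB and 0xBB to the two guillemets.
text = raw.decode("cp1252")
print(utf8_ok, text == u"\u00abtest\u00bb")
```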
Applying .encode('utf-8') to the string returned from .read():

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace(u"«test»","TEST")
print htmlFill
htmlBase.close()
Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
Applying .decode('utf-8') to the string returned from .read(), without the 'u' prefix on the target substring:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().decode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()
Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().decode('utf-8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
Applying .encode('utf-8') to the string returned from .read(), without the 'u' prefix on the target substring:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read().encode('utf-8')
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()
Output:

Traceback (most recent call last):
  File "test2.py", line 4, in <module>
    htmlFill = htmlBase.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
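For comparison with the attempts above, here is a minimal sketch of one approach that may avoid all of these errors: decode the file once, with an explicit encoding, and do the replacement on Unicode text. It assumes the file is cp1252-encoded, which the question does not confirm; `replace_in_file` and `ENCODING` are hypothetical names introduced for illustration, and the demo writes to a temporary copy rather than the real mm03.html.

```python
import io
import os
import tempfile

# ASSUMPTION: the file is cp1252-encoded; adjust ENCODING if it is not.
ENCODING = "cp1252"

def replace_in_file(path, old, new, encoding=ENCODING):
    """Read path as text, replace old with new, and return the new text."""
    with io.open(path, "r", encoding=encoding) as handle:
        text = handle.read()   # already decoded to Unicode text
    return text.replace(old, new)

# Demo on a throwaway copy so the real mm03.html is left untouched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mm03.html")
with io.open(demo_path, "wb") as handle:
    handle.write(b"\xabtest\xbb")

result = replace_in_file(demo_path, u"\u00abtest\u00bb", u"TEST")
print(result)
```

`io.open` accepts an `encoding` argument on both Python 2.6+ and Python 3, so the same sketch should behave the same way on either; the search string is written with escapes so that the encoding of the script file itself does not matter.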