美化汤在版权标志上失败-BeautifulSoupPrettifyfailsoncopyrightsymbol

作者：朱美萍客户 | 来源：互联网 | 2023-10-16 13:39

IamgettingaUnicodeerror:UnicodeEncodeError:charmapcodeccantencodecharacteru\xa9in

I am getting a Unicode error: UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in position 822: character maps to

我得到一个Unicode错误:UnicodeEncodeError:“charmap”编解码器不能在位置822对字符u'\xa9'进行编码:字符映射到

This appears to be a standard copyright symbol, and in the HTML is ©. I have not been able to find a way past this. I even tried a custom function to replace copy with a space but that also failed with the same error.

这似乎是一个标准的版权符号，在HTML是©。我一直没能找到一个办法过去。我甚至尝试了一个自定义函数来替换一个空格，但同样的错误也让我失败了。

import sys
import pprint
import mechanize
import COOKIElib
from bs4 import BeautifulSoup
import html2text
import lxml

def MakePretty():

def ChangeCopy(S):
    return S.replace(chr(169)," ")
br = mechanize.Browser()

# COOKIE Jar
cj = COOKIElib.LWPCOOKIEJar()
br.set_COOKIEjar(cj)

# Browser options
br.set_handle_equiv(True)
#br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but not hangs on refresh > 0
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

# The site we will navigate into, handling its session
# Open the site
br.open('http://www.thesitewizard.com/faqs/copyright-symbol.shtml')
html = br.response().read()
soup = BeautifulSoup(html)
print soup.prettify()

if __name__ == '__main__':
    MakePretty()

How do I get prettify past the copyright symbol? I have searched all over the web for a solution to no avail (or I might not understand as I am fairly new to Python and scraping).

我怎样才能在经过版权标志后变得更漂亮呢?我已经在web上搜索了所有的解决方案，但是没有任何用处(或者我可能不理解，因为我对Python和抓取非常陌生)。

Thanks for your help.

谢谢你的帮助。

4 个解决方案

#1

I had the same problem. This may work for you:

我也有同样的问题。这可能对你有用:

print soup.prettify().encode('UTF-8')

打印soup.prettify().encode(utf - 8)

#2

The page http://www.thesitewizard.com/faqs/copyright-symbol.shtml is sent without specifying character encoding. The page itself specifies the encoding as ISO-8859-1 in a meta tag, but only after the occurrence of the “©” character. So clients have to make a guess, and the guess may be wrong. If the client guesses UTF-8, then it will see the bit A9, which is a data error in UTF-8 data.

页面http://www.thesitewizard.com/faqs/copyright-symbol.shtml在发送时没有指定字符编码。页面本身将编码指定为iso - 8859 - 1在meta标签,但只有发生后的“©”性格。因此，客户必须进行猜测，而猜测可能是错误的。如果客户机猜测UTF-8，那么它将看到位A9，这是UTF-8数据中的一个数据错误。

So it seems that you need to set the encoding (to ISO-8859-1, or more safely to windows-1252) when reading the data. This is of course an ad hoc solution only; it makes no sense to fix the encoding in general.

因此，在读取数据时，似乎需要将编码设置为ISO-8859-1，或者更安全地设置为windows-1252。这当然是一个特别的解决方案;一般来说，修正编码是没有意义的。

#3

You're using chr(), which is wrong here because it expects ASCII and that goes only as far as 127/0x7F only (despite to popular folklore, ASCII is 7-bit only). 0xA9 / © is Unicode, so you should use unichr(169) instead.

#4

Just changing to unichr in the formatter function didn't work. Ended up using decode(formatter=blah) which did return unformatted html without the copyright symbol. Saved that html and fed that into prettify which did the trick.

仅仅在格式化程序函数中更改为unichr是不行的。最终使用decode(formatter=blah)返回没有版权符号的未格式化的html。保存了html，并将其添加到美化中，这就成功了。

推荐阅读

ip
Web动态服务器Python基本实现

Web动态服务器Python基本实现 ... [详细]

蜡笔小新 2024-11-21 08:01:30
ip
Spring与Quartz结合实现周期性任务调度

本文介绍了一个使用Spring框架和Quartz调度器实现每周定时调用Web服务获取数据的小项目。通过详细配置Spring XML文件，展示了如何设置定时任务以及解决可能遇到的自动注入问题。 ... [详细]

蜡笔小新 2024-11-19 19:14:50
string
JUnit下的测试和suite

nsitionalENhttp:www.w3.orgTRxhtml1DTDxhtml1-transitional.dtd ... [详细]

蜡笔小新 2024-11-21 16:03:49
join
Spring AOP学习笔记Advice执行顺序

一、Advice执行顺序二、Advice在同一个Aspect中三、Advice在不同的Aspect中一、Advice执行顺序如果多个Advice和同一个JointPoint连接& ... [详细]

蜡笔小新 2024-11-21 15:28:36
join
Requests库的基本使用方法

本文介绍了Python中Requests库的基础用法，包括如何安装、GET和POST请求的实现、如何处理Cookies和Headers，以及如何解析JSON响应。相比urllib库，Requests库提供了更为简洁高效的接口来处理HTTP请求。 ... [详细]

蜡笔小新 2024-11-21 13:17:41
tree
在OpenCV 3.1.0中实现SIFT与SURF特征检测

本文介绍如何在OpenCV 3.1.0版本中通过Python 2.7环境使用SIFT和SURF算法进行图像特征点检测。由于这些高级功能在OpenCV 3.0.0及更高版本中被移至额外的contrib模块，因此需要特别处理才能正常使用。 ... [详细]

蜡笔小新 2024-11-20 21:00:18
input
使用Service Locator模式实现高效的服务命名访问

本文探讨了如何通过Service Locator模式来简化和优化在B/S架构中的服务命名访问，特别是对于需要频繁访问的服务，如JNDI和XMLNS。该模式通过缓存机制减少了重复查找的成本，并提供了对多种服务的统一访问接口。 ... [详细]

蜡笔小新 2024-11-20 19:26:30
ip
Python实现文件下载的三种方式

在Python编程中，经常需要处理文件下载的任务。本文将介绍三种常用的下载方法：使用urllib、urllib2以及requests库进行HTTP请求下载，同时也会提及如何通过ftplib从FTP服务器下载文件。 ... [详细]

蜡笔小新 2024-11-20 11:56:32
input
解决Pytesser模块在Windows环境下出现的错误

本文详细探讨了如何解决在Windows环境中使用Pytesser模块进行OCR（光学字符识别）时遇到的WindowsError错误，提供了具体的解决方案。 ... [详细]

蜡笔小新 2024-11-19 11:32:27
string
Go语言中接口型函数的应用与解析

本文深入探讨了Go语言中的接口型函数，通过实例分析其灵活性和强大功能，帮助开发者更好地理解和运用这一特性。 ... [详细]

蜡笔小新 2024-11-21 12:21:19
regex
web: _show -> _info 造轮子编程

问题场景用Java进行web开发过程当中，当遇到很多很多个字段的实体时，最苦恼的莫过于编辑字段的查看和修改界面，发现2个页面存在很多重复信息，能不能写一遍？有没有轮子用都不如自己造。解决方式笔者根据自 ... [详细]

蜡笔小新 2024-11-21 10:21:24
string
spring boot使用jetty无法启动

spring boot使用jetty无法启动 ... [详细]

蜡笔小新 2024-11-21 10:15:52
input
java写简易五子棋游戏。

importjava.io.*;importjava.util.*;publicclass五子棋游戏{staticintm1;staticintn1;staticfinalintS ... [详细]

蜡笔小新 2024-11-20 17:34:54
string
web页面报表js下载,web报表软件

web页面报表js下载,web报表软件 ... [详细]

蜡笔小新 2024-11-16 18:37:21
string
Python IDE 设置字符集及编码转换的最佳实践

本文介绍了如何在 Python 脚本中规范文件编码，并提供了在不同字符集之间进行转换的方法，特别是在处理中文字符时的注意事项。 ... [详细]

蜡笔小新 2024-11-16 13:42:20

朱美萍客户

这个家伙很懒，什么也没留下！

Tags | 热门标签

RankList | 热门文章