I wrote a recursive function that returns the XPaths of all the text found inside tags, as a dictionary in the following format:
{'xpath1': {'text': 'text1'}, 'xpath2': {'text': 'text2'}, ...}
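Code:
from bs4 import BeautifulSoup, NavigableString

def get_xpaths_dict(soup, xpaths=None, curr_path=''):
    # Create a fresh dict per top-level call; a mutable default ({}) would be
    # shared across calls and carry counts over from earlier runs.
    if xpaths is None:
        xpaths = {}
    curr_path += '/{}'.format(soup.name)
    for item in soup.contents:
        if isinstance(item, NavigableString):
            # Skip whitespace-only text nodes.
            if item.strip():
                try:
                    # Path already seen: bump its counter and index this tag.
                    xpaths[curr_path]['count'] += 1
                    count = xpaths[curr_path]['count']
                    curr_path += '[{}]'.format(count)
                    xpaths[curr_path] = {'text': item.strip()}
                except KeyError:
                    # First text found at this path.
                    xpaths[curr_path] = {'text': item.strip(), 'count': 1}
        else:
            # Child tag: recurse with the path built so far.
            xpaths = get_xpaths_dict(item, xpaths, curr_path)
    return xpaths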
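# Sample input: nested <span> tags inside a <div>.
html = '''
<div>
    text of div 1
    <span>
        text of span 1.1
        <span>text of span 2.1</span>
        <span>
            text of span 2.2
            <span>text of span 3</span>
        </span>
    </span>
</div>'''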
soup = BeautifulSoup(html, 'html.parser')
xpaths = get_xpaths_dict(soup.div)
print(xpaths)
Output:
{'/div': {'text': 'text of div 1', 'count': 1}, '/div/span': {'text': 'text of span 1.1', 'count': 1}, '/div/span/span': {'text': 'text of span 2.1', 'count': 2}, '/div/span/span[2]': {'text': 'text of span 2.2'}, '/div/span/span[2]/span': {'text': 'text of span 3', 'count': 1}}
I know this isn't the output format you expected, but you can convert it to whatever format you need. For example, to get the expected output, just do:
expected_output = [(v['text'], k) for k, v in xpaths.items()]
print(expected_output)
Output:
[('text of div 1', '/div'), ('text of span 1.1', '/div/span'), ('text of span 2.1', '/div/span/span'), ('text of span 2.2', '/div/span/span[2]'), ('text of span 3', '/div/span/span[2]/span')]
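As another illustration of converting the result, here is a minimal sketch that drops the `count` bookkeeping key and serializes the mapping to JSON. The `xpaths` literal is copied from the output printed earlier; the names `clean` and `as_json` are illustrative only and not part of the function above.

```python
import json

# Dictionary copied from the printed output of get_xpaths_dict(soup.div).
xpaths = {
    '/div': {'text': 'text of div 1', 'count': 1},
    '/div/span': {'text': 'text of span 1.1', 'count': 1},
    '/div/span/span': {'text': 'text of span 2.1', 'count': 2},
    '/div/span/span[2]': {'text': 'text of span 2.2'},
    '/div/span/span[2]/span': {'text': 'text of span 3', 'count': 1},
}

# Keep only the text for each path; 'count' is internal bookkeeping.
clean = {path: info['text'] for path, info in xpaths.items()}

# Serialize to JSON, e.g. to hand the result to another tool.
as_json = json.dumps(clean, indent=2)
print(as_json)
```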
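Some explanation:
The extra `count` key in each dictionary entry stores how many tags with the same name have been seen so far at that path; it is what produces the `[2]`-style index for a repeated tag. Using a dictionary for this keeps the code efficient, because each tag only has to be visited once.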
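The counting idea can be shown in isolation. The sketch below is a simplified stand-in, not the function above: `index_paths` is a hypothetical helper that takes an already flattened list of paths and appends `[n]` to every repeat, using one dictionary lookup per path.

```python
def index_paths(paths):
    """Append '[n]' to the n-th repeat of a path (n >= 2), in a single pass."""
    seen = {}      # path -> number of times it has been seen so far
    indexed = []
    for path in paths:
        seen[path] = seen.get(path, 0) + 1
        n = seen[path]
        indexed.append(path if n == 1 else '{}[{}]'.format(path, n))
    return indexed

paths = ['/div', '/div/span', '/div/span/span', '/div/span/span']
result = index_paths(paths)
print(result)
# ['/div', '/div/span', '/div/span/span', '/div/span/span[2]']
```

As in the answer's function, the first occurrence keeps the bare path and only later occurrences receive an index.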
Bonus:
Since the function returns a dictionary keyed by XPath, you can look up any text by its XPath. For example:
xpaths = get_xpaths_dict(soup.div)
print(xpaths['/div/span/span[2]/span']['text'])
# text of span 3
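If you want to sanity-check the generated paths, one option is to resolve them with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. This is only a sketch under two assumptions: the sample markup is well-formed XML (its nesting here is inferred from the paths in the output), and `lookup` is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# Sample markup; the nesting is assumed from the paths in the printed output.
html = '''<div>
text of div 1
<span>
text of span 1.1
<span>text of span 2.1</span>
<span>
text of span 2.2
<span>text of span 3</span>
</span>
</span>
</div>'''

# Path -> text pairs taken from the printed output.
expected = {
    '/div': 'text of div 1',
    '/div/span': 'text of span 1.1',
    '/div/span/span': 'text of span 2.1',
    '/div/span/span[2]': 'text of span 2.2',
    '/div/span/span[2]/span': 'text of span 3',
}

def lookup(root, xpath):
    """Resolve an absolute path such as '/div/span' against the root element."""
    prefix = '/' + root.tag
    if not xpath.startswith(prefix):
        return None
    rest = xpath[len(prefix):]            # '/div/span' -> '/span'
    # ElementTree paths are relative to the element, so prepend '.'.
    node = root if not rest else root.find('.' + rest)
    if node is None or node.text is None:
        return None
    return node.text.strip()

root = ET.fromstring(html)
results = {path: lookup(root, path) for path in expected}
print(results == expected)
```

Note that in real XPath an unindexed step such as `/div/span/span` matches every such element; `find` returns only the first match, which is why it lines up with the first `span` here.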