html - How do I get all of the page source before a given element in Python?

 书友14395217 posted on 2022-10-27 18:30

    
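    <html>
        <head>
            <title>The Dormouse's story </title>
        </head>
        <body>
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p>
            <p id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>
                <p id='d1'>
                    <a id="a2">a2a2a2 </a>
                    <b id='b2'>b2b2b2</b>
                    <p id='p3'>p3p3p3</p>
                </p>
                <a id="a3">a3a3a3 </a>
            </p>
            <p id="p4">p4p4p4</p>
        </body>
    </html>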

For example, take the first a element, a#a1. I want to get all of the page source above this element:


    
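    <html>
        <head>
            <title>The Dormouse's story </title>
        </head>
        <body>
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p>
            <p id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>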

5 answers
  • The bs module (BeautifulSoup) is the handiest for this.

    Answered 2022-10-29 04:07
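A minimal sketch of how the BeautifulSoup route suggested above might look, assuming bs4 is installed and using its built-in html.parser backend. The function name source_up_to and the sample markup are hypothetical, made up for illustration; the sample is a simplified, well-formed variant of the question's page. The sketch keeps the target element and removes everything that follows it in document order.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a simplified, well-formed variant of the question's page
sample_html = """<html>
<head><title>The Dormouse's story</title></head>
<body>
<p id="p1">p1p1p1 <b id="b1">b1b1b1</b></p>
<div id="d1">
<ul id="u1"><li>u1u1u1</li></ul>
<a id="a1">a1a1a1</a>
<a id="a2">a2a2a2</a>
</div>
<p id="p4">p4p4p4</p>
</body>
</html>"""


def source_up_to(markup, element_id):
    """Return the document with every node after the target element removed."""
    soup = BeautifulSoup(markup, "html.parser")
    target = soup.find(id=element_id)
    if target is None:
        return None
    node = target
    # Climb from the target towards <body>, pruning following siblings at each level
    while node is not None and node.name != "body":
        # Copy to a list first: removing nodes while walking the
        # next_siblings generator would cut the iteration short
        for sibling in list(node.next_siblings):
            sibling.extract()
        node = node.parent
    return str(soup)


result = source_up_to(sample_html, "a1")
print(result)
```

Because the parser rebuilds the tree, the result also contains the closing tags of the target's ancestors, so it is a pruned document rather than a raw prefix of the original string.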
  • I'm a beginner and have only learned the re module, so I extract it with re plus plain string slicing.

    >>> import re
    >>> html = '''
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head> 
        <body> 
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p> 
            <p id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>
                <p id='d1'>
                    <a id="a2">a2a2a2 </a>
                    <b id='b2'>b2b2b2</b>
                    <p id='p3'>p3p3p3</p>
                </p>
                <a id="a3">a3a3a3 </a>
            </p> 
            <p id="p4">p4p4p4</p>
        </body>
    </html>
    '''
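    >>> # Find the first <a ...> tag; \b stops the pattern from matching tags such as <abbr>
    >>> match = re.search(r'<a\b', html)
    >>> # Print everything that comes before the start of that tag
    >>> print(html[:match.start()])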
    
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head> 
        <body> 
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p> 
            <p id="p2">p2p2p2
                <ul id='u1'>u1u1u1</ul>
                
    >>> 
    Answered 2022-10-29 04:15
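The regex above stops at the first `<a` it meets, wherever that is. If the goal is the exact original text in front of one specific element, one option is to let the standard library's html.parser locate the start tag by id and then slice the original string. This is only a sketch: the class name StartTagLocator, the function name source_before, and the sample markup are hypothetical and not part of the question.

```python
from html.parser import HTMLParser

# Hypothetical sample: a simplified variant of the question's page
sample_html = """<html>
<body>
<p id="p1">p1p1p1</p>
<div id="d1">
<ul id="u1"><li>u1u1u1</li></ul>
<a id="a1">a1a1a1</a>
<a id="a2">a2a2a2</a>
</div>
</body>
</html>"""


class StartTagLocator(HTMLParser):
    """Remember where the start tag of the element with a given id begins."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.position = None  # (line, column) of the first matching start tag

    def handle_starttag(self, tag, attrs):
        if self.position is None and dict(attrs).get("id") == self.element_id:
            # getpos() returns a 1-based line number and a 0-based column
            self.position = self.getpos()


def source_before(markup, element_id):
    """Return the part of markup that precedes the element, or None if absent."""
    locator = StartTagLocator(element_id)
    locator.feed(markup)
    locator.close()
    if locator.position is None:
        return None
    line, column = locator.position
    # HTMLParser counts lines by "\n", so split the same way when
    # converting (line, column) into an absolute character offset
    lines = markup.split("\n")
    offset = sum(len(text) + 1 for text in lines[: line - 1]) + column
    return markup[:offset]


before = source_before(sample_html, "a1")
print(before)
```

Unlike the tree-pruning answers, this returns an untouched prefix of the input, so whitespace and any malformed markup in front of the element are preserved as written.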
  • from bs4 import BeautifulSoup as bs
    from bs4 import NavigableString
    
    def dropAllNextEle(eleOfBS, keepFunc=None):
        # Remove every node that comes after eleOfBS by recursively removing the
        # following siblings of eleOfBS and of each of its ancestors, nearest first,
        # up to <body>. keepFunc is optional: it is called with each following
        # sibling and returns True to keep that node, anything falsy to drop it.
        if eleOfBS is None: return
        if eleOfBS.name == 'body': return
        # Copy the siblings into a list first, because removing nodes while
        # walking the next_siblings generator would cut the iteration short.
        for item in list(eleOfBS.next_siblings):
            if keepFunc is None or not keepFunc(item):
                item.extract()
        dropAllNextEle(eleOfBS.parent, keepFunc)
    
    # html_source is assumed to hold the page HTML as a string
    soup = bs(html_source, 'html5lib')
    a1_ele = soup.find('a', id = 'a1')
    # Keep plain text nodes and drop the element nodes that follow a#a1
    dropAllNextEle(a1_ele, lambda item: type(item) is NavigableString)
    print(soup)
    Answered 2022-10-29 04:19
  • Use bs4 to extract it.

    Answered 2022-10-29 04:22
  • Your original HTML isn't valid, so I changed it slightly.
    Below is a solution using lxml.

    doc = '''
    <html>
        <head>
            <title>The Dormouse's story </title>
        </head> 
        <body> 
            <p id="p1">p1p1p1
                <b id='b1'>b1b1b1</b>
            </p> 
            <p id="p2">p2p2p2</p>
            <p id='d1'>
                <ul id='u1'>u1u1u1</ul>
                <a id="a1">a1a1a1</a>
                <p id='d2'>
                    <a id="a2">a2a2a2 </a>
                    <b id='b2'>b2b2b2</b>
                    <p id='p3'>p3p3p3</p>
                </p>
                <a id="a3">a3a3a3 </a>
            </p> 
            <p id="p4">p4p4p4</p>
        </body>
    </html>
    '''
    
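    from lxml import html
    
    tree = html.fromstring(doc)
    a = tree.get_element_by_id("a1")
    # Show the target element and the full tree before pruning
    print(html.tostring(a).decode())
    print(html.tostring(tree).decode())
    
    def dropnode(e=None):
        # Drop every following sibling of e, then repeat for each ancestor up to <body>
        if e is None: return
        if e.tag == 'body': return
        nd = e.getnext()
        while nd is not None:
            # drop_tree() removes the sibling and its children but keeps its tail text
            nd.drop_tree()
            nd = e.getnext()
        dropnode(e.getparent())
    
    dropnode(a)
    # The tree now ends at a#a1, followed only by the closing tags of its ancestors
    print(html.tostring(tree).decode())
    Answered 2022-10-29 04:23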