Python Code for Personal Use (Cleaning Web Page Source Code for Standards from a Certain Site)

Author: aihyuksj_967 | Source: Internet | 2023-05-19 16:01

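This script cleans the "standard" data stored in MongoDB. Each record holds the source code of a web page, from which the following fields must be extracted:

standard title, foreign-language title, standard number, issuing organization, release date, status, implementation date, format and page count, adoption relationship, Chinese Library Classification (CLC) number, Chinese Classification for Standards (CCS) number, International Classification for Standards (ICS) number, country, keywords, abstract, and superseded standard.

The extracted values are assembled into a dictionary and stored in a second collection.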
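As a rough orientation, the sixteen fields correspond to the dictionary keys that `standard_dict` assigns in the script below. The sketch only restates that mapping in Python 3; the names `FIELD_KEYS` and `empty_record` and the English glosses are illustrative and not part of the original script.

```python
# Illustrative only: FIELD_KEYS and empty_record are hypothetical names.
# The keys themselves are the ones standard_dict() assigns in the script.
FIELD_KEYS = [
    ("title", "standard title"),
    ("title_eng", "foreign-language title"),
    ("standard_number", "standard number"),
    ("publishing_department", "issuing organization"),
    ("release_date", "release date"),
    ("state", "status"),
    ("enforcement_date", "implementation date"),
    ("pages", "format and page count"),
    ("adopted", "adoption relationship"),
    ("clc", "Chinese Library Classification number"),
    ("ccs", "Chinese Classification for Standards number"),
    ("ics", "International Classification for Standards number"),
    ("country", "country"),
    ("keywords", "keywords"),
    ("summary", "abstract"),
    ("replace_for", "superseded standard"),
]


def empty_record():
    """Return a record with every key present and set to an empty string."""
    return {key: "" for key, _ in FIELD_KEYS}
```

A placeholder record like this can be handy when checking that a cleaned document carries every expected key.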

#coding=utf-8
from pymongo import MongoClient
from lxml import etree
import requests

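# Field labels as they appear in the page source; kept in Chinese because the XPath queries match them literally
s = [u'标准编号：',u'发布单位：',u'发布日期：',u'状态：',u'实施日期：',u'开本页数：',u'采用关系：',
    u'中图分类号：',u'中国标准分类号：',u'国际标准分类号：',u'国别：',u'关键词：',u'摘要：']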

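# Connect to the database
def get_db():
    # 'IP', "username" and "password" are placeholders for the real host and credentials.
    # Credentials are passed to MongoClient because Database.authenticate() was removed in PyMongo 4.
    client = MongoClient('IP', 27017, username="username", password="password", authSource="wanfang")
    return client.wanfang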

# Return the content of the num-th document (1-based)
def get_data(table, num):
    i = 1
    for item in table.find({}, {"content":1,"_id":0}):
        if i==num:
            # dict.has_key() exists only in Python 2; .get() works in both versions
            if item.get('content'):
                return item['content']
        else:
            i+=1

# Return the first element of a list, or "" if the list is empty
def list_str(ls):
    if len(ls)!=0:
        return ls[0]
    else:
        return ""

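# Extract classification numbers, dropping any token that contains a parenthesis
def code_ls(ls):
    if len(ls)!=0:
        # The full-width parentheses need the u prefix: in Python 2, testing a
        # non-ASCII byte string against unicode text raises UnicodeDecodeError
        brackets = (u"(", u")", u"（", u"）")
        return [i for i in ls[0].split() if not any(b in i for b in brackets)]
    else:
        return ""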

# Build the keyword list
def keywords_ls(ls):
    if len(ls)!=0:
        return ls
    else:
        return ""

# Superseded standard
def replace_str(replace):
    ls = [i.strip().replace("\r\n", "") for i in replace]
    if len(ls)!=0:
        # Drop the first five characters (presumably the leading label "替代标准：")
        return ls[0][5:]
    else:
        return ""

# Extract the abstract, skipping text that starts with "<" (leftover markup)
def summary_str(ls):
    if len(ls)!=0:
        if ls[0][0]!="<":
            return ls[0]
        else:
            return ""
    else:
        return ""

# Normalize a date such as 2017年6月28日 to 2017-06-28
def date_str(ls):
    if len(ls)!=0:
        year = ls[0].find(u'年')
        month = ls[0].find(u'月')
        day = ls[0].find(u'日')
        # A single-digit month or day leaves a gap of 2 between the markers; pad it with a zero
        if month-year==2:
            ls[0] = ls[0].replace(u"年",u"年0")
        if day-month==2:
            ls[0] = ls[0].replace(u"月",u"月0")
        return ls[0].replace(u"日",u"").replace(u"月",u"-").replace(u"年",u"-")
    else:
        return ""

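# Parse the adoption relationship into a list of {"code", "type"} dicts
def adopted_ls(string, ls):
    dc = {}
    loc = string.find(',')
    if loc==-1:
        return ls
    else:
        # Text before the comma is the code; the three characters after it are the type
        dc["code"] = string[:loc].strip()
        dc["type"] = string[loc+1:loc+4]
        ls.append(dc)
        # Recurse on the remainder to pick up further entries
        return adopted_ls(string[loc+4:],ls)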

# Build the dictionary that is stored for one standard
def standard_dict(html):
    dc = {}
    tree = etree.HTML(html)
    # Standard title
    dc["title"] = list_str(tree.xpath("//h1/text()"))
    # Foreign-language title
    dc["title_eng"] = list_str(tree.xpath("//h2/text()"))
    # Standard number
    dc["standard_number"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[0])))
    # Issuing organization
    dc["publishing_department"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[1])))
    # Release date
    dc["release_date"] = date_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[2])))
    # Status
    dc["state"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[3])))
    # Implementation date
    dc["enforcement_date"] = date_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[4])))
    # Format and page count
    dc["pages"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[5])))
    # Adoption relationship
    dc["adopted"] = adopted_ls(list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[6]))), [])
    # Chinese Library Classification (CLC) number
    dc["clc"] = code_ls(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[7])))
    # Chinese Classification for Standards (CCS) number
    dc["ccs"] = code_ls(tree.xpath("//span[text()='%s']/following-sibling::*/child::*/text()"%(s[8])))
    # International Classification for Standards (ICS) number
    dc["ics"] = code_ls(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[9])))
    # Country
    dc["country"] = list_str(tree.xpath("//span[text()='%s']/following-sibling::*/text()"%(s[10])))
    # Keywords
    dc["keywords"] = keywords_ls(tree.xpath("//span[text()='%s']/following-sibling::*/child::*/text()"%(s[11])))
    # Abstract
    dc["summary"] = summary_str(tree.xpath("//span[text()='%s']/parent::*/following-sibling::*/text()"%(s[12])))
    # Superseded standard
    dc["replace_for"] = replace_str(tree.xpath("//div[@id='replaceStandard']//child::*//text()"))
    return dc

# Main function
def main():
    db = get_db()
    collection = db.standard
    collection2 = db.standard_cleaned
    for item in collection.find({}, {"content":1,"_id":0}):
        # dict.has_key() exists only in Python 2; .get() works in both versions
        if item.get('content'):
            dc = standard_dict(item['content'])
            # Collection.insert() was removed in PyMongo 4; insert_one() replaces it
            collection2.insert_one(dc)

if __name__ == '__main__':
    main()
    
    # The code below tests cleaning one specific record
    # db = get_db()
    # collection = db.standard
    # collection2 = db.standard_cleaned
    # data = get_data(collection, 8)
    # dc = standard_dict(data)
    # collection2.insert_one(dc)
    # for k,v in dc.items():
    #     print(k, v)

    # # The code below tests abstract extraction
    # data = requests.get('http://d.wanfangdata.com.cn/Standard/ISO%208528-5-2013')
    # dc = standard_dict(data.text)
    # for k,v in dc.items():
    #     print(k, v)

    # # The code below tests the date format conversion
    # l1 = [u"2017年6月28日"]
    # l2 = [u"2017年10月27日"]
    # l3 = [u"2017年12月1日"]
    # l4 = [u"2017年7月1日"]
    # print(date_str(l1))
    # print(date_str(l2))
    # print(date_str(l3))
    # print(date_str(l4))
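
The two string helpers, `date_str` and `adopted_ls`, depend on index arithmetic and recursion. As an alternative sketch in Python 3 (not the author's method), the same results can be had with a regular expression and a loop. The names `normalize_date` and `parse_adopted` are hypothetical, the adoption string in the usage lines is a made-up sample, and the `code,TYP` layout is only inferred from the slicing in `adopted_ls`. Unlike `date_str`, `normalize_date` returns an empty string when the text contains no recognizable date.

```python
import re

# Hypothetical helpers; they are not part of the original script.
_DATE_RE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def normalize_date(text):
    """Convert a date such as 2017年6月28日 to 2017-06-28, or return ""."""
    match = _DATE_RE.search(text)
    if match is None:
        return ""
    year, month, day = match.groups()
    # Zero-pad month and day to two digits
    return "%s-%02d-%02d" % (year, int(month), int(day))


def parse_adopted(text):
    """Split 'code,TYP code,TYP ...' into a list of code/type dicts.

    Assumes, as adopted_ls does, that the type is exactly the three
    characters following each comma.
    """
    pairs = []
    rest = text
    while True:
        loc = rest.find(",")
        if loc == -1:
            return pairs
        pairs.append({"code": rest[:loc].strip(), "type": rest[loc + 1:loc + 4]})
        rest = rest[loc + 4:]


print(normalize_date("2017年6月28日"))   # 2017-06-28
print(normalize_date("2017年12月1日"))   # 2017-12-01
print(parse_adopted("ISO 8528-5:2005,IDT"))  # made-up sample input
```

The loop in `parse_adopted` follows the recursive version step by step, so an input with no comma produces an empty list in both.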
