Crawler Logic and Data Storage
Author: pokiyo6836 | Source: Internet | 2023-05-17 22:29
#### HBase table structure

```mysql
create 'itdaan:post',
  {NAME => 'a', VERSIONS => 1},
  {NAME => 'b', VERSIONS => 1},
  {SPLITS => [
    '01','02','03','04','05','06','07','08','09','0a','0b','0c','0d','0e','0f',
    '10','11','12','13','14','15','16','17','18','19','1a','1b','1c','1d','1e','1f',
    '20','21','22','23','24','25','26','27','28','29','2a','2b','2c','2d','2e','2f',
    '30','31','32','33','34','35','36','37','38','39','3a','3b','3c','3d','3e','3f',
    '40','41','42','43','44','45','46','47','48','49','4a','4b','4c','4d','4e','4f',
    '50','51','52','53','54','55','56','57','58','59','5a','5b','5c','5d','5e','5f',
    '60','61','62','63','64','65','66','67','68','69','6a','6b','6c','6d','6e','6f',
    '70','71','72','73','74','75','76','77','78','79','7a','7b','7c','7d','7e','7f',
    '80','81','82','83','84','85','86','87','88','89','8a','8b','8c','8d','8e','8f',
    '90','91','92','93','94','95','96','97','98','99','9a','9b','9c','9d','9e','9f',
    'a0','a1','a2','a3','a4','a5','a6','a7','a8','a9','aa','ab','ac','ad','ae','af',
    'b0','b1','b2','b3','b4','b5','b6','b7','b8','b9','ba','bb','bc','bd','be','bf',
    'c0','c1','c2','c3','c4','c5','c6','c7','c8','c9','ca','cb','cc','cd','ce','cf',
    'd0','d1','d2','d3','d4','d5','d6','d7','d8','d9','da','db','dc','dd','de','df',
    'e0','e1','e2','e3','e4','e5','e6','e7','e8','e9','ea','eb','ec','ed','ee','ef',
    'f0','f1','f2','f3','f4','f5','f6','f7','f8','f9','fa','fb','fc','fd','fe','ff'
  ]}
```

The table has two column families, `a` and `b`, each keeping a single version per cell. The 255 split points pre-split it into 256 regions keyed on the first two hex characters of the rowkey.

#### HBase field descriptions

| Column family | Column     | Notes                         |
| ------------- | ---------- | ----------------------------- |
| rowkey        | md5(url)   | MD5 hash of the URL           |
| a             | title      | Title                         |
| a             | content    | Content                       |
| a             | author     | Author                        |
| a             | postTime   | Publication time              |
| a             | url        | URL                           |
| a             | tags       | Tags                          |
| a             | webId      | Website ID                    |
| a             | typeId     | Website type ID               |
| a             | viewNum    | View count                    |
| a             | isCrowd    | Whether the crawl succeeded   |
| a             | crowdTimes | Number of failed attempts     |
| a             | replies    | Reply content                 |
| a             | crowdTime  | Crawl time                    |

#### Hive table structure

```mysql
CREATE EXTERNAL TABLE if not exists post(
  id string,
  title string,
  content string,
  author string,
  postTime string,
  url string,
  tags string,
  webId string,
  typeId string,
  viewNum string,
  voteNum string,
  commentNum string,
  upNum string,
  downNum string,
  crowdTimes string,
  `replies` string,
  crowdTime string,
  publish string,
  updateTime string,
  updatetimes bigint
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,a:title,a:content,a:author,a:postTime,a:url,a:tags,a:webId,a:typeId,a:viewNum,a:voteNum,a:commentNum,a:upNum,a:downNum,a:crowdTimes,a:replies,a:crowdTime,a:publish,a:updateTime,:timestamp')
TBLPROPERTIES('hbase.table.name' = 'itdaan:post');
```

#### Solr table structure

```mysql
add jar /data/soft/hive/lib/Hive2Solr-2.1.1-5.5.3.jar;

drop table if exists solr_post;

create external table solr_post (
  id string,
  title string,
  text string,
  author string,
  url string,
  tags string,
  web_id int,
  type_id int,
  vote_num int,
  view_num int,
  comment_num int,
  post_time string
)
stored by "com.jazz.hive.store.SolrStorageHandler"
tblproperties(
  'solr.url' = 'http://localhost:8983/solr/contentindex',
  'solr.query' = '*:*',
  'solr.cursor.batch.size' = '1000',
  'solr.primary_key' = 'id',
  'is.solrcloud' = '0'
);
```

#### Building the Solr index

```mysql
INSERT OVERWRITE TABLE solr_post
select
  id,
  title,
  content as `text`,
  author,
  url,
  if(tags is null, "", concat("_array_", tags)) as tags,
  webId,
  typeId,
  if(voteNum is null, 100, voteNum) as voteNum,
  if(viewNum is null, 0, viewNum) as viewNum,
  if(commentNum is null, 0, commentNum) as commentNum,
  (case
     when length(postTime) == 10
       then from_unixtime(unix_timestamp(postTime, 'yyyy-MM-dd'), "yyyy-MM-dd'T'HH:mm:ss'Z'")
     else from_unixtime(unix_timestamp(postTime, 'yyyy-MM-dd HH:mm:ss'), "yyyy-MM-dd'T'HH:mm:ss'Z'")
   end) as postTime
from post
where publish != '0' and updateTime >= '2017-02-10';
```

```shell
# Check Kafka consumer offsets
kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group page_url_crawler_group_1 --topic page_url_1 --zookeeper #:2181
```
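To make the rowkey and date handling above concrete, here is a minimal Python sketch. It is not the original crawler code: the function names are hypothetical, and it assumes the rowkey is the lowercase hex MD5 digest of the UTF-8 encoded URL (the post only says `md5(url)`). It shows how such a key maps onto the 256 pre-split regions, and mirrors the `postTime` case expression from the Solr insert.

```python
import hashlib
from datetime import datetime
from typing import List


def row_key(url: str) -> str:
    """Hex MD5 digest of the URL (assumed UTF-8, lowercase hex)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def split_points() -> List[str]:
    """Two-character hex prefixes '01'..'ff', as in the SPLITS list."""
    return [f"{i:02x}" for i in range(1, 256)]


def region_index(key: str) -> int:
    """Index (0..255) of the pre-split region a lowercase-hex rowkey lands in."""
    return int(key[:2], 16)


def to_solr_date(post_time: str) -> str:
    """Reformat 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss' as yyyy-MM-dd'T'HH:mm:ss'Z'.

    Like the Hive expression, this only reformats the string; it does not
    convert between time zones.
    """
    fmt = "%Y-%m-%d" if len(post_time) == 10 else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(post_time, fmt).strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    key = row_key("http://example.com/post/1")  # example URL, not from the post
    print(key, region_index(key))
    print(to_solr_date("2017-02-10"))
```

Because MD5 output is close to uniformly distributed, keying on `md5(url)` should spread writes roughly evenly across the pre-split regions, which is presumably why the table is split on two-character hex prefixes.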