
# Crawler Logic and Data Storage

#### HBase table structure

```mysql
create '#:post',
  {NAME=>'a',VERSIONS=>1},
  {NAME=>'b',VERSIONS=>1},
  {SPLITS=>[
    '01','02','03','04','05','06','07','08','09','0a','0b','0c','0d','0e','0f',
    '10','11','12','13','14','15','16','17','18','19','1a','1b','1c','1d','1e','1f',
    '20','21','22','23','24','25','26','27','28','29','2a','2b','2c','2d','2e','2f',
    '30','31','32','33','34','35','36','37','38','39','3a','3b','3c','3d','3e','3f',
    '40','41','42','43','44','45','46','47','48','49','4a','4b','4c','4d','4e','4f',
    '50','51','52','53','54','55','56','57','58','59','5a','5b','5c','5d','5e','5f',
    '60','61','62','63','64','65','66','67','68','69','6a','6b','6c','6d','6e','6f',
    '70','71','72','73','74','75','76','77','78','79','7a','7b','7c','7d','7e','7f',
    '80','81','82','83','84','85','86','87','88','89','8a','8b','8c','8d','8e','8f',
    '90','91','92','93','94','95','96','97','98','99','9a','9b','9c','9d','9e','9f',
    'a0','a1','a2','a3','a4','a5','a6','a7','a8','a9','aa','ab','ac','ad','ae','af',
    'b0','b1','b2','b3','b4','b5','b6','b7','b8','b9','ba','bb','bc','bd','be','bf',
    'c0','c1','c2','c3','c4','c5','c6','c7','c8','c9','ca','cb','cc','cd','ce','cf',
    'd0','d1','d2','d3','d4','d5','d6','d7','d8','d9','da','db','dc','dd','de','df',
    'e0','e1','e2','e3','e4','e5','e6','e7','e8','e9','ea','eb','ec','ed','ee','ef',
    'f0','f1','f2','f3','f4','f5','f6','f7','f8','f9','fa','fb','fc','fd','fe','ff']}
```

The 255 split points pre-split the table into 256 regions, keyed on the first two hex characters of the rowkey.

#### HBase field descriptions

| Column family | Column     | Notes                       |
| ------------- | ---------- | --------------------------- |
| rowkey        | md5(url)   | MD5 hash of the URL         |
| a             | title      | Title                       |
| a             | content    | Content                     |
| a             | author     | Author                      |
| a             | postTime   | Publication time            |
| a             | url        | URL                         |
| a             | tags       | Tags                        |
| a             | webId      | Website ID                  |
| a             | typeId     | Website type ID             |
| a             | viewNum    | View count                  |
| a             | isCrowd    | Whether the crawl succeeded |
| a             | crowdTimes | Number of failed attempts   |
| a             | replies    | Reply content               |
| a             | crowdTime  | Crawl time                  |

#### Hive table structure

```mysql
CREATE EXTERNAL TABLE if not exists post(
  id string,
  title string,
  content string,
  author string,
  postTime string,
  url string,
  tags string,
  webId string,
  typeId string,
  viewNum string,
  voteNum string,
  commentNum string,
  upNum string,
  downNum string,
  crowdTimes string,
  `replies` string,
  crowdTime string,
  publish string,
  updateTime string,
  updatetimes bigint
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES('hbase.columns.mapping' = ':key,a:title,a:content,a:author,a:postTime,a:url,a:tags,a:webId,a:typeId,a:viewNum,a:voteNum,a:commentNum,a:upNum,a:downNum,a:crowdTimes,a:replies,a:crowdTime,a:publish,a:updateTime,:timestamp')
TBLPROPERTIES('hbase.table.name' = '#:post');
```

#### Solr table structure

```mysql
add jar /data/soft/hive/lib/Hive2Solr-2.1.1-5.5.3.jar;

drop table if exists solr_post;

create external table solr_post (
  id string,
  title string,
  text string,
  author string,
  url string,
  tags string,
  web_id int,
  type_id int,
  vote_num int,
  view_num int,
  comment_num int,
  post_time string
)
stored by "com.jazz.hive.store.SolrStorageHandler"
tblproperties(
  'solr.url' = 'http://localhost:8983/solr/contentindex',
  'solr.query' = '*:*',
  'solr.cursor.batch.size' = '1000',
  'solr.primary_key' = 'id',
  'is.solrcloud' = '0'
);
```

#### Building the Solr index

```mysql
INSERT OVERWRITE TABLE solr_post
select
  id,
  title,
  content as `text`,
  author,
  url,
  if(tags is null, "", concat("_array_", tags)) as tags,
  webId,
  typeId,
  if(voteNum is null, 100, voteNum) as voteNum,
  if(viewNum is null, 0, viewNum) as viewNum,
  if(commentNum is null, 0, commentNum) as commentNum,
  -- postTime is either a date only (10 characters) or a full timestamp
  (case
    when length(postTime) == 10
      then from_unixtime(unix_timestamp(postTime, 'yyyy-MM-dd'), "yyyy-MM-dd'T'HH:mm:ss'Z'")
    else from_unixtime(unix_timestamp(postTime, 'yyyy-MM-dd HH:mm:ss'), "yyyy-MM-dd'T'HH:mm:ss'Z'")
  end) as postTime
from post
where publish != '0' and updateTime >= '2017-02-10'
```

```shell
# Check Kafka consumer offsets
kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --group page_url_crawler_group_1 --topic page_url_1 --zookeeper #:2181
```
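
The rowkey scheme from the HBase section (rowkey = md5(url), table pre-split on '01' through 'ff') can be illustrated with a minimal Python sketch. It assumes the rowkey is the lowercase hex MD5 digest of the UTF-8 encoded URL, which the notes do not state explicitly. The helper names `rowkey` and `region_index` and the example URL are hypothetical, not part of the original crawler.

```python
import bisect
import hashlib

# The 255 split points '01'..'ff' from the create statement yield 256 regions.
SPLITS = [f"{i:02x}" for i in range(1, 256)]


def rowkey(url: str) -> str:
    """Return the hex MD5 digest of the URL (assumed rowkey format)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def region_index(key: str) -> int:
    """Return the index (0..255) of the pre-split region that would hold key.

    Each region covers [start_key, end_key), so the index equals the
    number of split points that are less than or equal to the key.
    """
    return bisect.bisect_right(SPLITS, key)


if __name__ == "__main__":
    url = "https://example.com/post/1"  # hypothetical URL
    key = rowkey(url)
    print(key, region_index(key))
```

Because MD5 output is close to uniformly distributed, keys built this way should spread writes roughly evenly across the 256 regions instead of concentrating them on a single region server.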