
Configuring JanusGraph bulk loading with Spark in yarn-client mode

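JanusGraph is a distributed graph database forked from Titan. By default its bulk loader (bulkload) runs Spark in local mode, and yarn-cluster mode is not supported. yarn-client mode is supported, but the official documentation does not explain how to set it up, and there are many pitfalls along the way. This article describes how to configure bulk loading in yarn-client mode.
It covers the basic configuration first, then the bulk-loading configuration, and finally bulk-loading performance tuning.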

Software versions used in this article:
janusgraph: 0.1.1
hbase: 1.1.2
hadoop: 2.7.1

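Basic configuration
  1. Download JanusGraph from the official site and extract it into the local directory /data/janusgraph/.
  2. Configure the storage and index backends of the graph database. We use Elasticsearch + HBase, so we edit /data/janusgraph/conf/janusgraph-hbase-es.properties directly: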

# Important
gremlin.graph=org.janusgraph.core.JanusGraphFactory
# HBase settings
storage.batch-loading=true
storage.backend=hbase
storage.hostname=c1-nn1.bdp.idc,c1-nn2.bdp.idc,c1-nn3.bdp.idc
storage.hbase.ext.hbase.zookeeper.property.clientPort=2181
storage.hbase.table = yisou:test_graph
# Elasticsearch settings
index.search.backend=elasticsearch
# Elasticsearch is installed only on the local machine; this is the local IP.
# (Keep comments on their own line: a trailing "#..." would become part of the value.)
index.search.hostname=10.120.64.69
index.search.elasticsearch.client-only=true
index.search.index-name=yisou_test_graph
# Default cache settings
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5

  3. Adjust the jars under /data/janusgraph/lib. Running the bulk load in yarn-client mode exposed conflicts in Guava and other jars, so I adjusted the jars under lib according to the conflicts I hit. Three jars were affected:

  1. hbase-client-1.2.4.jar ==> yisou-hbase-1.0-SNAPSHOT.jar
    The Guava version used by lib/hbase-client-1.2.4.jar conflicts with the Guava version on our YARN cluster, so we replaced it with an in-house hbase-client build that has Guava removed, yisou-hbase-1.0-SNAPSHOT.jar.
    Without the replacement, the job fails with "Caused by: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator"
  2. spark-assembly-1.6.1-hadoop2.6.0.jar ==> spark-assembly-1.6.2-hadoop2.6.0.jar
    The spark-assembly-1.6.1-hadoop2.6.0.jar shipped in lib also causes a conflict, so I replaced it with spark-assembly-1.6.2-hadoop2.6.0.jar.
    Without the replacement, the job fails with "java.lang.NoSuchMethodError: groovy.lang.MetaClassImpl.hasCustomStaticInvokeMethod()Z"
  3. Delete hbase-protocol-1.2.4.jar.
    If it is not deleted, the job fails with "com.google.protobuf.ServiceException: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.protobuf.generated.RPCProtos$ConnectionHeader$Builder.setVersionInfo(Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$VersionInfo;)Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$ConnectionHeader$Builder;"

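  4. Define the edge and vertex properties of the graph. See the official documentation for details; this article does not go into them.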
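
The jar changes in step 3 are easier to undo if the conflicting jars are moved aside instead of deleted outright. The sketch below shows one way to do that. The function name quarantine_jars and the backup directory are assumptions made for illustration; they are not part of JanusGraph.

```shell
#!/bin/sh
# quarantine_jars LIB_DIR BACKUP_DIR JAR...
# Moves each named jar from LIB_DIR into BACKUP_DIR (created if needed).
# Jars that are not present are reported and skipped, so the function can
# be re-run safely. Both directory arguments are hypothetical examples.
quarantine_jars() {
  lib_dir="$1"
  backup_dir="$2"
  shift 2
  mkdir -p "$backup_dir"
  for jar in "$@"; do
    if [ -f "$lib_dir/$jar" ]; then
      mv "$lib_dir/$jar" "$backup_dir/"
      echo "moved $jar"
    else
      echo "skipped $jar (not found)"
    fi
  done
}

# Example call (commented out so that sourcing this file changes nothing):
# quarantine_jars /data/janusgraph/lib /data/janusgraph/lib-backup \
#   hbase-client-1.2.4.jar spark-assembly-1.6.1-hadoop2.6.0.jar hbase-protocol-1.2.4.jar
```

After the conflicting jars are moved, the replacement jars still have to be copied into lib by hand. To roll back, move the files from the backup directory back into lib.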

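Bulk-loading configuration

The import job runs on YARN, so the Hadoop environment has to be visible to the Gremlin console. Two files need to change: the JanusGraph launch script /data/janusgraph/lib/gremlin.sh, and the Hadoop and Spark configuration /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.

1. Copy /data/janusgraph/lib/gremlin.sh, naming the copy yarn-gremlin.sh here, then add the Hadoop settings to JAVA_OPTIONS and CLASSPATH. This ensures the program can read the Hadoop configuration, so the Spark job starts on YARN correctly.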

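#!/bin/bash
# HADOOP_HOME must be set before anything that is derived from it
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
# Native Hadoop libraries for the JVM started by gremlin.sh
export JAVA_OPTIONS="$JAVA_OPTIONS -Djava.library.path=$HADOOP_HOME/lib/native"
# Put the Hadoop configuration directory on the classpath
export CLASSPATH=$HADOOP_CONF_DIR
# JANUSGRAPH_HOME is the JanusGraph install directory; defaults to /data/janusgraph if unset
export JANUSGRAPH_HOME=${JANUSGRAPH_HOME:-/data/janusgraph}
cd "$JANUSGRAPH_HOME"
./bin/gremlin.sh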

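2. Edit /data/janusgraph/conf/hadoop-graph/hadoop-script.properties
The main changes are: set inputFormat to match the format of the file being imported, set the HDFS path of the input file, set the path of the parse function, and set the Spark master to yarn-client.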

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
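# HDFS path of the file to import. It can also be set after this configuration file is loaded.
gremlin.hadoop.inputLocation=/user/yisou/taotian1/janus/data/fewData.test.dup
# Path of the parse function used to parse the HDFS file. It can also be set after this configuration file is loaded.
gremlin.hadoop.scriptInputFormat.script=/user/yisou/taotian1/janus/data/conf/vertex_parse.groovy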
#gremlin.hadoop.outputLocation=output
#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=6g
spark.executor.instances=10
spark.executor.cores=2
spark.serializer=org.apache.spark.serializer.KryoSerializer
# spark.kryo.registrationRequired=true
# spark.storage.memoryFraction=0.2
# spark.eventLog.enabled=true
# spark.eventLog.dir=/tmp/spark-event-logs
# spark.ui.killEnabled=true
#cache config
gremlin.spark.persistContext=true
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
#gremlin.spark.persistStorageLevel=DISK_ONLY
#####################################
# GiraphGraphComputer Configuration #
#####################################
giraph.minWorkers=2
giraph.maxWorkers=3
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
# giraph.maxPartitionsInMemory=1
# giraph.userPartitionCount=2
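
A misspelled key or a spark.master left on its default is easy to miss in a file this long, and the failure only shows up after the console has started. The sketch below is a small pre-flight check that could run before launching. The helper names prop_value and preflight are assumptions made for illustration and are not part of JanusGraph or TinkerPop.

```shell
#!/bin/sh
# prop_value FILE KEY
# Prints the value of KEY from a Java-style .properties file. Comment lines
# are skipped, whitespace around '=' is trimmed, and the last definition wins.
prop_value() {
  awk -v k="$2" '
    /^[ \t]*#/ { next }
    index($0, "=") > 0 {
      name = substr($0, 1, index($0, "=") - 1)
      gsub(/^[ \t]+|[ \t]+$/, "", name)
      if (name == k) {
        val = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", val)
        found = val
      }
    }
    END { if (found != "") print found }
  ' "$1"
}

# preflight FILE
# Returns 0 only if spark.master is yarn-client and both the input location
# and the parse script are set; otherwise prints one line per problem.
preflight() {
  status=0
  if [ "$(prop_value "$1" spark.master)" != "yarn-client" ]; then
    echo "spark.master is not yarn-client"
    status=1
  fi
  for key in gremlin.hadoop.inputLocation gremlin.hadoop.scriptInputFormat.script; do
    if [ -z "$(prop_value "$1" "$key")" ]; then
      echo "missing $key"
      status=1
    fi
  done
  return $status
}

# Example call (commented out):
# preflight /data/janusgraph/conf/hadoop-graph/hadoop-script.properties
```

Because the input location and the parse script can also be set from the console after the file is loaded, a "missing" line is a reminder here rather than a hard error.
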

Running the bulk load

Launch command:

sh /data/janusgraph/lib/yarn-gremlin.sh

Bulk-load commands:

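// Local JanusGraph install directory and HDFS working directory
local_root="/data/janusgraph"
hdfs_root="/user/yisou/taotian1/janus"
// Target graph: the JanusGraph (HBase + Elasticsearch) configuration to write into
social_graph="${local_root}/conf/janusgraph-hbase-es.properties"
// Source graph: a HadoopGraph that reads the input through ScriptInputFormat
graph = GraphFactory.open("${local_root}/conf/hadoop-graph/hadoop-script.properties")
// Override the input location and parse script from the properties file
graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/data/fewData.test.dup")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/conf/vertex_parse.groovy")
// Build the bulk loader and run it on Spark
blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()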

Output:

sh /data/janusgraph/lib/yarn-gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/yisou-hbase-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21:22:00,392 INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /data2/janusgraph-0.1.1-hadoop2/lib
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin>
gremlin> local_root="/data2/janusgraph-0.1.1-hadoop2/social"
==>/data2/janusgraph-0.1.1-hadoop2/social
gremlin> hdfs_root="/user/yisou/taotian1/janus"
==>/user/yisou/taotian1/janus
gremlin> social_graph="${local_root}/conf/janusgraph-hbase-es-social.properties"
==>/data2/janusgraph-0.1.1-hadoop2/social/conf/janusgraph-hbase-es-social.properties
gremlin> graph = GraphFactory.open("${local_root}/conf/hadoop-yarn.properties")
==>hadoopgraph[scriptinputformat->graphsonoutputformat]
gremlin> graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/tmp1person/")
==>null
gremlin> graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/person_parse.groovy")
==>null
gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader, vertexIdProperty=bulkLoader.vertex.id, userSuppliedIds=false, keepOriginalIds=true, batchSize=0]
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
21:25:04,666 INFO deprecation:1173 - mapred.reduce.child.java.opts is deprecated. Instead, use mapreduce.reduce.java.opts
21:25:04,667 INFO deprecation:1173 - mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts
21:25:04,680 INFO KryoShimServiceLoader:117 - Set KryoShimService provider to org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c (class org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService) because its priority value (0) is the highest available
21:25:04,680 INFO KryoShimServiceLoader:123 - Configuring KryoShimService provider org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c with user-provided configuration
21:25:10,479 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,505 INFO SparkContext:58 - Running Spark version 1.6.2
21:25:10,524 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,564 INFO SecurityManager:58 - Changing view acls to: yisou
21:25:10,565 INFO SecurityManager:58 - Changing modify acls to: yisou
21:25:10,566 INFO SecurityManager:58 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yisou); users with modify permissions: Set(yisou)
21:25:10,833 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,835 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:11,035 INFO Utils:58 - Successfully started service 'sparkDriver' on port 36502.
21:25:11,576 INFO Slf4jLogger:80 - Slf4jLogger started
21:25:11,646 INFO Remoting:74 - Starting remoting
............
21:25:20,736 INFO Client:58 - Submitting application 2727164 to ResourceManager
21:25:20,771 INFO YarnClientImpl:273 - Submitted application application_1466564207556_2727164
21:25:21,780 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:21,785 INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.yisou
start time: 1500297920750
final status: UNDEFINED
tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
21:25:22,787 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:23,789 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:24,791 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:25,793 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:39,585 INFO JettyUtils:58 - Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
21:25:39,823 INFO Client:58 - Application report for application_1466564207556_2727164 (state: RUNNING)
21:25:39,824 INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.130.1.50
ApplicationMaster RPC port: 0
queue: root.yisou
start time: 1500297920750
final status: UNDEFINED
tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
..........
21:25:42,864 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-codec-1.7.jar at http://10.130.64.69:38209/jars/commons-codec-1.7.jar with timestamp 1500297942864
21:25:42,866 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-lang-2.5.jar at http://10.130.64.69:38209/jars/commons-lang-2.5.jar with timestamp 1500297942866
21:25:42,869 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-collections-3.2.2.jar at http://10.130.64.69:38209/jars/commons-collections-3.2.2.jar with timestamp 1500297942869
21:25:42,872 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-io-2.3.jar at http://10.130.64.69:38209/jars/commons-io-2.3.jar with timestamp 1500297942872
21:25:42,874 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/jetty-util-6.1.26.jar at http://10.130.64.69:38209/jars/jetty-util-6.1.26.jar with timestamp 1500297942874
21:25:42,879 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/htrace-core-3.1.0-incubating.jar at http://10.130.64.69:38209/jars/htrace-core-3.1.0-incubating.jar with timestamp 1
............
21:26:14,751 INFO MapOutputTrackerMaster:58 - Size of output statuses for shuffle 2 is 146 bytes
21:26:14,767 INFO TaskSetManager:58 - Finished task 0.0 in stage 6.0 (TID 4) in 40 ms on c1-dn31.bdp.idc (1/1)
21:26:14,767 INFO YarnScheduler:58 - Removed TaskSet 6.0, whose tasks have all completed, from pool
21:26:14,767 INFO DAGScheduler:58 - ResultStage 6 (foreachPartition at SparkExecutor.java:173) finished in 0.042 s
21:26:14,768 INFO DAGScheduler:58 - Job 1 finished: foreachPartition at SparkExecutor.java:173, took 1.776125 s
21:26:14,775 INFO ShuffledRDD:58 - Removing RDD 2 from persistence list
21:26:14,785 INFO BlockManager:58 - Removing RDD 2
==>result[hadoopgraph[scriptinputformat->graphsonoutputformat],memory[size:0]]
gremlin> 21:26:22,515 INFO YarnClientSchedulerBackend:58 - Registered executor NettyRpcEndpointRef(null) (c1-dn9.bdp.idc:60762) with ID 8

Bulk-loading performance tuning

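Without tuning, JanusGraph bulk loading is very slow: importing 40 million records takes about 3.5 hours. With the changes below, that drops to about 1 hour.
1. Increase ids.block-size and storage.buffer-size (set in janusgraph-hbase-es.properties).
ids.block-size=100000000
storage.buffer-size=102400

2. Set the initial number of HBase regions (set in janusgraph-hbase-es.properties).
storage.hbase.region-count = 50

3. Import edges and vertices together in one pass, instead of splitting vertices and edges into separate files and importing them separately. For the format, see /data/janusgraph/data/grateful-dead.txt.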
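
If the source data already exists as separate vertex and edge files, it can be merged into one adjacency-style file before upload. The sketch below assumes a layout like the one I understand grateful-dead.txt to use: one vertex per line, followed by a tab-separated list of out-edges and a tab-separated list of in-edges, with edges joined by "|" and written as "edgeLabel,otherVertexId,weight". The two input layouts and the function name merge_adjacency are hypothetical; adjust them, and the parse script, to the real data.

```shell
#!/bin/sh
# merge_adjacency VERTICES EDGES
# VERTICES: one vertex per line, "id,label,name"             (assumed layout)
# EDGES:    one edge per line,   "outId,edgeLabel,inId,weight" (assumed layout)
# Output:   "vertex-fields<TAB>out-edges<TAB>in-edges", one line per vertex.
merge_adjacency() {
  awk -F, '
    FILENAME == ARGV[1] {            # first file on the awk command line: edges
      out[$1] = out[$1] (out[$1] == "" ? "" : "|") $2 "," $3 "," $4
      inc[$3] = inc[$3] (inc[$3] == "" ? "" : "|") $2 "," $1 "," $4
      next
    }
    {                                # second file: vertices
      printf "%s\t%s\t%s\n", $0, out[$1], inc[$1]
    }
  ' "$2" "$1"
}

# Example call (commented out; file names are placeholders):
# merge_adjacency vertices.csv edges.csv > merged.txt
```

All edges are held in memory while the vertex file is streamed, so this approach only suits inputs that fit on one machine. For larger data the same join would have to be done in a distributed job.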

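Summary

This article explained how to configure JanusGraph to bulk load vertices and edges in yarn-client mode.

The setup has two parts: the basic configuration and the bulk-loading configuration. In the basic configuration, watch for conflicts between the jars shipped with JanusGraph and the jars in your YARN environment; replace or delete the offending jars.

In the bulk-loading configuration, the key step is adding the Hadoop settings in gremlin.sh, so that the Hadoop environment is included in JAVA_OPTIONS and CLASSPATH.

(End)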


