Configuring JanusGraph Bulk Loading in Spark yarn-client Mode

Author: 膈应人的ID | Source: Internet | 2023-07-27 18:54

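JanusGraph is a distributed graph database forked from Titan. Its bulk loading (bulkload) runs Spark in local mode by default and does not support yarn-cluster mode. It does support yarn-client mode, but the official documentation does not explain how to configure it, and there are many pitfalls along the way. This article describes how to configure bulk loading in yarn-client mode.
It first covers the basic configuration, then the bulk-loading configuration, and finally bulk-loading performance tuning.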

Software versions used in this article:
janusgraph: 0.1.1
hbase: 1.1.2
hadoop: 2.7.1

Basic configuration

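1. Download JanusGraph from the official site and extract it to the local directory /data/janusgraph/.
2. Configure the graph database's storage and index backends. Since we use ES + HBase, edit /data/janusgraph/conf/janusgraph-hbase-es.properties directly: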

# Important
gremlin.graph=org.janusgraph.core.JanusGraphFactory

# HBase configuration
storage.batch-loading=true
storage.backend=hbase
storage.hostname=c1-nn1.bdp.idc,c1-nn2.bdp.idc,c1-nn3.bdp.idc
storage.hbase.ext.hbase.zookeeper.property.clientPort=2181
storage.hbase.table = yisou:test_graph

# Elasticsearch configuration
# ES is installed only on the local machine; the hostname below is this machine's IP.
index.search.backend=elasticsearch
index.search.hostname=10.120.64.69
index.search.elasticsearch.client-only=true
index.search.index-name=yisou_test_graph

# Default cache configuration
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5

3. Adjust the jars under /data/janusgraph/lib. A yarn-client bulk load runs into jar conflicts (guava among others), so I adjusted the jars under lib according to the conflicts I hit. Three jars were affected:

hbase-client-1.2.4.jar ==> yisou-hbase-1.0-SNAPSHOT.jar
The hbase-client-1.2.4.jar under lib uses a guava version that conflicts with the guava on our YARN cluster, so we used our company's in-house hbase-client build with guava removed, yisou-hbase-1.0-SNAPSHOT.jar.
Without the replacement, the error is "Caused by: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator"
spark-assembly-1.6.1-hadoop2.6.0.jar ==> spark-assembly-1.6.2-hadoop2.6.0.jar
The spark-assembly-1.6.1-hadoop2.6.0.jar shipped in lib also causes a guava conflict, so I replaced it with spark-assembly-1.6.2-hadoop2.6.0.jar.
Without the replacement, the error is "java.lang.NoSuchMethodError: groovy.lang.MetaClassImpl.hasCustomStaticInvokeMethod()Z"
Delete hbase-protocol-1.2.4.jar.
Without the deletion, the error is "com.google.protobuf.ServiceException: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.protobuf.generated.RPCProtos$ConnectionHeader$Builder.setVersionInfo(Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$VersionInfo;)Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$ConnectionHeader$Builder;"

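4. Configure the edge and vertex properties of the graph. See the official documentation for details; this article does not cover it.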
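
The jar changes from step 3 can be checked before a job is launched. The following is a minimal shell sketch, not part of JanusGraph: the function name check_lib_jars and its message format are hypothetical, and the jar names are the three listed in step 3 for JanusGraph 0.1.1, so other versions will need a different list.

```shell
# Sketch: report jars in a JanusGraph lib directory that this article
# replaces or deletes. check_lib_jars is an illustrative helper name.
check_lib_jars() {
    local lib_dir="$1"
    local status=0
    local jar
    # The three jars that step 3 replaces or removes.
    for jar in hbase-client-1.2.4.jar \
               spark-assembly-1.6.1-hadoop2.6.0.jar \
               hbase-protocol-1.2.4.jar; do
        if [ -e "$lib_dir/$jar" ]; then
            echo "CONFLICT: $jar is still present"
            status=1
        fi
    done
    # 0 when none of the listed jars remain, 1 otherwise.
    return "$status"
}
```

Calling check_lib_jars /data/janusgraph/lib prints one line per jar that is still present and returns a non-zero status, so it can gate a launch script.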

Bulk-loading configuration

Because the load program runs on YARN, it needs the Hadoop environment configuration. Two files have to be changed: the JanusGraph startup script /data/janusgraph/lib/gremlin.sh, and the Hadoop and Spark settings in /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.

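1. Copy /data/janusgraph/lib/gremlin.sh, naming the copy yarn-gremlin.sh for example. Then add the Hadoop configuration to JAVA_OPTIONS and CLASSPATH. This ensures the program can read the Hadoop configuration, so the Spark job starts correctly on YARN.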

#!/bin/bash
# HADOOP_HOME is exported first because HADOOP_CONF_DIR is derived from it.
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_OPTIONS="$JAVA_OPTIONS -Djava.library.path=$HADOOP_HOME/lib/native"
export CLASSPATH=$HADOOP_CONF_DIR
# JANUSGRAPH_HOME is the directory where JanusGraph is installed, /data/janusgraph/ in this article.
cd $JANUSGRAPH_HOME
./bin/gremlin.sh
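
Before launching the console, it can help to confirm that the Hadoop paths exported by the wrapper actually exist on the machine. The sketch below is an illustration only: check_hadoop_env is a hypothetical helper, and it assumes the standard Hadoop 2.x layout in which yarn-site.xml sits under etc/hadoop and the native libraries under lib/native.

```shell
# Sketch: verify the directories the wrapper script relies on.
# check_hadoop_env is an illustrative name, not a Hadoop or JanusGraph tool.
check_hadoop_env() {
    local hadoop_home="$1"
    local conf_dir="$hadoop_home/etc/hadoop"
    # Directory passed to -Djava.library.path in the wrapper.
    if [ ! -d "$hadoop_home/lib/native" ]; then
        echo "missing: $hadoop_home/lib/native"
        return 1
    fi
    # Assumption: the YARN client settings live in yarn-site.xml.
    if [ ! -f "$conf_dir/yarn-site.xml" ]; then
        echo "missing: $conf_dir/yarn-site.xml"
        return 1
    fi
    echo "ok: $conf_dir"
}
```

With the paths used in this article, the call would be check_hadoop_env /usr/local/hadoop-2.7.1.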

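2. Edit /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.
The main changes are: set inputFormat to match the format of the file being loaded, specify the HDFS path of the input file and the path of the parse function, and set the Spark master to yarn-client.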

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
# HDFS path of the file to load. It can also be set after this configuration file is loaded.
gremlin.hadoop.inputLocation=/user/yisou/taotian1/janus/data/fewData.test.dup
# Path of the parse function used to parse the HDFS file. It can also be set after this configuration file is loaded.
gremlin.hadoop.scriptInputFormat.script=/user/yisou/taotian1/janus/data/conf/vertex_parse.groovy
#gremlin.hadoop.outputLocation=output

#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=6g
spark.executor.instances=10
spark.executor.cores=2
spark.serializer=org.apache.spark.serializer.KryoSerializer
# spark.kryo.registrationRequired=true
# spark.storage.memoryFraction=0.2
# spark.eventLog.enabled=true
# spark.eventLog.dir=/tmp/spark-event-logs
# spark.ui.killEnabled=true

# cache config
gremlin.spark.persistContext=true
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
#gremlin.spark.persistStorageLevel=DISK_ONLY

#####################################
# GiraphGraphComputer Configuration #
#####################################
giraph.minWorkers=2
giraph.maxWorkers=3
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
# giraph.maxPartitionsInMemory=1
# giraph.userPartitionCount=2

Running the bulk load

Start command:

sh /data/janusgraph/lib/yarn-gremlin.sh

Bulk-load commands:

local_root="/data/janusgraph"
hdfs_root="/user/yisou/taotian1/janus"
social_graph="${local_root}/conf/janusgraph-hbase-es.properties"
graph = GraphFactory.open("${local_root}/conf/hadoop-graph/hadoop-script.properties")
graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/data/fewData.test.dup")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/conf/vertex_parse.groovy")
blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()

Output:

sh /data/janusgraph/lib/yarn-gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/yisou-hbase-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21:22:00,392 INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /data2/janusgraph-0.1.1-hadoop2/lib
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin>
gremlin> local_root="/data2/janusgraph-0.1.1-hadoop2/social"
==>/data2/janusgraph-0.1.1-hadoop2/social
gremlin> hdfs_root="/user/yisou/taotian1/janus"
==>/user/yisou/taotian1/janus
gremlin> social_graph="${local_root}/conf/janusgraph-hbase-es-social.properties"
==>/data2/janusgraph-0.1.1-hadoop2/social/conf/janusgraph-hbase-es-social.properties
gremlin> graph = GraphFactory.open("${local_root}/conf/hadoop-yarn.properties")
==>hadoopgraph[scriptinputformat->graphsonoutputformat]
gremlin> graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/tmp1person/")
==>null
gremlin> graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/person_parse.groovy")
==>null
gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader, vertexIdProperty=bulkLoader.vertex.id, userSuppliedIds=false, keepOriginalIds=true, batchSize=0]
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
21:25:04,666 INFO deprecation:1173 - mapred.reduce.child.java.opts is deprecated. Instead, use mapreduce.reduce.java.opts
21:25:04,667 INFO deprecation:1173 - mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts
21:25:04,680 INFO KryoShimServiceLoader:117 - Set KryoShimService provider to org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c (class org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService) because its priority value (0) is the highest available
21:25:04,680 INFO KryoShimServiceLoader:123 - Configuring KryoShimService provider org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c with user-provided configuration
21:25:10,479 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,505 INFO SparkContext:58 - Running Spark version 1.6.2
21:25:10,524 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,564 INFO SecurityManager:58 - Changing view acls to: yisou
21:25:10,565 INFO SecurityManager:58 - Changing modify acls to: yisou
21:25:10,566 INFO SecurityManager:58 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yisou); users with modify permissions: Set(yisou)
21:25:10,833 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,835 WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:11,035 INFO Utils:58 - Successfully started service 'sparkDriver' on port 36502.
21:25:11,576 INFO Slf4jLogger:80 - Slf4jLogger started
21:25:11,646 INFO Remoting:74 - Starting remoting
............
21:25:20,736 INFO Client:58 - Submitting application 2727164 to ResourceManager
21:25:20,771 INFO YarnClientImpl:273 - Submitted application application_1466564207556_2727164
21:25:21,780 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:21,785 INFO Client:58 -
    client token: N/A
    diagnostics: N/A
    ApplicationMaster host: N/A
    ApplicationMaster RPC port: -1
    queue: root.yisou
    start time: 1500297920750
    final status: UNDEFINED
    tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
21:25:22,787 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:23,789 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:24,791 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:25,793 INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:39,585 INFO JettyUtils:58 - Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
21:25:39,823 INFO Client:58 - Application report for application_1466564207556_2727164 (state: RUNNING)
21:25:39,824 INFO Client:58 -
    client token: N/A
    diagnostics: N/A
    ApplicationMaster host: 10.130.1.50
    ApplicationMaster RPC port: 0
    queue: root.yisou
    start time: 1500297920750
    final status: UNDEFINED
    tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
..........
21:25:42,864 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-codec-1.7.jar at http://10.130.64.69:38209/jars/commons-codec-1.7.jar with timestamp 1500297942864
21:25:42,866 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-lang-2.5.jar at http://10.130.64.69:38209/jars/commons-lang-2.5.jar with timestamp 1500297942866
21:25:42,869 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-collections-3.2.2.jar at http://10.130.64.69:38209/jars/commons-collections-3.2.2.jar with timestamp 1500297942869
21:25:42,872 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-io-2.3.jar at http://10.130.64.69:38209/jars/commons-io-2.3.jar with timestamp 1500297942872
21:25:42,874 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/jetty-util-6.1.26.jar at http://10.130.64.69:38209/jars/jetty-util-6.1.26.jar with timestamp 1500297942874
21:25:42,879 INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/htrace-core-3.1.0-incubating.jar at http://10.130.64.69:38209/jars/htrace-core-3.1.0-incubating.jar with timestamp 1 ............
21:26:14,751 INFO MapOutputTrackerMaster:58 - Size of output statuses for shuffle 2 is 146 bytes
21:26:14,767 INFO TaskSetManager:58 - Finished task 0.0 in stage 6.0 (TID 4) in 40 ms on c1-dn31.bdp.idc (1/1)
21:26:14,767 INFO YarnScheduler:58 - Removed TaskSet 6.0, whose tasks have all completed, from pool
21:26:14,767 INFO DAGScheduler:58 - ResultStage 6 (foreachPartition at SparkExecutor.java:173) finished in 0.042 s
21:26:14,768 INFO DAGScheduler:58 - Job 1 finished: foreachPartition at SparkExecutor.java:173, took 1.776125 s
21:26:14,775 INFO ShuffledRDD:58 - Removing RDD 2 from persistence list
21:26:14,785 INFO BlockManager:58 - Removing RDD 2
==>result[hadoopgraph[scriptinputformat->graphsonoutputformat],memory[size:0]]
gremlin> 21:26:22,515 INFO YarnClientSchedulerBackend:58 - Registered executor NettyRpcEndpointRef(null) (c1-dn9.bdp.idc:60762) with ID 8

Bulk-load performance tuning

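Without tuning, JanusGraph bulk loading is very slow: loading 40 million records takes about 3.5 hours. After tuning, this drops to about 1 hour.
1. Increase the ids.block-size and storage.buffer-size parameters (set in janusgraph-hbase-es.properties).
ids.block-size=100000000
storage.buffer-size=102400

2. Specify the initial number of HBase regions (set in janusgraph-hbase-es.properties).
storage.hbase.region-count = 50

3. Load edges and vertices together, rather than splitting vertices and edges into separate files and loading them separately. For the file format, see /data/janusgraph/data/grateful-dead.txt.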
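
The property changes in steps 1 and 2 can be applied repeatably rather than by hand. The following shell sketch is one possible way to do it, under stated assumptions: set_property and apply_bulk_load_tuning are hypothetical helper names, and the values are simply the ones used in this article, not recommended defaults for other clusters.

```shell
# Sketch: set key=value in a Java properties file, replacing any
# existing definition of the key. Helper names are illustrative.
set_property() {
    local file="$1"
    local key="$2"
    local value="$3"
    local pattern
    local tmp
    # Escape dots so the key is matched literally by grep.
    pattern="^$(printf '%s' "$key" | sed 's/\./\\./g')[[:space:]]*="
    tmp=$(mktemp)
    # Copy every line except an existing definition of the key
    # (grep exits non-zero when nothing is left, which is fine here).
    grep -v -E "$pattern" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so the file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Apply the three tuning values from this article to one file.
apply_bulk_load_tuning() {
    local file="$1"
    set_property "$file" ids.block-size 100000000
    set_property "$file" storage.buffer-size 102400
    set_property "$file" storage.hbase.region-count 50
}
```

For the setup in this article the call would be apply_bulk_load_tuning /data/janusgraph/conf/janusgraph-hbase-es.properties. Running it twice leaves a single definition per key, although a replaced key moves to the end of the file.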

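Summary

This article explained how to configure JanusGraph to bulk load vertices and edges in yarn-client mode.

The setup has two parts: the basic configuration and the bulk-loading configuration. In the basic configuration, watch for conflicts between the jars shipped with JanusGraph and the jars in your YARN environment; replace or delete the offending jars.

In the bulk-loading configuration, the key step is adding the Hadoop settings to gremlin.sh, so that the Hadoop environment is included in JAVA_OPTIONS and CLASSPATH.

(End)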
