
Archival and Analytics

May 16, 2014

By Severalnines

We won’t bore you with buzzwords like volume, velocity and variety. This post is for MySQL users who want to get their hands dirty with Hadoop, so roll up your sleeves and prepare for work. Why would you ever want to move MySQL data into Hadoop? One good reason is archival and analytics. You might not want to delete old data, but rather move it into Hadoop and make it available for further analysis at a later stage.

In this post, we are going to deploy a Hadoop cluster and export data in bulk from a Galera Cluster using Apache Sqoop. Sqoop is a well-proven approach for bulk data loading from a relational database into the Hadoop File System. There is also the Hadoop Applier, available from MySQL Labs, which works by retrieving INSERT statements from the MySQL master binlog and writing them into a file in HDFS in real time (yes, it applies INSERTs only).

We will use Apache Ambari to deploy Hadoop (HDP 2.1) on three servers. We have a clustered WordPress site running on Galera, and for the purpose of this blog, we will export some user data to Hadoop for archiving. The database name is wordpress, and we will use Sqoop to import the data into a Hive table on HDFS. The following diagram illustrates our setup:

The ClusterControl node has an HAProxy instance installed to load balance Galera connections, listening on port 33306.
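For reference, the HAProxy listener for Galera might look something like this in /etc/haproxy/haproxy.cfg. This is a minimal sketch, not the exact configuration deployed by ClusterControl; the backend name and balancing options are illustrative:

listen mysql_galera
    bind *:33306
    mode tcp
    balance leastconn
    option tcpka
    # each Galera node can serve both reads and writes
    server galera1 192.168.0.101:3306 check
    server galera2 192.168.0.102:3306 check
    server galera3 192.168.0.103:3306 check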

Prerequisites

All hosts are running CentOS 6.5 with the firewall and SELinux turned off. All servers' clocks are synchronized against an NTP server. Hostnames must be FQDNs, or you must define all hosts in the /etc/hosts file on every node. Each host has been configured with the following host definitions:

192.168.0.100		clustercontrol haproxy mysql
192.168.0.101		mysql1 galera1
192.168.0.102		mysql2 galera2
192.168.0.103		mysql3 galera3
192.168.0.111		hadoop1 hadoop1.cluster.com
192.168.0.112		hadoop2 hadoop2.cluster.com
192.168.0.113		hadoop3 hadoop3.cluster.com
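As mentioned above, the firewall and SELinux should be off. If they are still enabled on any host, commands like the following disable both on CentOS 6 (a quick sketch; adapt it to your own security policy):

$ service iptables stop
$ chkconfig iptables off
$ setenforce 0  # disable SELinux until next reboot
$ sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persist the setting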

Create an SSH key and configure passwordless SSH from hadoop1 to the other Hadoop nodes, so that Ambari Server can automate the deployment. On hadoop1, run the following commands as root:

$ ssh-keygen -t rsa # press Enter for all prompts
$ ssh-copy-id -i ~/.ssh/id_rsa hadoop1.cluster.com
$ ssh-copy-id -i ~/.ssh/id_rsa hadoop2.cluster.com
$ ssh-copy-id -i ~/.ssh/id_rsa hadoop3.cluster.com
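Before handing the key to Ambari, it is worth verifying that passwordless SSH really works. A loop like the following should print each hostname without prompting for a password:

$ for h in hadoop1 hadoop2 hadoop3; do ssh $h.cluster.com hostname; done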

On all Hadoop hosts, install and configure NTP:

$ yum install ntp -y
$ chkconfig ntpd on
$ service ntpd start
$ ntpdate -u se.pool.ntp.org

Deploying Hadoop

1. Install Ambari Server on one of the Hadoop nodes (we chose hadoop1.cluster.com); it will drive the deployment of the Hadoop cluster. Configure the Ambari repository for CentOS 6 and start the installation:

$ cd /etc/yum.repos.d
$ wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.5.1/ambari.repo
$ yum -y install ambari-server

2. Set up and start ambari-server:

$ ambari-server setup # accept all default values on prompt
$ ambari-server start

Give Ambari a few minutes to bootstrap before accessing the web interface at port 8080.
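If you are unsure whether it is ready, a simple curl check should return HTTP status 200 once the web interface is up (assuming you run it from a host that can resolve hadoop1.cluster.com):

$ curl -s -o /dev/null -w '%{http_code}\n' http://hadoop1.cluster.com:8080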

3. Open a web browser and navigate to http://hadoop1.cluster.com:8080. Log in with username and password 'admin'. This is the Ambari dashboard, and it will guide us through the deployment. Assign a cluster name and click Next.

4. At the Select Stack step, choose HDP 2.1:

5. Specify all Hadoop hosts in the Target Hosts field. Upload the SSH private key that we generated in the Prerequisites section and click Register and Confirm:

6. This page confirms that Ambari has located the correct hosts for your Hadoop cluster. Ambari will check those hosts to make sure they have the correct directories, packages, and processes to continue the installation. Click Next to proceed.

7. If you have enough resources, just go ahead and install all services:

8. On the Assign Masters page, we let Ambari choose the configuration for us before clicking Next:

9. On the Assign Slaves and Clients page, we'll enable all clients and slaves on each of our Hadoop hosts:

10. Hive, Oozie and Nagios might require further input, such as a database password and an administrator email. Specify the needed information accordingly and click Next.

11. You will be able to review your configuration selection before clicking Deploy to start the deployment:

When 'Successfully installed and started the services' appears, choose Next. On the summary page, choose Complete. Hadoop installation and deployment is now complete. Verify that all services are running correctly.
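Besides the dashboard, service states can also be queried from the command line through Ambari's REST API. A sketch, assuming the cluster was named 'mycluster' in step 3 and the default admin credentials are still in place; each service should report a state of STARTED:

$ curl -s -u admin:admin 'http://hadoop1.cluster.com:8080/api/v1/clusters/mycluster/services?fields=ServiceInfo/state'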

We can now proceed to import some data from our Galera cluster as described in the next section.

Importing MySQL Data into Hive using Sqoop

Before importing any MySQL data, we need to create a target table in Hive. This table will have a definition similar to the source table in MySQL, since we are importing all columns at once. Here is the MySQL CREATE TABLE statement:

CREATE TABLE `wp_users` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user_login` varchar(60) NOT NULL DEFAULT '',
  `user_pass` varchar(64) NOT NULL DEFAULT '',
  `user_nicename` varchar(50) NOT NULL DEFAULT '',
  `user_email` varchar(100) NOT NULL DEFAULT '',
  `user_url` varchar(100) NOT NULL DEFAULT '',
  `user_registered` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `user_activation_key` varchar(60) NOT NULL DEFAULT '',
  `user_status` int(11) NOT NULL DEFAULT '0',
  `display_name` varchar(250) NOT NULL DEFAULT '',
  PRIMARY KEY (`ID`),
  KEY `user_login_key` (`user_login`),
  KEY `user_nicename` (`user_nicename`)
) ENGINE=InnoDB AUTO_INCREMENT=5864 DEFAULT CHARSET=utf8

SSH into any Hadoop node (since we installed the Hadoop clients on all nodes) and switch to the hdfs user:

$ su - hdfs

Enter the Hive console:

$ hive

Create a Hive database and table similar to our MySQL table (Hive does not support the DATETIME data type, so we replace it with TIMESTAMP):

hive> CREATE SCHEMA wordpress;
hive> SHOW DATABASES;
OK
default
wordpress
hive> USE wordpress;
hive> CREATE EXTERNAL TABLE IF NOT EXISTS users (
    ID BIGINT,
    user_login VARCHAR(60),
    user_pass VARCHAR(64),
    user_nicename VARCHAR(50),
    user_email VARCHAR(100),
    user_url VARCHAR(100),
    user_registered TIMESTAMP,
    user_activation_key VARCHAR(60),
    user_status INT,
    display_name VARCHAR(250)
);
hive> exit;
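To double-check the column mapping before running the import, DESCRIBE shows the Hive-side definition; the output should list the ten columns with the types declared above:

hive> DESCRIBE wordpress.users;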

Now we can start to import the wp_users MySQL table into Hive's users table, connecting to the MySQL nodes through HAProxy (port 33306):

$ sqoop import \
--connect jdbc:mysql://192.168.0.100:33306/wordpress \
--username=wordpress \
--password=password \
--table=wp_users \
--hive-import \
--hive-table=wordpress.users \
--target-dir=wp_users_import \
--direct

We can track the import progress from the Sqoop output:

..
INFO mapreduce.ImportJobBase: Beginning import of wp_users
..
INFO mapreduce.Job: Job job_1400142750135_0020 completed successfully
..
OK
Time taken: 10.035 seconds
Loading data to table wordpress.users
Table wordpress.users stats: [numFiles=5, numRows=0, totalSize=240814, rawDataSize=0]
OK
Time taken: 3.666 seconds

You should see that an HDFS directory wp_users_import has been created (as specified by --target-dir in the Sqoop command), and we can browse its files using the following commands:

$ hdfs dfs -ls
$ hdfs dfs -ls wp_users_import
$ hdfs dfs -cat wp_users_import/part-m-00000 | more

Now let’s check our imported data inside Hive:

$ hive -e 'SELECT * FROM wordpress.users LIMIT 10'
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
OK
2	admin	$P$BzaV8cFzeGpBODLqCmWp3uOtc5dVRb.	admin	my@email.com		2014-05-15 12:53:12		0	admin
5	SteveJones	$P$BciftXXIPbAhaWuO4bFb4LVUN24qay0	SteveJones	demouser2@54.254.93.50		2014-05-15 12:57:59		0	Steve
8	JanetGarrett	$P$BEp8IY1zvvrIdtPzDiU9D/br.FtzFa1	JanetGarrett	demouser3@54.254.93.50		2014-05-15 12:57:59		0	Janet
11	AnnWalker	$P$B1wix5Xn/15o06BWyHa.r/cZ0rwUWQ/	AnnWalker	demouser4@54.254.93.50		2014-05-15 12:57:59		0	Ann
14	DeborahFields	$P$B5PouJkJdfAucdz9p8NaKtS9WoKJu01	DeborahFields	demouser5@54.254.93.50		2014-05-15 12:57:59		0	Deborah
17	ChristopherMitchell	$P$Bi/VWI1W4iP7h9mC0SXd4f.kKWnilH/	ChristopherMitchell	demouser6@54.254.93.50		2014-05-15 12:57:59		0	Christopher
20	HenryHolmes	$P$BrPHv/ZHb7IBYzFpKgauBl/2WPZAC81	HenryHolmes	demouser7@54.254.93.50		2014-05-15 12:58:00		0	Henry
23	DavidWard	$P$BVYg0SFTihdXwDhushveet4n2Eitxp1	DavidWard	demouser8@54.254.93.50		2014-05-15 12:58:00		0	David
26	WilliamMurray	$P$Bc8FmkMadsQZCsW4L5Vo8Xax2ex8we.	WilliamMurray	demouser9@54.254.93.50		2014-05-15 12:58:00		0	William
29	KellyHarris	$P$Bc85yvlxvWQ4XxkeAgJRugOqm6S6au.	KellyHarris	demouser10@54.254.93.50		2014-05-15 12:58:00		0	Kelly
Time taken: 16.282 seconds, Fetched: 10 row(s)

Nice! Our data now exists both in Galera and in Hadoop. You can also use the --query option in Sqoop to filter the data that you want to export to Hadoop using an SQL query, as shown below. This is a basic example of how we can start to leverage Hadoop for archival and analytics. Welcome to big data!
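For example, if we only wanted to archive users registered before 2014, a free-form query could replace --table. Here is a sketch (the cut-off date and target directory name are illustrative); note that Sqoop requires the literal $CONDITIONS token in the WHERE clause and a --split-by column when importing with a free-form query:

$ sqoop import \
--connect jdbc:mysql://192.168.0.100:33306/wordpress \
--username=wordpress \
--password=password \
--query 'SELECT * FROM wp_users WHERE user_registered < "2014-01-01" AND $CONDITIONS' \
--split-by ID \
--hive-import \
--hive-table=wordpress.users \
--target-dir=wp_users_query_import

For recurring archival runs, Sqoop's incremental mode (--incremental append --check-column ID --last-value <n>) can be used so that only rows added since the previous import are transferred.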

References

  • Sqoop User Guide (v1.4.2) http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html
  • Hortonworks Data Platform Documentation http://docs.hortonworks.com/HDPDocuments/Ambari-1.5.1.0/bk_using_Ambari_book/content/index.html
