    Running Hadoop On Ubuntu Linux (Single-Node Cluster)

    by Michael G. Noll on August 5, 2007 (last updated: October 28, 2011)

    In this tutorial, I will describe how to setup a single-node Hadoop cluster.

    Table of Contents:
    • What we want to do
    • Prerequisites
    • Sun Java 6
    • Adding a dedicated Hadoop system user
    • Configuring SSH
    • Disabling IPv6
    • Alternative
    • Hadoop
    • Installation
    • Update $HOME/.bashrc
    • Excursus: Hadoop Distributed File System (HDFS)
    • Configuration
    • hadoop-env.sh
    • conf/*-site.xml
    • Formatting the HDFS filesystem via the NameNode
    • Starting your single-node cluster
    • Stopping your single-node cluster
    • Running a MapReduce job
    • Download example input data
    • Restart the Hadoop cluster
    • Copy local example data to HDFS
    • Run the MapReduce job
    • Retrieve the job result from HDFS
    • Hadoop Web Interfaces
    • MapReduce Job Tracker Web Interface
    • Task Tracker Web Interface
    • HDFS Name Node Web Interface
    • What’s next?
    • Related Links
    • Change Log


    What we want to do

    In this short tutorial, I will describe the required steps for setting up a single-node Hadoop cluster using the Hadoop Distributed File System (HDFS) on Ubuntu Linux.

    Are you looking for the multi-node cluster tutorial? Just head over there.

    Hadoop is a framework written in Java for running applications on large clusters of commodity hardware and incorporates features similar to those of the Google File System and of MapReduce. HDFS is a highly fault-tolerant distributed file system and like Hadoop designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications that have large data sets.


    Cluster of machines running Hadoop at Yahoo! (Source: Yahoo!)

    The main goal of this tutorial is to get a ”simple” Hadoop installation up and running so that you can play around with the software and learn more about it.

    This tutorial has been tested with the following software versions:

    • Ubuntu Linux 10.04 LTS (deprecated: 8.10, 8.04 LTS, 7.10, 7.04)
    • Hadoop 0.20.2, released February 2010 (deprecated: 0.13.x – 0.19.x)

    The time of the last document update is shown at the top of this page.

    Prerequisites

    Sun Java 6

    Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x (aka 6.0.x aka 6) is recommended for running Hadoop. For the sake of this tutorial, I will therefore describe the installation of Java 1.6.

    In Ubuntu 10.04 LTS, the package sun-java6-jdk has been dropped from the Multiverse section of the Ubuntu archive. You have to perform the following four steps to install the package.

    1. Add the Canonical Partner Repository to your apt repositories:

    $ sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"

    2. Update the source list

    $ sudo apt-get update

    3. Install sun-java6-jdk

    $ sudo apt-get install sun-java6-jdk

    4. Select Sun’s Java as the default on your machine.

    $ sudo update-java-alternatives -s java-6-sun

    The full JDK will be placed in /usr/lib/jvm/java-6-sun (on Ubuntu, this directory is actually a symlink).

    After installation, make a quick check whether Sun’s JDK is correctly set up:

    user@ubuntu:~# java -version
    java version "1.6.0_20"
    Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
    Java HotSpot(TM) Client VM (build 16.3-b01, mixed mode, sharing)

    Adding a dedicated Hadoop system user

    We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc).

    $ sudo addgroup hadoop
    $ sudo adduser --ingroup hadoop hduser

    This will add the user hduser and the group hadoop to your local machine.

    Configuring SSH

    Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

    I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several guides available.

    First, we have to generate an SSH key for the hduser user.

    user@ubuntu:~$ su - hduser
    hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
    Created directory '/home/hduser/.ssh'.
    Your identification has been saved in /home/hduser/.ssh/id_rsa.
    Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
    The key fingerprint is:
    9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
    The key's randomart image is:
    [...snipp...]
    hduser@ubuntu:~$

    The second line will create an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed to unlock the key without your interaction (you don’t want to enter the passphrase every time Hadoop interacts with its nodes).

    Second, you have to enable SSH access to your local machine with this newly created key.

    hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

    The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).

    hduser@ubuntu:~$ ssh localhost
    The authenticity of host 'localhost (::1)' can't be established.
    RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
    Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
    Ubuntu 10.04 LTS
    [...snipp...]
    hduser@ubuntu:~$

    If the SSH connection fails, these general tips might help:

    • Enable debugging with ssh -vvv localhost and investigate the error in detail.
    • Check the SSH server configuration in /etc/ssh/sshd_config, in particular the options PubkeyAuthentication (which should be set to yes) and AllowUsers (if this option is active, add the hduser user to it). If you made any changes to the SSH server configuration file, you can force a configuration reload with sudo /etc/init.d/ssh reload.
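
    And if your local machine runs SSH on a non-standard port, as mentioned earlier, a host-specific entry in $HOME/.ssh/config might look like the following sketch (the port number 2222 is purely illustrative):

    # $HOME/.ssh/config -- example host-specific settings (hypothetical values)
    Host localhost
        HostName localhost
        Port 2222
        User hduser
        IdentityFile ~/.ssh/id_rsa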

    Disabling IPv6

    One problem with IPv6 on Ubuntu is that using 0.0.0.0 for the various networking-related Hadoop configuration options will result in Hadoop binding to the IPv6 addresses of my Ubuntu box.
    In my case, I realized that there’s no practical point in enabling IPv6 on a box when you are not connected to any IPv6 network. Hence, I simply disabled IPv6 on my Ubuntu machine. Your mileage may vary.

    To disable IPv6 on Ubuntu 10.04 LTS, open /etc/sysctl.conf in the editor of your choice and add the following lines to the end of the file:

    # disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

    You have to reboot your machine in order to make the changes take effect. Alternatively, you can run sudo sysctl -p to apply the new settings immediately, without a reboot.

    You can check whether IPv6 is enabled on your machine with the following command:

    $ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    A return value of 0 means IPv6 is enabled, a value of 1 means disabled (that’s what we want).

    Alternative

    You can also disable IPv6 only for Hadoop as documented in HADOOP-3437. You can do so by adding the following line to conf/hadoop-env.sh:

    export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
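
    If you prefer the command line over an editor, the same line can be appended like this; the sketch assumes the installation path /usr/local/hadoop that is used in the next section:

    $ echo 'export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true' >> /usr/local/hadoop/conf/hadoop-env.sh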

    Hadoop

    Installation

    You have to download Hadoop from the Apache Download Mirrors and extract the contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop. Make sure to change the owner of all the files to the hduser user and hadoop group, for example:

    $ cd /usr/local
    $ sudo tar xzf hadoop-0.20.2.tar.gz
    $ sudo mv hadoop-0.20.2 hadoop
    $ sudo chown -R hduser:hadoop hadoop

    (Just to give you the idea, YMMV — personally, I create a symlink from hadoop-0.20.2 to hadoop.)

    Update $HOME/.bashrc

    Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a shell other than bash, you should of course update its appropriate configuration files instead of .bashrc.

    # Set Hadoop-related environment variables
    export HADOOP_HOME=/usr/local/hadoop

    # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
    export JAVA_HOME=/usr/lib/jvm/java-6-sun

    # Some convenient aliases and functions for running Hadoop-related commands
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"

    # If you have LZO compression enabled in your Hadoop cluster and
    # compress job outputs with LZOP (not covered in this tutorial):
    # Conveniently inspect an LZOP compressed file from the command
    # line; run via:
    #
    # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
    #
    # Requires installed 'lzop' command.
    #
    lzohead () {
        hadoop fs -cat $1 | lzop -dc | head -1000 | less
    }

    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin

    You can repeat this exercise also for other users who want to use Hadoop.
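
    To make sure the new settings are picked up, you can re-read the file in the current shell and check a few values; the output below is what you should see with the paths used in this tutorial:

    hduser@ubuntu:~$ source $HOME/.bashrc
    hduser@ubuntu:~$ echo $HADOOP_HOME
    /usr/local/hadoop
    hduser@ubuntu:~$ echo $JAVA_HOME
    /usr/lib/jvm/java-6-sun
    hduser@ubuntu:~$ which hadoop
    /usr/local/hadoop/bin/hadoop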

    Excursus: Hadoop Distributed File System (HDFS)

    From The Hadoop Distributed File System: Architecture and Design:

    The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. HDFS is part of the Apache Hadoop project, which is part of the Apache Lucene project.

    The following picture gives an overview of the most important HDFS components.


    HDFS Architecture (source: http://hadoop.apache.org/core/docs/current/hdfs_design.html)

    Configuration

    Our goal in this tutorial is a single-node setup of Hadoop. More information about what we do in this section is available on the Hadoop Wiki.

    hadoop-env.sh

    The only required environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Open conf/hadoop-env.sh in the editor of your choice (if you used the installation path in this tutorial, the full path is /usr/local/hadoop/conf/hadoop-env.sh) and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.

    Change

    # The java implementation to use.  Required.
    # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

    to

    # The java implementation to use.  Required.
    export JAVA_HOME=/usr/lib/jvm/java-6-sun

    conf/*-site.xml

    Note: As of Hadoop 0.20.0, the configuration settings previously found in hadoop-site.xml were moved to core-site.xml (hadoop.tmp.dir, fs.default.name), mapred-site.xml (mapred.job.tracker) and hdfs-site.xml (dfs.replication).

    In this section, we will configure the directory where Hadoop will store its data files, the network ports it listens to, etc. Our setup will use Hadoop’s Distributed File System, HDFS, even though our little “cluster” only contains our single local machine.

    You can leave the settings below ”as is” with the exception of the hadoop.tmp.dir variable which you have to change to the directory of your choice. We will use the directory /app/hadoop/tmp in this tutorial. Hadoop’s default configurations use hadoop.tmp.dir as the base temporary directory both for the local file system and HDFS, so don’t be surprised if you see Hadoop creating the specified directory automatically on HDFS at some later point.
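
    For reference, and going from memory of conf/hadoop-default.xml in the 0.20.x releases (double-check against your own copy), these are some of the paths Hadoop derives from hadoop.tmp.dir:

    dfs.name.dir      = ${hadoop.tmp.dir}/dfs/name      # NameNode metadata
    dfs.data.dir      = ${hadoop.tmp.dir}/dfs/data      # DataNode block storage
    mapred.local.dir  = ${hadoop.tmp.dir}/mapred/local  # MapReduce scratch space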

    Now we create the directory and set the required ownerships and permissions:

    $ sudo mkdir -p /app/hadoop/tmp
    $ sudo chown hduser:hadoop /app/hadoop/tmp
    # ...and if you want to tighten up security, chmod from 755 to 750...
    $ sudo chmod 750 /app/hadoop/tmp

    If you forget to set the required ownerships and permissions, you will see a java.io.IOException when you try to format the name node in the next section.

    Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.

    In file conf/core-site.xml:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>

    In file conf/mapred-site.xml:

    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>

    In file conf/hdfs-site.xml:

    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>

    See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.
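
    Before moving on, you can optionally check that the three edited files are still well-formed XML. This quick check is not part of the original setup and assumes the xmllint tool is installed (package libxml2-utils on Ubuntu); xmllint prints nothing when a file is well-formed and reports the offending line otherwise:

    hduser@ubuntu:/usr/local/hadoop$ xmllint --noout conf/core-site.xml
    hduser@ubuntu:/usr/local/hadoop$ xmllint --noout conf/mapred-site.xml
    hduser@ubuntu:/usr/local/hadoop$ xmllint --noout conf/hdfs-site.xml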

    Formatting the HDFS filesystem via the NameNode

    The first step to starting up your Hadoop installation is formatting the Hadoop filesystem which is implemented on top of the local filesystem of your “cluster” (which includes only your local machine if you followed this tutorial). You need to do this the first time you set up a Hadoop cluster.

    Do not format a running Hadoop filesystem as you will lose all the data currently in the cluster (in HDFS).

    To format the filesystem (which simply initializes the directory specified by the dfs.name.dir variable), run the command

    hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

    The output will look like this:

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
    10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = ubuntu/127.0.1.1
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 0.20.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
    ************************************************************/
    10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
    10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
    10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
    10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
    10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
    10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
    ************************************************************/
    hduser@ubuntu:/usr/local/hadoop$

    Starting your single-node cluster

    Run the command:

    hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

    This will start up a NameNode, a DataNode, a SecondaryNameNode, a JobTracker and a TaskTracker on your machine.

    The output will look like this:

    hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
    starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
    localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
    localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
    starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
    localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
    hduser@ubuntu:/usr/local/hadoop$

    A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun’s Java since v1.5.0). See also How to debug MapReduce programs.

    hduser@ubuntu:/usr/local/hadoop$ jps
    2287 TaskTracker
    2149 JobTracker
    1938 DataNode
    2085 SecondaryNameNode
    2349 Jps
    1788 NameNode

    You can also check with netstat if Hadoop is listening on the configured ports.

    hduser@ubuntu:~$ sudo netstat -plten | grep java
    tcp   0  0 0.0.0.0:50070   0.0.0.0:*  LISTEN  1001  9236  2471/java
    tcp   0  0 0.0.0.0:50010   0.0.0.0:*  LISTEN  1001  9998  2628/java
    tcp   0  0 0.0.0.0:48159   0.0.0.0:*  LISTEN  1001  8496  2628/java
    tcp   0  0 0.0.0.0:53121   0.0.0.0:*  LISTEN  1001  9228  2857/java
    tcp   0  0 127.0.0.1:54310 0.0.0.0:*  LISTEN  1001  8143  2471/java
    tcp   0  0 127.0.0.1:54311 0.0.0.0:*  LISTEN  1001  9230  2857/java
    tcp   0  0 0.0.0.0:59305   0.0.0.0:*  LISTEN  1001  8141  2471/java
    tcp   0  0 0.0.0.0:50060   0.0.0.0:*  LISTEN  1001  9857  3005/java
    tcp   0  0 0.0.0.0:49900   0.0.0.0:*  LISTEN  1001  9037  2785/java
    tcp   0  0 0.0.0.0:50030   0.0.0.0:*  LISTEN  1001  9773  2857/java
    hduser@ubuntu:~$

    If there are any errors, examine the log files in the logs/ directory (/usr/local/hadoop/logs/ if you followed this tutorial).
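
    For example, assuming the installation path from this tutorial and the usual hadoop-<user>-<daemon>-<hostname>.log naming scheme, you could inspect the NameNode log like this:

    hduser@ubuntu:~$ ls /usr/local/hadoop/logs/
    hduser@ubuntu:~$ tail -n 50 /usr/local/hadoop/logs/hadoop-hduser-namenode-ubuntu.log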

    Stopping your single-node cluster

    Run the command

    hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh

    to stop all the daemons running on your machine.

    Exemplary output:

    hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
    stopping jobtracker
    localhost: stopping tasktracker
    stopping namenode
    localhost: stopping datanode
    localhost: stopping secondarynamenode
    hduser@ubuntu:/usr/local/hadoop$

    Running a MapReduce job

    We will now run our first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about what happens behind the scenes is available at the Hadoop Wiki.

    Download example input data

    We will use three ebooks from Project Gutenberg for this example:

    • The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
    • The Notebooks of Leonardo Da Vinci
    • Ulysses by James Joyce

    Download each ebook in Plain Text UTF-8 encoding and store the files in a temporary directory of your choice, for example /tmp/gutenberg.
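
    A possible way to do this from the command line is sketched below; the URLs are placeholders, so copy the actual "Plain Text UTF-8" link from each book's Project Gutenberg page before running the commands:

    $ mkdir -p /tmp/gutenberg
    $ cd /tmp/gutenberg
    $ wget -O pg20417.txt '<plain-text-utf8-url-for-The-Outline-of-Science>'
    $ wget -O pg5000.txt  '<plain-text-utf8-url-for-The-Notebooks-of-Leonardo-Da-Vinci>'
    $ wget -O pg4300.txt  '<plain-text-utf8-url-for-Ulysses>'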

    hduser@ubuntu:~$ ls -l /tmp/gutenberg/
    total 3604
    -rw-r--r-- 1 hduser hadoop  674566 Feb  3 10:17 pg20417.txt
    -rw-r--r-- 1 hduser hadoop 1573112 Feb  3 10:18 pg4300.txt
    -rw-r--r-- 1 hduser hadoop 1423801 Feb  3 10:18 pg5000.txt
    hduser@ubuntu:~$

    Restart the Hadoop cluster

    Restart your Hadoop cluster if it’s not running already.

    hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

    Copy local example data to HDFS

    Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS.

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
    Found 1 items
    drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:40 /user/hduser/gutenberg
    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg
    Found 3 items
    -rw-r--r--   3 hduser supergroup     674566 2011-03-10 11:38 /user/hduser/gutenberg/pg20417.txt
    -rw-r--r--   3 hduser supergroup    1573112 2011-03-10 11:38 /user/hduser/gutenberg/pg4300.txt
    -rw-r--r--   3 hduser supergroup    1423801 2011-03-10 11:38 /user/hduser/gutenberg/pg5000.txt
    hduser@ubuntu:/usr/local/hadoop$

    Run the MapReduce job

    Now, we actually run the WordCount example job.

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output

    This command will read all the files in the HDFS directory /user/hduser/gutenberg, process them, and store the result in the HDFS directory /user/hduser/gutenberg-output.

    Note: Some people run the command above and get the following error message:

    Exception in thread "main" java.io.IOException: Error opening job jar: hadoop*examples*.jar
            at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
    Caused by: java.util.zip.ZipException: error in opening zip file

    In this case, re-run the command with the full name of the Hadoop Examples JAR file, for example:

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-examples-0.20.203.0.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output 

    Exemplary output of the previous command in the console:

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output
    10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
    10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
    10/05/08 17:43:02 INFO mapred.JobClient:  map 0% reduce 0%
    10/05/08 17:43:14 INFO mapred.JobClient:  map 66% reduce 0%
    10/05/08 17:43:17 INFO mapred.JobClient:  map 100% reduce 0%
    10/05/08 17:43:26 INFO mapred.JobClient:  map 100% reduce 100%
    10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
    10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
    10/05/08 17:43:28 INFO mapred.JobClient:   Job Counters
    10/05/08 17:43:28 INFO mapred.JobClient:     Launched reduce tasks=1
    10/05/08 17:43:28 INFO mapred.JobClient:     Launched map tasks=3
    10/05/08 17:43:28 INFO mapred.JobClient:     Data-local map tasks=3
    10/05/08 17:43:28 INFO mapred.JobClient:   FileSystemCounters
    10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_READ=2214026
    10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_READ=3639512
    10/05/08 17:43:28 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3687918
    10/05/08 17:43:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880330
    10/05/08 17:43:28 INFO mapred.JobClient:   Map-Reduce Framework
    10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input groups=82290
    10/05/08 17:43:28 INFO mapred.JobClient:     Combine output records=102286
    10/05/08 17:43:28 INFO mapred.JobClient:     Map input records=77934
    10/05/08 17:43:28 INFO mapred.JobClient:     Reduce shuffle bytes=1473796
    10/05/08 17:43:28 INFO mapred.JobClient:     Reduce output records=82290
    10/05/08 17:43:28 INFO mapred.JobClient:     Spilled Records=255874
    10/05/08 17:43:28 INFO mapred.JobClient:     Map output bytes=6076267
    10/05/08 17:43:28 INFO mapred.JobClient:     Combine input records=629187
    10/05/08 17:43:28 INFO mapred.JobClient:     Map output records=629187
    10/05/08 17:43:28 INFO mapred.JobClient:     Reduce input records=102286

    Check if the result is successfully stored in HDFS directory /user/hduser/gutenberg-output:

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser
    Found 2 items
    drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:40 /user/hduser/gutenberg
    drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:43 /user/hduser/gutenberg-output
    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /user/hduser/gutenberg-output
    Found 2 items
    drwxr-xr-x   - hduser supergroup          0 2010-05-08 17:43 /user/hduser/gutenberg-output/_logs
    -rw-r--r--   1 hduser supergroup     880802 2010-05-08 17:43 /user/hduser/gutenberg-output/part-r-00000
    hduser@ubuntu:/usr/local/hadoop$

    If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop*examples*.jar wordcount -D mapred.reduce.tasks=16 /user/hduser/gutenberg /user/hduser/gutenberg-output

    An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond considering it a hint. But it accepts the user specified mapred.reduce.tasks and doesn’t manipulate that. You cannot force mapred.map.tasks but you can specify mapred.reduce.tasks.

    Retrieve the job result from HDFS

    To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command

    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /user/hduser/gutenberg-output/part-r-00000

    to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.

    hduser@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge /user/hduser/gutenberg-output /tmp/gutenberg-output
    hduser@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
    "(Lo)cra"       1
    "1490   1
    "1498," 1
    "35"    1
    "40,"   1
    "A      2
    "AS-IS".        1
    "A_     1
    "Absoluti       1
    "Alack! 1
    hduser@ubuntu:/usr/local/hadoop$

    Note that in this specific output the quote signs (") enclosing the words have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.

    The command fs -getmerge will simply concatenate any files it finds in the directory you specify. This means that the merged file might (and most likely will) not be sorted.
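
    If you are curious about the most frequent words, a quick way to get them is to sort the merged, tab-separated file numerically on its second column; this little extra step is not part of the original tutorial:

    hduser@ubuntu:/usr/local/hadoop$ sort -t $'\t' -k2 -nr /tmp/gutenberg-output/gutenberg-output | head -n 20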

    Hadoop Web Interfaces

    Hadoop comes with several web interfaces which are by default (see conf/hadoop-default.xml) available at these locations:

    • http://localhost:50030/ – web UI for MapReduce job tracker(s)
    • http://localhost:50060/ – web UI for task tracker(s)
    • http://localhost:50070/ – web UI for HDFS name node(s)

    These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.
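
    A quick way to verify from the shell that all three interfaces are reachable is to probe the ports with curl (assuming curl is installed; an HTTP status code of 200 means the UI answered):

    hduser@ubuntu:~$ for port in 50030 50060 50070; do
    >   curl -s -o /dev/null -w "port $port -> HTTP %{http_code}\n" http://localhost:$port/
    > done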

    MapReduce Job Tracker Web Interface

    The job tracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the ”local machine’s” Hadoop log files (the machine on which the web UI is running).

    By default, it’s available at http://localhost:50030/.


    A screenshot of Hadoop's Job Tracker web interface.

    Task Tracker Web Interface

    The task tracker web UI shows you running and non-running tasks. It also gives access to the ”local machine’s” Hadoop log files.

    By default, it’s available at http://localhost:50060/.


    A screenshot of Hadoop's Task Tracker web interface.

    HDFS Name Node Web Interface

    The name node web UI shows you a cluster summary including information about total/remaining capacity, live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the ”local machine’s” Hadoop log files.

    By default, it’s available at http://localhost:50070/.


    A screenshot of Hadoop's Name Node web interface.

    What’s next?

    If you’re feeling comfortable, you can continue your Hadoop experience with my follow-up tutorial Running Hadoop On Ubuntu Linux (Multi-Node Cluster), where I describe how to build a Hadoop ”multi-node” cluster with two Ubuntu boxes (this will increase your current cluster size by 100%).

    In addition, I wrote a tutorial on how to code a simple MapReduce job in the Python programming language which can serve as the basis for writing your own MapReduce programs.

    Related Links

    From yours truly:

    • Running Hadoop On Ubuntu Linux (Multi-Node Cluster)
    • Writing An Hadoop MapReduce Program In Python

    From other people:

    • Hadoop home page
    • Project Description @ Hadoop Wiki
    • Getting Started with Hadoop @ Hadoop Wiki
    • How to debug MapReduce programs @ Hadoop Wiki
    • Hadoop API Overview

    Change Log

    Only important changes to this article are listed here:

    • 2011-07-17: Renamed the Hadoop user from hadoop to hduser based on readers’ feedback. This should make the distinction between the local Hadoop user (now hduser), the local Hadoop group (hadoop), and the Hadoop CLI tool (hadoop) more clear.

