Build and Install Hadoop 2.7.2 on Windows

刚刚在Windows10 + Visual Studio 2015 环境下配置了Hadoop Single Node Cluster,下面是主要的步骤.

Build Hadoop Core

这部分主要参照Hadoop 2.7.1 for Windows 10 binary build with Visual Studio 2015 (unofficial)

环境准备

D盘下新建目录D:\Hadoop 用来存放所有的Hadoop配置相关文件

安装Java

下载地址: Java SE Development Kit 8u73 Windows x64
安装地址: D:\Hadoop\jdk1.8.0_73
将环境变量 JAVA_HOME 设置为jdk的位置 D:\Hadoop\jdk1.8.0_73

Getting Hadoop sources

下载地址: hadoop-2.7.2-src.tar.gz
解压地址: D:\Hadoop\hadoop-2.7.2-src

安装其他依赖

打开 D:\Hadoop\hadoop-2.7.2-src\BUILDING.txt ,里面列出了其他的Requirements:

  • Maven 3.0 or later
  • ProtocolBuffer 2.5.0
  • CMake 2.6 or newer
  • zlib headers
  • Unix command-line tools from GnuWin32: sh, mkdir, rm, cp, tar, gzip. These tools must be present on your PATH.

Maven

下载地址: apache-maven-3.3.9-bin.tar.gz
解压地址: D:\Hadoop\apache-maven-3.3.9
将 D:\Hadoop\apache-maven-3.3.9\bin 添加到PATH环境变量中

ProtocolBuffer 2.5.0

下载地址: protocol-2.5.0-win32.zip
解压地址: D:\Hadoop\protoc-2.5.0-win32
将 D:\Hadoop\protoc-2.5.0-win32 添加到PATH环境变量中

CMake 3.4.3

下载地址: cmake-3.5.0-rc2-win32-x86.msi
安装时记得勾选添加到PATH环境变量

zlib headers

下载地址: zlib128-dll.zip
解压地址: D:\Hadoop\zlib128-dll
在环境变量中添加ZLIB_HOME,值为D:\Hadoop\zlib128-dll\include

Unix command-line tools from

根据BUILDING.txt,这个tool可以在安装git的时候顺带安装
下载地址: git-2.7.1
安装的时候记得勾选 “Use Git and optional Unix tools from the Windows Command Prompt”

配置MSBuild

将C:\Windows\Microsoft.NET\Framework64\v4.0.30319添加到PATH环境变量中

更新VS Project文件

用Visual Studio 2015打开下面两个solution,右键solution,选择Retarget Projects

  • D:\Hadoop\hadoop-2.7.2-src\hadoop-common-project\hadoop-common\src\main\winutils\winutils.sln
  • D:\Hadoop\hadoop-2.7.2-src\hadoop-common-project\hadoop-common\src\main\native\native.sln

更新编译选项

打开 D:\Hadoop\hadoop-2.7.2-src\hadoop-hdfs-project\hadoop-hdfs\pom.xml ,将下面这一行改为Visual Studio 2015的形式

修改前

1
<condition property="generator" value="Visual Studio 10" else="Visual Studio 10 Win64">

修改后

1
<condition property="generator" value="Visual Studio 10" else="Visual Studio 14 2015 Win64">

Build Package files

启动 Developer Command Prompt for VS2015

在 D:\Hadoop\hadoop-2.7.2-src 下执行下面的命令设置 Platform environment variable

1
set Platform=x64

然后执行如下命令开始build

1
mvn package -Pdist,native-win -DskipTests -Dtar

build过程中如果出现 OutOfMemoryError,通过下面的命令 assign more memory,然后重新build

1
set MAVEN_OPTS=-Xmx512m -XX:MaxPermSize=128m

如果出现jni.h找不到的错误,可以将下面三个文件复制到 C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include 下

1
2
3
D:\Hadoop\jdk1.8.0_73\include\jni.h
D:\Hadoop\jdk1.8.0_73\include\win32\jawt_md.h
D:\Hadoop\jdk1.8.0_73\include\win32\jni_md.h

Copy Package files

Build成功后,将 D:\Hadoop\hadoop-2.7.2-src\hadoop-dist\target\ 下的hadoop-2.7.2文件夹复制到 D:\Hadoop\hadoop-2.7.2

Start a Single Node (pseudo-distributed) Cluster

接下来就可以参照Build and Install Hadoop 2.x or newer on Windows,配置Single Node Cluster,注意Command Prompt必须具有Admin权限,否则在执行yarn指令时会报错 “A required priviledge is not held by the client”

在IntelliJ IDEA中单机调试Hadoop程序

这部分内容可以参照HOW-TO: COMPILE AND DEBUG HADOOP APPLICATIONS WITH INTELLIJ IDEA IN WIDNOWS OS(64BIT)

注意的地方是WordCount例程中不要忘记下面这一行,否则Class not found

1
job.setJarByClass(WordCount.class);

Intellij IDEA生成Jar包

这部分内容参照Intellij IDEA 搭建Hadoop开发环境

  • 选择菜单File->Project Structure,弹出Project Structure的设置对话框
  • 选择左边的Artifacts后点击上方的“+”按钮
  • 在弹出的框中选择jar->from moduls with dependencies..
  • 选择要启动的类,然后 确定
  • 应用之后,对话框消失。在IDEA选择菜单Build->Build Artifacts,选择Build或者Rebuild后即可生成,生成的jar文件位于工程项目目录的out/artifacts下

要注意的地方是修改args[0]和args[1],将原先WordCount中的下面两行

1
2
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

改为

FileInputFormat.addInputPath(job, new Path(args[1]));
FileOutputFormat.setOutputPath(job, new Path(args[2]));

原因可以参考这里org.apache.hadoop.mapred.FileAlreadyExistsException

Reference

Thank you for making the world a better place!