Build Nutch 1.4 cluster with Hadoop

The current released version of Apache Nutch is 1.4. Since Nutch 1.3, the release package no longer bundles a Hadoop distribution, so I had to build a Hadoop cluster separately first and then configure Nutch 1.4 to work with it. My servers run Ubuntu 10.04 LTS, and I have two of them, named cluster1 and cluster2. I’ll note the steps here.

Preparation

First of all, download Apache Nutch 1.4 from

http://nutch.apache.org/

and Hadoop 0.20.2 from

http://hadoop.apache.org/common/releases.html#Download.

Create the user “nutch” on both cluster1 and cluster2:

$ sudo adduser nutch

Enter a password for user “nutch”. Next, configure SSH so that the master can reach the slaves without a password. We use cluster1 as the master and cluster2 as the slave; I will set up the working environment on cluster1 and then copy it to cluster2. I assume that OpenSSH and a JDK are already installed on both servers. Log in to cluster1 as user “nutch”, create an SSH key, and copy the public key to cluster2:

$ ssh-keygen -t rsa -P ""

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

$ scp ~/.ssh/authorized_keys nutch@cluster2:/home/nutch/.ssh/authorized_keys
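
If the login from cluster1 to cluster2 still asks for a password, file permissions are a common cause: sshd's StrictModes option (on by default) typically ignores authorized_keys when the file or the .ssh directory is writable by other users. The helper below is a minimal sketch and not part of the original setup; the function name fix_ssh_perms is my own, and the modes 700 and 600 are the commonly recommended ones.

```shell
# Hypothetical helper: create an SSH directory if needed and tighten its
# permissions. $1 is the directory to fix; it defaults to ~/.ssh.
fix_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" || return 1
    touch "$ssh_dir/authorized_keys" || return 1
    chmod 700 "$ssh_dir" || return 1           # only the owner may enter
    chmod 600 "$ssh_dir/authorized_keys"       # only the owner may read/write
}
```

Running fix_ssh_perms as user “nutch” on cluster2 before the scp step also makes sure that /home/nutch/.ssh exists there.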

Now edit the /etc/hosts file on both servers and add two lines like these, using each server’s real IP address:

x.x.x.x    cluster1
y.y.y.y    cluster2
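
A quick way to confirm that both names made it into the hosts file is sketched below. This is my own illustration, not something from the original setup: hosts_has_entry is a hypothetical function name, and it only looks at the file's text rather than testing real name resolution.

```shell
# Hypothetical helper: succeed if a hosts file maps some address to a name.
# $1 = path to the hosts file, $2 = hostname to look for.
hosts_has_entry() {
    awk -v name="$2" '
        /^[ \t]*#/ { next }                  # skip whole-line comments
        {
            for (i = 2; i <= NF; i++) {      # field 1 is the IP address
                if ($i ~ /^#/) break         # rest of the line is a comment
                if ($i == name) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example, `hosts_has_entry /etc/hosts cluster2 || echo "cluster2 is missing"` can be run on each server after editing.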

Install Hadoop 0.20.2

I configured Hadoop following these two posts:

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

The differences were that I put Hadoop in /home/nutch/hadoop and that my conf/masters file contains:

cluster1

My conf/slaves file’s content:

cluster1

cluster2

Now change the contents of conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml

conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>hadoop.tmp.dir</name>
<value>/home/nutch/hadoop/tmp</value>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://cluster1:9000</value>
</property>
</configuration>

conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>dfs.name.dir</name>
<value>/home/nutch/filesystem/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/nutch/filesystem/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>

conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>cluster1:9001</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/home/nutch/filesystem/mapreduce/system</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>/home/nutch/filesystem/mapreduce/local</value>
</property>
</configuration>
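
Before copying the configuration anywhere, it can help to read a value back out of the files and check it against what you intended, for example that fs.default.name points at cluster1. The following is a rough sketch under a stated assumption: each <name> and <value> element sits on a single line, as in the files above. The function name hadoop_prop is hypothetical, and this is plain text matching, not a real XML parser.

```shell
# Hypothetical helper: print the value of one property from a Hadoop
# *-site.xml file. $1 = config file, $2 = property name.
# Assumes every <name>...</name> and <value>...</value> fits on one line.
hadoop_prop() {
    awk -v name="$2" '
        /<name>/ {
            n = $0
            sub(/.*<name>/, "", n)       # drop everything up to the tag
            sub(/<\/name>.*/, "", n)     # drop the closing tag and the rest
        }
        /<value>/ && n == name {
            v = $0
            sub(/.*<value>/, "", v)
            sub(/<\/value>.*/, "", v)
            print v
            exit
        }
    ' "$1"
}
```

For example, `hadoop_prop ~/hadoop/conf/core-site.xml fs.default.name` prints the configured NameNode URI, and it prints nothing when the property is absent.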

After configuring hadoop on cluster1, I copied the hadoop folder to cluster2:

$ scp -r ~/hadoop nutch@cluster2:/home/nutch/
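
With more than one slave, the same copy has to be repeated for every host in conf/slaves except the master itself. The sketch below only prints the scp commands so they can be reviewed first; print_sync_commands is a name I made up, and the user “nutch” and target path /home/nutch/ are taken from the setup above.

```shell
# Hypothetical helper: print one scp command per slave, skipping blank
# lines and the local host.
# $1 = slaves file, $2 = directory to copy, $3 = name of this host.
print_sync_commands() {
    slaves_file="$1"; src_dir="$2"; self="$3"
    # "|| [ -n "$host" ]" also handles a last line without a newline
    while read -r host || [ -n "$host" ]; do
        if [ -z "$host" ] || [ "$host" = "$self" ]; then
            continue
        fi
        echo "scp -r $src_dir nutch@$host:/home/nutch/"
    done < "$slaves_file"
}
```

Once the printed commands look right, they can be executed with `print_sync_commands ~/hadoop/conf/slaves ~/hadoop cluster1 | sh`.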

Install Nutch 1.4

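Unpack Nutch 1.4 into /home/nutch/nutch. The target directory must exist before tar can extract into it, and --strip-components=1 drops the archive’s top-level directory (assuming it has one, as Apache release archives usually do) so that the files land directly in ~/nutch:

$ mkdir -p ~/nutch

$ tar xzvf apache-nutch-1.4-bin.tar.gz -C ~/nutch --strip-components=1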

Copy conf/nutch-default.xml to conf/nutch-site.xml and edit conf/nutch-site.xml:

$ cd ~/nutch

$ cp conf/nutch-default.xml conf/nutch-site.xml && vim conf/nutch-site.xml

Search for the http.agent.name key and set its value to your crawler’s name.
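
As an alternative to editing a full copy of nutch-default.xml, nutch-site.xml can hold only the properties you want to override, since Nutch reads the defaults first and then applies the site file on top. The sketch below writes such a minimal file. The function name write_nutch_site is hypothetical, the crawler name is whatever you pass in, and the value is not XML-escaped, so keep it to plain letters and digits.

```shell
# Hypothetical helper: write a minimal nutch-site.xml that overrides only
# http.agent.name. $1 = output file, $2 = crawler name (not XML-escaped).
write_nutch_site() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>$2</value>
</property>
</configuration>
EOF
}
```

For example, `write_nutch_site ~/nutch/conf/nutch-site.xml MyCrawler`, where MyCrawler is a placeholder name.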

Then copy hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, masters and slaves from hadoop/conf to nutch/conf. After that, copy nutch/conf to nutch/runtime/local/conf:

$ cd ~/hadoop/conf

$ cp hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml masters slaves ~/nutch/conf/

$ cd ~/nutch

$ cp conf/* runtime/local/conf/

Here is the key point

We have to rebuild Nutch 1.4 with ant. Otherwise we get an error saying something like http.agent.name is not set, even though we have edited our nutch-site.xml file. Now run ant to rebuild Nutch:

$ export CLASSPATH=.:/home/nutch/nutch/runtime/local/lib

$ ant

After ant finishes, copy nutch-1.4.job and nutch-1.4.jar to the deploy and local workspaces:

$ cp build/nutch-1.4.job runtime/deploy/

$ cp build/nutch-1.4.jar runtime/local/lib/
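
If ant fails partway through, the two cp commands above will fail or, worse, leave an old artifact in place. A small guard that checks for both build outputs before copying anything might look like the sketch below. The function name install_nutch_artifacts is my own; the file names and target directories are the ones used above.

```shell
# Hypothetical helper: copy the rebuilt artifacts only if both exist.
# $1 = Nutch home directory; it defaults to ~/nutch.
install_nutch_artifacts() {
    nutch_home="${1:-$HOME/nutch}"
    for f in nutch-1.4.job nutch-1.4.jar; do
        if [ ! -f "$nutch_home/build/$f" ]; then
            echo "missing build/$f; did ant finish successfully?" >&2
            return 1
        fi
    done
    cp "$nutch_home/build/nutch-1.4.job" "$nutch_home/runtime/deploy/" &&
    cp "$nutch_home/build/nutch-1.4.jar" "$nutch_home/runtime/local/lib/"
}
```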

Now the cluster has been built successfully, and we can use it just as we did with Nutch 1.2. The only difference is that Hadoop’s start-all.sh is no longer in Nutch’s bin folder; run it from $HADOOP_HOME/bin instead.
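
Because start-all.sh now lives outside the Nutch tree, a thin wrapper that fails clearly when HADOOP_HOME is unset or wrong can save some confusion. This is only a sketch: start_hadoop is a hypothetical name, and in this setup HADOOP_HOME would presumably be /home/nutch/hadoop.

```shell
# Hypothetical helper: run Hadoop's start-all.sh from $HADOOP_HOME/bin,
# with explicit errors when the variable or the script is missing.
start_hadoop() {
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$HADOOP_HOME/bin/start-all.sh" ]; then
        echo "cannot execute $HADOOP_HOME/bin/start-all.sh" >&2
        return 1
    fi
    "$HADOOP_HOME/bin/start-all.sh"
}
```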
