How to set up Nutch 2 with Cassandra 3 on Ubuntu?

Let’s see how to set up Nutch 2 with Cassandra 3 as the NoSQL database, running on Ubuntu.

Platform :

  • Ubuntu 14.04.2
  • VirtualBox

Python

  • Install Python : 

  • Then run :

  • Validate installation :

  • This opens the Python console, where you can execute Python code :

  • To exit it :

Oracle JDK

Download Oracle JDK 1.8.0 (Ex: jdk1.8.0_65)
Link :

  • Extract content into a folder. Ex :

  • Define $JAVA_HOME as an environment variable and add its “bin” folder to the PATH.
    First, edit the .profile file :

  • Add these 2 instructions at the bottom :

  • Save the file (Ctrl+S), then reload the new PATH.

  • Validate that PATH was updated :

// TODO photo java-installed.png

  • Verify the exported environment variable :
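
The two export lines are not shown above. A minimal sketch of the profile change, assuming the JDK was extracted to /opt/dev/jdk1.8.0_65 (a hypothetical location, chosen only to match the other /opt/dev paths in this guide):

```shell
# Assumption: the JDK was extracted here; adjust to your own folder.
JDK_DIR=/opt/dev/jdk1.8.0_65

# Append the two export lines to the bottom of ~/.profile.
# \$ keeps $PATH and $JAVA_HOME literal so they expand at login time.
cat >> "$HOME/.profile" <<EOF
export JAVA_HOME=$JDK_DIR
export PATH=\$PATH:\$JAVA_HOME/bin
EOF

# Reload the profile in the current shell and show the variable.
. "$HOME/.profile"
echo "$JAVA_HOME"
```

Once the profile is reloaded, java -version should report the installed JDK, provided the folder above really contains it.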

Cassandra

Cassandra 3.1.1

  • Dependencies : Java 8 + Python 2.7
  • Download Cassandra 3.1.1
  • Run Cassandra :

// TODO cassandra-no-rpc-to-fix.png

  • Stop Cassandra :

  • Edit Cassandra configuration file :

  • Activate RPC, by changing :
    From : start_rpc: false
    To : start_rpc: true

  • Rerun Cassandra :

// TODO cassandra-rpc-enabled.png
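
The start_rpc change can also be made without opening an editor. The sketch below assumes GNU sed (the Ubuntu default) and the cassandra.yaml path used elsewhere in this guide; enable_rpc is a hypothetical helper name, not part of Cassandra:

```shell
# Path used elsewhere in this guide; adjust if Cassandra lives elsewhere.
CASSANDRA_YAML=/opt/dev/apache-cassandra-3.1.1/conf/cassandra.yaml

# Hypothetical helper: flip start_rpc from false to true in the given file.
enable_rpc() {
  sed -i 's/^start_rpc:[[:space:]]*false/start_rpc: true/' "$1"
}

# Only touch the file if it exists at the assumed path.
if [ -f "$CASSANDRA_YAML" ]; then
  enable_rpc "$CASSANDRA_YAML"
  grep '^start_rpc' "$CASSANDRA_YAML"
fi
```

Cassandra has to be restarted for the change to take effect. Running bin/cassandra -f keeps it in the foreground, so it can be stopped with Ctrl+C.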

Remark 1 :
If you need to access Cassandra from another machine, or from the host when Cassandra runs in a VM,
you must change “localhost” to the appropriate hostname or IP address (check the IP address of the right interface) :

$ gedit /opt/dev/apache-cassandra-3.1.1/conf/cassandra.yaml

From : rpc_address: localhost
To : rpc_address: 192.168.56.1

From : listen_address: localhost
To : listen_address: 192.168.56.1

$ gedit /opt/dev/apache-cassandra-3.1.1/conf/gora-cassandra-mapping.xml

(!) Remark 3 covers the matching settings needed on the Nutch side when Cassandra is remote.
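
As a sketch, both address settings in cassandra.yaml can be changed in one pass. This assumes GNU sed; set_cassandra_address is a hypothetical helper name, and 192.168.56.1 is simply the example address from this remark:

```shell
CASSANDRA_YAML=/opt/dev/apache-cassandra-3.1.1/conf/cassandra.yaml

# Hypothetical helper: point rpc_address and listen_address at one address.
# $1 = path to cassandra.yaml, $2 = IP address or hostname.
set_cassandra_address() {
  sed -i \
    -e "s/^rpc_address:.*/rpc_address: $2/" \
    -e "s/^listen_address:.*/listen_address: $2/" \
    "$1"
}

# Only touch the file if it exists at the assumed path.
if [ -f "$CASSANDRA_YAML" ]; then
  set_cassandra_address "$CASSANDRA_YAML" 192.168.56.1
fi
```

Only lines that start with rpc_address: or listen_address: are rewritten, so commented-out entries in the file are left alone.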

Remark 2 :
If your Python version is older than 2.7.10 :
Download Pyreadline 2.1
At : https://pypi.python.org/pypi/pyreadline#downloads
Full Link : https://pypi.python.org/packages/any/p/pyreadline/pyreadline-2.1.win-amd64.exe#md5=3e467d8c16c60d46847039d173fecc49
Otherwise cqlsh will report errors about missing pyreadline dependencies :
cqlsh: prompt for install of pyreadline if missing during cqlsh init

Ant

  • Dependencies : Java
  • Download Ant 1.9.6
    Link :
  • Extract content into a folder. Ex :
    /opt/dev/apache-ant-1.9.6
  • Add the Ant bin directory to PATH (you don’t need to export an ANT_HOME env var).
    First, edit the .profile file :
  • Add these 2 instructions at the bottom :
  • Save the file (Ctrl+S), then quit.
  • Reload PATH :
  • Validate that PATH was updated :

// TODO photo ant-installed.png
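
The two profile lines for Ant are not shown above. One plausible pair, assuming the extraction folder named in this section and a plain (unexported) helper variable called ANT_DIR, which is a hypothetical name:

```shell
# Extraction folder named in this section.
ANT_DIR=/opt/dev/apache-ant-1.9.6

# Append two lines to ~/.profile: a plain variable and the PATH update.
cat >> "$HOME/.profile" <<EOF
ANT_DIR=$ANT_DIR
export PATH=\$PATH:\$ANT_DIR/bin
EOF

# Reload the profile, then print the Ant entry if it is now on PATH.
. "$HOME/.profile"
echo "$PATH" | tr ':' '\n' | grep -Fx "$ANT_DIR/bin"
```

If the last command prints the Ant bin path, ant -version should work from any directory.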

Nutch

Dependencies : Java + Ant + Cassandra

  • Download Nutch 2.3 (apache-nutch-2.3-src.zip)
    Link :
  • Extract content into a folder. Ex :

  • Edit Ivy Configuration File :

  • Uncomment this code :

  • Edit Gora Properties :

  • Uncomment this code :

  • Edit Nutch Site Configuration File :
    $ sudo gedit /opt/dev/nutch/apache-nutch-2.3/conf/nutch-site.xml

Initially the <configuration> element is empty, so add the required properties inside it :
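
The original property list is not shown here. As a hedged sketch, the two properties a Cassandra-backed Nutch 2 setup typically needs are storage.data.store.class (pointing at Gora's CassandraStore) and http.agent.name. The agent value below is a placeholder, and the file is written to the current directory so it can be reviewed first:

```shell
# Write a candidate nutch-site.xml into the current directory for review.
cat > nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Store crawl data in Cassandra through Gora. -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.cassandra.store.CassandraStore</value>
  </property>
  <!-- The fetcher refuses to run without an agent name; placeholder value. -->
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
```

After checking the content, copy it over /opt/dev/nutch/apache-nutch-2.3/conf/nutch-site.xml before building.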

  • Build Nutch 2.3 :

  • Define the URLs to crawl :

  • Add this link :
    http://nutch.apache.org/

  • Crawl :
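
The seed and crawl steps can be sketched as follows. This assumes the build was done with ant runtime and that the commands are run from the resulting runtime/local folder. The urls directory, the seed.txt file name and the -topN value are illustrative choices, and the bin/nutch options are written as I recall them for Nutch 2.3, so check the usage text printed by bin/nutch before relying on them:

```shell
# Run from the Nutch runtime folder (assumed: runtime/local after "ant runtime").

# Seed list: one URL per line.
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt

# One crawl round, step by step. Skipped if bin/nutch is not present here.
if [ -x bin/nutch ]; then
  bin/nutch inject urls          # load the seed URLs into the store
  bin/nutch generate -topN 10    # select up to 10 URLs to fetch
  bin/nutch fetch -all           # download the selected pages
  bin/nutch parse -all           # extract text and outlinks
  bin/nutch updatedb -all        # write newly found links back
fi
```

Repeating the generate, fetch, parse and updatedb steps crawls one level deeper each time.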

Remark 3 :
If Cassandra runs on a different machine,
you must change “localhost” to the appropriate hostname in these places :

* gora.properties :

* gora-cassandra-mapping.xml :

* gora-cassandra-mapping.xml :
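
For the gora.properties part, a minimal sketch, assuming the server entry is the gora.cassandrastore.servers property (the name as I recall it from the default Nutch 2 gora.properties) and that Cassandra listens on the default Thrift port 9160. set_gora_server is a hypothetical helper name, and the address is the example from Remark 1:

```shell
GORA_PROPS=/opt/dev/nutch/apache-nutch-2.3/conf/gora.properties

# Hypothetical helper: set the Cassandra server entry (host:port),
# whether or not the line is still commented out.
# $1 = path to gora.properties, $2 = host:port
set_gora_server() {
  sed -i "s/^#\{0,1\}[[:space:]]*gora\.cassandrastore\.servers=.*/gora.cassandrastore.servers=$2/" "$1"
}

# Only touch the file if it exists at the assumed path.
if [ -f "$GORA_PROPS" ]; then
  set_gora_server "$GORA_PROPS" 192.168.56.1:9160
fi
```

The same host value then needs to replace “localhost” wherever it appears in gora-cassandra-mapping.xml.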