Installing RapidMiner Radoop on RapidMiner Studio
RapidMiner Radoop is client software with an easy-to-use graphical interface for processing and analyzing big data on a Hadoop cluster. It can be installed on RapidMiner Studio and/or RapidMiner Server, and provides a platform for editing and running ETL, data analytics, and machine learning processes in a Hadoop environment. RapidMiner Radoop runs on any platform that supports Java.
Integrating RapidMiner Radoop into the RapidMiner advanced analytics suite is as easy as downloading the extension and making some configuration changes. The following instructions describe the process for installing the RapidMiner Radoop extension.
The installation instructions assume that you have completed the following tasks. If any of these prerequisites have not yet been met, be sure to finish them before proceeding with the installation.
Verifying port availability for RapidMiner Radoop
RapidMiner Radoop requires access to a variety of ports on the cluster. Make note of your port assignments for later use when configuring cluster connections and security settings. The table in the networking setup section lists the default port assignments for various components.
Hadoop cluster requirements
RapidMiner Radoop requires a connection to a properly configured Hadoop cluster, where it executes all of its main data processing operations and stores the data related to these processes. The cluster must contain the following components:
- a supported Hadoop distribution, which consists of an HDFS and MapReduce/YARN
- a distributed data warehouse system (Hive or Impala)
- Java 8 or newer on the cluster nodes (necessary for applying most RapidMiner models in-Hadoop)
- optionally, Apache Spark. The Spark requirements on the cluster are described in detail below.
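Because in-Hadoop model application requires Java 8 or newer on the nodes, it is worth verifying the Java version before connecting. The sketch below parses a sample `java -version` banner; on a real node, replace the `echo` with the actual command output as shown in the comment.

```shell
# Sketch: extract the Java major version (8+ is required for in-Hadoop scoring).
# A sample banner is used here; on a cluster node, replace the echo with:
#   java -version 2>&1 | head -n 1
banner='openjdk version "1.8.0_292"'
major=$(echo "$banner" | sed -E 's/.*"(1\.)?([0-9]+).*/\2/')
echo "Java major version: $major"
```

The `sed` expression strips the legacy `1.` prefix, so both `1.8.0_x` and `11.0.x` banners yield the major version directly.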
Using all Spark operators
Apache Spark 1.5.0 was released in September 2015 and is not yet included in all Hadoop distributions. If you want to use every Spark operator and your Hadoop cluster does not have Spark 1.5 or above, it must be installed on the cluster manually. You can download it from the Apache Spark download page; make sure that the package type matches your cluster setup.
- For Hadoop 2.6 or later (for older Hadoop versions, change the download link and the path accordingly):
hadoop fs -mkdir -p /tmp/spark
wget -O /tmp/spark-1.5.2-bin-hadoop2.6.tgz http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
tar xzvf /tmp/spark-1.5.2-bin-hadoop2.6.tgz -C /tmp/
hadoop fs -put /tmp/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar /tmp/spark/
To use the Spark Script operator, you need Python 2.6+ or Python 3.4+ (for PySpark scripts) and R 3.1+ (for SparkR scripts) installed on the cluster nodes. To be able to use MLlib functions in Python, also install the numpy package. Because of PARQUET-136, Hive version 1.2.0 or later is recommended.
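The Spark Script prerequisites above can be checked on each node with a short script. This is a sketch that assumes `python3` is the interpreter PySpark will use on your nodes; adjust it for Python 2 setups.

```shell
# Sketch: check Spark Script operator prerequisites on a cluster node.
# Assumes `python3` is on PATH and is the interpreter PySpark will use.
pyver=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
echo "Python: $pyver"
# numpy is needed to use MLlib functions from PySpark scripts:
python3 -c 'import numpy' 2>/dev/null && echo "numpy: present" || echo "numpy: missing"
# SparkR scripts additionally need R 3.1+ installed:
command -v R >/dev/null && R --version | head -n 1 || echo "R: not found"
```

Run it on every node (for example via a parallel-ssh tool), since PySpark executes on whichever nodes receive the tasks.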
Consider the following differences between using Hive and Impala as the query engine for RapidMiner Radoop.
- Sort operator: Impala does not support the ORDER BY clause without a LIMIT clause. (ORDER BY without LIMIT is supported in Impala 1.4.0 and later.) As a workaround, you can use the Hive Script operator to perform the sort with an explicit LIMIT clause.
- Generate Rank operator: Impala does not support the RANK and DENSE_RANK functions.
- Add Noise operator: Add Noise is not supported on Impala.
- Nominal to Numerical operator: Unique integers method of Nominal to Numerical is not supported on Impala.
- Pivot Table operator: Pivot Table is not supported on Impala.
- Apply Model operator: Model application with Impala is not supported.
- Update Model and Naive Bayes operators: On Impala, RapidMiner Radoop does not support Naive Bayes learning or model updating by operator.
- Correlation Matrix, Covariance Matrix, and Principal Component Analysis operators: The CORR() function is not supported by Impala.
- Performance operators: The Performance (Regression) operator is not supported on Impala. For the Performance (Classification) operator, only the following criteria are supported on Impala: Accuracy, Classification Error, and Kappa.
- Aggregation functions: Some aggregation functions are not supported by Impala, which may affect the Generate Attributes, Normalize, and Aggregate operators. RapidMiner Radoop shows design-time errors for these limitations, even where Impala would allow the query to run.
- No advanced Hive settings: You cannot set advanced Hive parameters for an Impala connection.
- Killing a process: Stopping a process does not kill the current job on the cluster (but also does not start a new process).
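As a sketch of the workaround noted in the Sort limitation above, wrapping the ORDER BY in an explicit LIMIT produces a statement both Hive and Impala accept. The table and column names here are hypothetical, and the query is only assembled and printed; on a real cluster it would go into a Hive Script operator.

```shell
# Sketch: Impala-compatible sort using an explicit LIMIT clause.
# Table/column names are hypothetical; size the limit to your data.
query="SELECT * FROM customer_sales ORDER BY revenue DESC LIMIT 100000"
echo "$query"
# In RapidMiner Radoop, paste the statement into a Hive Script operator;
# from a shell you could run it with, e.g., beeline -e "$query".
```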
Hadoop cluster considerations
Although RapidMiner Radoop easily connects to all supported platforms, special settings may be required if you encounter a problem with one of the listed distributions; details can be found in the Distribution Specific Notes section. This section lists a few considerations to be aware of when choosing an HDFS or data warehousing platform:
- DataStax Enterprise (DSE): obtain the DSE JAR file (dse-<version>.jar) and copy it to a local directory of the client. To configure a RapidMiner Radoop connection to a DataStax cluster, refer to the DSE distribution notes.
- Evaluate the Impala limitations to determine whether Impala is an acceptable alternative for your organization. For example, if you need advanced features (like model scoring), you must use Hive. If you use both Hive and Impala, consult the Impala documentation for information on sharing metadata between the two frameworks: Impala's metadata must be reloaded to reflect any metadata changes (such as creating new tables) made in Hive. (This can be done by enabling the reload impala metadata parameter of the Radoop Nest.)
Installing RapidMiner Radoop on RapidMiner Studio
The RapidMiner Radoop client installation is straightforward, assuming the prerequisites are met and the appropriate ports are available. The extension can be easily installed from the RapidMiner Marketplace.
If you are using RapidMiner Radoop 2.5 or an earlier version, or if you want to install the extension manually, follow the steps below.
Manual extension install
The process is as follows:

1. If necessary, quit RapidMiner Studio.
2. Download the RapidMiner Radoop plugin, a JAR file, from the location specified in your confirmation email.
3. Move the following files to the RapidMiner Studio directory on the host system:
   - the downloaded RapidMiner Radoop JAR file (rapidminer-Radoop-onsite-.jar);
   - if using RapidMiner Radoop 2.5 or an earlier version, your RapidMiner Radoop license file found in the confirmation email (radoop.license). Note: this license file does not work starting from version 2.6.

   In Step 3, you will move the files to one of the following locations:
   - To enable the plugin for all users on a machine (global install), move the files into the install folder.
   - For RapidMiner Studio versions 6.4 and later, to enable the plugin for a single user only, move the files to .RapidMiner/extensions/ in the user home folder. If the extensions folder does not exist, create it.
   - For Mac users running RapidMiner Studio versions 6.4 and later, move the files into .RapidMiner/extensions/. If the extensions folder does not exist, create it. Note that RapidMiner Studio creates .RapidMiner as a hidden folder, so you must set your Mac to display hidden files and folders if you cannot see it.
   - For Mac users running RapidMiner Studio versions prior to 6.4, move the files into the install folder.
4. With both the license (for version 2.5 or earlier) and JAR files moved, start RapidMiner Studio.
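For a single-user install on Linux with Studio 6.4 or later, the destination folder from Step 3 can be prepared as below. The JAR filename pattern in the comment is illustrative; run the move from wherever you downloaded the file.

```shell
# Sketch: prepare the per-user extensions folder (RapidMiner Studio 6.4+).
EXT_DIR="$HOME/.RapidMiner/extensions"
mkdir -p "$EXT_DIR"
# Then move the downloaded JAR into it (filename pattern is illustrative):
#   mv rapidminer-Radoop-onsite-<version>.jar "$EXT_DIR/"
echo "Extensions folder ready: $EXT_DIR"
```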
If the extension has been successfully installed, Hadoop Data appears as a new view in the middle of the RapidMiner Studio startup window:
That’s it. Now that RapidMiner Radoop is installed, see the section on configuring connections to complete the installation.
Consider the following security measures to secure your HDFS and data warehouse infrastructure:
- Apply the firewall settings for your data warehouse system (optional but recommended).
- Use Kerberos or Apache Sentry for securing your cluster. See the Hadoop security section for security configuration suggestions.