4. How to use these playbooks

This project provides various kinds of playbooks to configure and manage nodes and services.

Note

To understand each playbook, please refer to the About playbooks section.

4.1. Assumptions of this section

  • You should have the servers described in the Servers section.

  • You should be able to access all hosts listed in the inventory created by the following procedure; e.g. configure /etc/hosts in advance so that nodes can be reached by hostname, as in the example below.
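For example, a minimal /etc/hosts on the Ansible driver could look like the following (the IP addresses and hostnames are placeholders; replace them with those of your own servers):

192.168.1.11 master01
192.168.1.12 master02
192.168.1.13 master03
192.168.1.21 slave01
192.168.1.22 slave02
192.168.1.31 client01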

4.2. How to configure Ansible execution environment

If you have not configured an Ansible execution environment yet, you can set it up with the following procedure.

In this section, we start by installing the Ansible packages.

4.2.1. Install packages

Install the EPEL repository

$ sudo yum install -y epel-release

Install Ansible

$ sudo yum install -y ansible

4.2.2. Clone playbooks

Clone this project to any path you like.

E.g.

$ cd ~
$ mkdir Sources
$ cd Sources
$ git clone https://github.com/dobachi/ansible-hadoop.git ansible

4.2.3. Create inventory

Create an inventory for your environment. You can use the examples in this project, hosts.medium_sample and hosts.large_sample.

$ cp hosts.medium_sample hosts.test
$ ln -s hosts.test hosts

Modify the top group of the inventory and the hostnames in each group.

$ vim hosts
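For example, an inventory derived from hosts.medium_sample could be structured like the following (the group layout and hostnames below are only an illustration; follow the structure of the sample file you copied):

[hadoop_master]
master01
master02
master03

[hadoop_slave]
slave01
slave02

[hadoop_client]
client01

[hadoop_all:children]
hadoop_master
hadoop_slave
hadoop_client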

4.2.4. Create ansible.cfg

Create ansible.cfg by referring to the example in this project.

$ cp ansible.cfg.sample ansible.cfg

The important differences from the default ansible.cfg, which you can find at /etc/ansible/ansible.cfg, are:

  • hostfile = hosts

    • To use the inventory file in the current directory.

  • library = /usr/share/ansible:library

    • To also search the “library” directory in the current directory.

  • roles_path = roles

    • To use roles in the current directory.
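Putting these together, a minimal ansible.cfg along the lines of the sample looks like this:

[defaults]
hostfile = hosts
library = /usr/share/ansible:library
roles_path = roles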

4.2.5. Try ping to all nodes

Check whether all nodes are reachable and “sudo” is available:

$ ansible -m ping hadoop_all -k -s

4.3. How to boot EC2 instances for Hadoop cluster

If you want to use Hadoop on EC2 instances, you can use playbooks/operation/ec2/hadoop_nodes_up.yml to boot instances.

4.3.1. Define environment variables for AWS access

We use environment variables to configure AWS access keys. Define AWS_ACCESS_KEY and AWS_SECRET_KEY in your ~/.bashrc:

export AWS_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXX
export AWS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXXXXXXX

If you don’t have AWS keys, create them by referring to the AWS web site.

4.3.2. Define parameters for ec2_hadoop role

You can find the parameter descriptions for the ec2_hadoop role in roles/ec2_hadoop/defaults/main.yml.

To define your own parameters, create a group variable file (e.g. group_vars/all/ec2) and define the parameters in it.

The following is an example of group_vars/top.

ec2_hadoop_group_id: sg-xxxxxxxx

ec2_hadoop_accesskey: xxxxx

ec2_hadoop_itype: xx.xxxxx

ec2_hadoop_master_image: ami-xxxxxxxx
ec2_hadoop_slave_image: ami-xxxxxxxx
ec2_hadoop_client_image: ami-xxxxxxxx

ec2_hadoop_region: xx-xxxxxxxxx-x

ec2_hadoop_vpc_subnet_id: subnet-xxxxxxxx

If you don’t define the required parameters, you will see errors like:

One or more undefined variables: 'ec2_hadoop_group_id' is undefined

4.3.3. Apply playbook

Execute ansible-playbook command.

$ ansible-playbook playbooks/operation/ec2/hadoop_nodes_up.yml -c local

As a result, you can find an IP address list, an Ansible inventory file and an example /etc/hosts for the EC2 instances in /tmp/ec2_<unix epoch time>, where <unix epoch time> is the time you executed this playbook.

4.3.4. (Supplement) When you restart EC2 instances

When you restart EC2 instances, their public IP addresses may change. You can obtain the new IP address tables by executing the playbook again.

$ ansible-playbook playbooks/operation/ec2/hadoop_nodes_up.yml -c local

4.4. How to configure host names of nodes

If you want to configure the hostnames of nodes, you can use the “common” role and its related playbooks.

Execute the ansible-playbook command with common_only_common.yml:

$ cd /etc/ansible
$ ansible-playbook playbooks/conf/common/common_only_common.yml -k -b -e "common_config_hostname=True server=hadoop_all"

This is useful when configuring EC2 instances, because each node may come up with a different hostname after booting.

4.5. How to configure Bigtop HDFS/YARN environment

You can construct a Bigtop HDFS/YARN environment with the ansible-playbook command.

4.5.1. Preparation

If you have not configured an Ansible execution environment, you should configure it first. See the How to configure Ansible execution environment section.

4.5.2. Procedure

In the following example, we set common_hosts_replace to True. With this parameter, Ansible replaces /etc/hosts on each node with the Ansible driver server’s /etc/ansible/roles/common/files/hosts.default.

$ ansible-playbook playbooks/conf/hadoop/hadoop.yml -k -b -e "common_hosts_replace=True"
$ ansible-playbook playbooks/operation/hadoop/init_zkfc.yml -k -b
$ ansible-playbook playbooks/operation/hadoop/init_hdfs.yml -k -b

Start services

$ ansible-playbook playbooks/operation/hadoop/start_cluster.yml -k -b

You may need to clean up the ZKFC environment if starting HDFS failed.

$ ansible-playbook playbooks/operation/hadoop/bootstrap_nnstandby.yml -k -b
$ ansible-playbook playbooks/operation/hadoop/init_zkfc.yml -k -b
$ ansible-playbook playbooks/operation/hadoop/init_hdfs.yml -k -b
$ ansible-playbook playbooks/operation/hadoop/start_cluster.yml -k -b

4.5.3. How to install Spark environment on Bigtop environment

You can install Spark Core on the client node with the following commands:

$ ansible-playbook playbooks/conf/spark/spark_client.yml -k -s
$ ansible-playbook playbooks/conf/spark/spark_misc.yml -k -s

If you want to start Spark’s history server, please execute the following command.

$ ansible-playbook playbooks/operation/hadoop/start_spark_historyserver.yml -k -s

4.6. How to configure Bigtop Pseudo environment

You can construct a Bigtop pseudo-distributed HDFS/YARN environment with the ansible-playbook command.

4.6.1. Preparation

If you have not configured an Ansible execution environment, you should configure it first. See the How to configure Ansible execution environment section.

4.6.2. Procedure

In the following example, we set common_hosts_replace to True. With this parameter, Ansible replaces /etc/hosts on each node with the Ansible driver server’s /etc/ansible/roles/common/files/hosts.default.

$ ansible-playbook playbooks/conf/hadoop_pseudo/hadoop_pseudo.yml -k -b -e "common_hosts_replace=True"
$ ansible-playbook playbooks/operation/hadoop_pseudo/init_hdfs.yml -k -b

Start services

$ ansible-playbook playbooks/operation/hadoop_pseudo/start_cluster.yml -k -b

4.7. How to install Ganglia environment

You can install Ganglia services with the following command:

$ ansible-playbook playbooks/conf/ganglia/ganglia_all.yml -k -s

4.7.1. How to use unicast for communication between gmonds

This playbook uses multicast for communication between gmonds by default. In some situations you may want to use unicast instead, for example when you are using AWS EC2.

The parameter “ganglia_slave_use_unicast” defines whether unicast is used. If you set this parameter to True in your group_vars, unicast is used.

Example(group_vars/all/ganglia):

ganglia_slave_use_unicast: True

Please configure the parameter “ganglia_slave_host” as well as “ganglia_slave_use_unicast”. This parameter defines the destination to which each gmond sends its metrics, and it should be a representative node to which gmetad connects.
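For example, group_vars/all/ganglia with both parameters could look like the following (the host value is a placeholder; point it at the node your gmetad polls):

ganglia_slave_use_unicast: True
ganglia_slave_host: master01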

4.8. How to install and configure InfluxDB and Grafana

You can install InfluxDB and Grafana services with the following command. Be careful about which machine you install InfluxDB on, because InfluxDB uses port 8088, which the Hadoop YARN ResourceManager also uses.

$ ansible-playbook playbooks/conf/influxdb/all.yml -k -s

You can access http://<Grafana server>:3000/ to view Grafana.

4.9. How to install Spark community edition

4.9.1. Obtain a package or compile the sources

You can get a Spark package from the official Spark download site.

If you want to use a package you compiled yourself, build it according to the official Spark build procedure.

You can also use playbooks/operation/spark_comm/make_spark_packages.yml to build it. When you use this playbook, please specify the following parameters (an example follows the list).

  • spark_comm_src_dir

  • spark_comm_version

  • spark_comm_mvn_options

  • spark_comm_hadoop_version
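As an illustration, these parameters could be placed in a group variable file such as group_vars/all/spark_comm (the path and all values below are assumptions; check the playbook and the spark_comm role defaults for the exact semantics):

spark_comm_src_dir: /home/user/Sources/spark
spark_comm_version: 1.4.0
spark_comm_mvn_options: "-Pyarn -Phadoop-2.4"
spark_comm_hadoop_version: 2.5.0-cdh5.3.2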

4.9.2. Configure parameters

You can use playbooks/conf/spark_comm/all.yml to configure a Spark community edition environment.

These playbooks and roles expect to fetch the Spark tar package over HTTP. You should configure the following parameters to specify where Ansible should get the Spark tar package:

  • spark_comm_package_url_base

  • spark_comm_package_name

The download URL is constructed as {{ spark_comm_package_url_base }}/{{ spark_comm_package_name }}.tgz. For example, if the download URL is “http://example.local/spark/spark-1.4.0-SNAPSHOT-bin-2.5.0-cdh5.3.2.tgz”, spark_comm_package_url_base is “http://example.local/spark” and spark_comm_package_name is “spark-1.4.0-SNAPSHOT-bin-2.5.0-cdh5.3.2”.

Note

spark_comm_package_name does not include “.tgz”
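Using the example above, the corresponding group variable definitions (e.g. in group_vars/all/spark_comm) would be:

spark_comm_package_url_base: http://example.local/spark
spark_comm_package_name: spark-1.4.0-SNAPSHOT-bin-2.5.0-cdh5.3.2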

4.9.3. Execute playbooks

After configuration of parameters, you can execute Ansible playbooks.

$ ansible-playbook playbooks/conf/spark_comm/all.yml -k -s

4.9.4. Start history server

Start Spark’s history server with the following command.

$ ansible-playbook playbooks/operation/spark_comm/start_spark_historyserver.yml -k -s

4.10. Configure Zeppelin

4.10.1. Obtain sources and build

First, according to the official README, you need to compile the source code and make a package.

Please take care with the compile options; you should specify the Spark and Hadoop versions you are using.

The following is an example for a CDH 5.3.3, Spark 1.3, YARN environment.

$ mvn clean package -Pspark-1.3 -Dhadoop.version=2.5.0-cdh5.3.3 -Phadoop-2.4 -Pyarn -DskipTests

You can also use the helper playbook playbooks/operation/zeppelin/build.yml. Before executing it, please configure the following parameters in the playbook (an illustrative example follows the list).

  • zeppelin_git_url

  • zeppelin_src_dir

  • zeppelin_version

  • zeppelin_comiple_flag

  • zeppelin_hadoop_version
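A sketch of these variables, reusing the CDH 5.3.3 / Spark 1.3 example above (the repository URL, source directory and values are assumptions; adjust them to your environment and to the variable names actually used in the playbook):

zeppelin_git_url: https://github.com/apache/zeppelin.git
zeppelin_src_dir: /home/user/Sources/zeppelin
zeppelin_version: master
zeppelin_comiple_flag: "-Pspark-1.3 -Phadoop-2.4 -Pyarn -DskipTests"
zeppelin_hadoop_version: 2.5.0-cdh5.3.3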

Finally, the playbook that configures Zeppelin makes use of the package you compiled in the above procedure. The package is downloaded over HTTP, so you need to put the package on an HTTP web server.
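As one simple way to serve the package (this is not part of the playbooks, just an example), you can run a throwaway HTTP server in the directory that holds the compiled tar file:

$ cd /path/to/packages
$ python -m SimpleHTTPServer 8000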

4.10.2. Executing playbook

To configure Zeppelin, please execute the following playbook.

$ ansible-playbook playbooks/conf/zeppelin/zeppelin.yml -k -s

After finishing configuration, you need to start Zeppelin service.

$ ansible-playbook playbooks/operation/zeppelin/start_zeppelin.yml -k -s

4.11. Configure Kafka cluster

4.11.1. Information

We assume that a ZooKeeper ensemble is configured on master01, master02 and master03. If you use a different ZooKeeper ensemble, you should modify the kafka role’s parameters.

4.11.2. Executing playbook

To configure Kafka cluster, please execute the following playbook.

$ ansible-playbook playbooks/conf/kafka/kafka_broker.yml -k -s

After finishing configuration, you need to start Kafka cluster.

$ ansible-playbook playbooks/operation/kafka/start_kafka.yml -k -s

4.12. Configure Confluent services

4.12.1. Information

We assume that a ZooKeeper ensemble is configured on master01, master02 and master03. If you use a different ZooKeeper ensemble, you should modify the kafka role’s parameters.

4.12.2. Executing playbook

To configure Kafka broker cluster, please execute the following playbook.

$ ansible-playbook playbooks/conf/confluent/kafka_broker.yml -k -s

After finishing configuration, you need to start Kafka cluster.

$ ansible-playbook playbooks/operation/start_kafka_server.yml -k -s

In the same way as the above procedure, you can install Schema Registry and Kafka REST Proxy by using kafka_schema.yml and kafka_rest.yml in the playbooks/conf/confluent directory. Then use the following playbooks to start these services:

$ ansible-playbook playbooks/operation/start_schema_registry.yml -k -s
$ ansible-playbook playbooks/operation/start_kafka_rest.yml -k -s

4.13. Configure Ambari

To install the basic packages, execute the following command.

$ ansible-playbook playbooks/conf/ambari/ambari_server.yml -k -s

Install the Ambari agent on all machines.

$ ansible-playbook playbooks/conf/ambari/ambari_agent.yml -k -s

Execute the initialization of the Ambari server.

$ ansible-playbook playbooks/operation/ambari/setup.yml -k -s

Then you can access the Ambari web UI on the “manage” node.

Note

Todo: blueprint

4.14. Configure Jenkins

To install Jenkins and related packages, execute the following command.

$ ansible-playbook playbooks/conf/jenkins/jenkins.yml -k -s

4.15. Configure Anaconda CE

To install Anaconda2 CE, execute the following command.

$ ansible-playbook playbooks/conf/anacondace/anacondace2.yml -k -s

To install Anaconda3 CE, execute the following command.

$ ansible-playbook playbooks/conf/anacondace/anacondace3.yml -k -s

The above commands install the Anaconda CE packages to the /usr/local/anacondace directory. If you want to add it to your PATH, please do so yourself.

4.16. Configure Hive

To install Hive and related packages, execute the following command.

$ ansible-playbook playbooks/conf/cdh5_hive/cdh5_hive.yml -k -b -e "server=hadoop_client"

The above command installs the PostgreSQL and Hive packages as well as common packages. To initialize the PostgreSQL database, execute the following commands. Note that these commands remove any existing database before initializing it.

$ ansible-playbook playbooks/operation/postgresql/initdb.yml -b -k -e "server=hadoop_client"
$ ansible-playbook playbooks/operation/postgresql/restart_postgresql.yml -b -k -e "server=hadoop_client"

To create the user and database, execute the following command.

$ ansible-playbook playbooks/operation/cdh5_hive/create_metastore_db -k -b -e "server=hadoop_client"

To define the schema, execute the following commands on the Hadoop client.

$ cd /usr/lib/hive/scripts/metastore/upgrade/postgres
$ sudo -u postgres psql
postgres=# \c metastore
metastore=# \i hive-schema-1.1.0.postgres.sql
metastore=# \pset tuples_only on
metastore=# \o /tmp/grant-privs
metastore=#   SELECT 'GRANT SELECT,INSERT,UPDATE,DELETE ON "' || schemaname || '"."' || tablename || '" TO hiveuser ;'
metastore-#   FROM pg_tables
metastore-#   WHERE tableowner = CURRENT_USER and schemaname = 'public';
metastore=# \o
metastore=# \pset tuples_only off
metastore=# \i /tmp/grant-privs
metastore=# \q

To start the metastore service, execute the following commands.

$ ansible-playbook playbooks/conf/cdh5_hive/cdh5_hive.yml -b -k -e "server=hadoop_client"
$ ansible-playbook playbooks/operation/postgresql/restart_postgresql.yml -b -k -e "server=hadoop_client"
$ ansible-playbook playbooks/operation/cdh5_hive/start_metastore.yml -k -b -e "server=hadoop_client"

If you also want to use Hive as an input for Spark, please copy hive-site.xml from /etc/hive/conf to /etc/spark/conf.
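For example, on the client node:

$ sudo cp /etc/hive/conf/hive-site.xml /etc/spark/conf/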

4.17. Configure Pseudo Alluxio

To configure pseudo Alluxio environment, please execute the following command.

$ ansible-playbook playbooks/conf/alluxio/alluxio_pseudo.yml -k -b

This deploys an Alluxio package under “/opt/alluxio/” and creates a link, “/opt/alluxio/default”. The “alluxio” user and group are also created.

After the configuration, execute the following command to mount the RAMFS and format it.

$ ansible-playbook playbooks/operation/alluxio_pseudo/format.yml -k -b

This creates a RAMFS space on “/mnt/ramdisk/alluxioworker” and formats it.

Then we can start the Alluxio processes.

$ ansible-playbook playbooks/operation/alluxio_pseudo/start.yml -k -b

We can run tests with the following command.

$ ansible-playbook playbooks/operation/alluxio_pseudo/test.yml -c local -b -k -vvv

Note

To print STDOUT / STDERR messages, we use the -vvv option.

If you want to stop the processes, you can use the following command.

$ ansible-playbook playbooks/operation/alluxio_pseudo/stop.yml -c local -b -k

4.18. Configure Alluxio on YARN

To configure the Alluxio environment on the client, please execute the following command.

$ ansible-playbook playbooks/conf/alluxio/alluxio_yarn.yml -k -s

This configures /usr/local/alluxio, compiles the sources, adds some directories to PATH, and so on.

Note

The alluxio_yarn role creates a tar file which is used when you deploy an application to YARN, and replaces alluxio-yarn.sh in the Alluxio package. This is because the original alluxio-yarn.sh creates tar files every time you deploy an application, which is not convenient.

If you want to deploy an Alluxio application to YARN, please execute the following command.

$ ansible-playbook playbooks/operation/alluxio_yarn/deploy_alluxio.yml -k -s

You can configure the following variables.

  • alluxio_yarn_hadoop_home: “/usr/lib/hadoop”

  • alluxio_yarn_yarn_home: “/usr/lib/hadoop-yarn”

  • alluxio_yarn_hadoop_conf_dir: “/etc/hadoop/conf”

  • alluxio_yarn_num_workers: “3”

  • alluxio_yarn_working_dir: “hdfs://mycluster/tmp”

  • alluxio_yarn_master: ‘{{ groups[“hadoop_slave”][0] }}’
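For example, these can be collected in a group variable file (e.g. group_vars/all/alluxio; the path is an assumption) using the default values listed above:

alluxio_yarn_hadoop_home: "/usr/lib/hadoop"
alluxio_yarn_yarn_home: "/usr/lib/hadoop-yarn"
alluxio_yarn_hadoop_conf_dir: "/etc/hadoop/conf"
alluxio_yarn_num_workers: "3"
alluxio_yarn_working_dir: "hdfs://mycluster/tmp"
alluxio_yarn_master: "{{ groups['hadoop_slave'][0] }}"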

4.19. Configure TPC-DS

To configure TPC-DS, please execute the following command.

$ ansible-playbook playbooks/conf/tpc_ds/tpc_ds.yml -k -s

The default target node is localhost. If you want to configure other nodes, execute the command with the “server” variable overridden as follows.

$ ansible-playbook playbooks/conf/tpc_ds/tpc_ds.yml -k -b -e "server=hadoop_client:hadoop_slave"

4.20. Configure Keras and Tensorflow

If you want to use a GPU, you should download the cuDNN package from NVIDIA’s download site manually, because you need to register on NVIDIA’s site before downloading the package. The “cuda” role uses cudnn-8.0-linux-x64-v5.1.solitairetheme8. Before executing the playbook, you should store cudnn-8.0-linux-x64-v5.1.solitairetheme8 in the roles/cuda/files directory.
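For example, assuming the archive was downloaded to your home directory:

$ cp ~/cudnn-8.0-linux-x64-v5.1.solitairetheme8 roles/cuda/files/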

If you don’t want to use a GPU, you don’t need to download the cuDNN packages.

4.20.1. GPU

$ ansible-playbook playbooks/conf/tensorflow/keras_gpu.yml -k -b -e "server=hd-client01"

4.20.2. CPU

$ ansible-playbook playbooks/conf/tensorflow/keras.yml -k -b -e "server=hd-client01"