Big Data Hadoop skills are in high demand nowadays. For those who are new to the term, Big Data refers to collections of large datasets that cannot be processed using traditional computing techniques, and Hadoop is a software framework for storing and processing Big Data. It is an open-source tool built on the Java platform that provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.
So if you are planning your future in Big Data Hadoop, you must be familiar with terms like Hadoop Distributed File System (HDFS), Cloudera Manager, Hive & Impala, Spark architecture, cluster maintenance, security, and YARN. This post covers the hands-on guides you should work through to learn and become an expert in Big Data Hadoop Administration.
1. Activity Guide I: Cloudera Manager Installation
First of all, you should know how to install and configure Cloudera Manager.
Cloudera Manager automates the installation and configuration of CDH and managed services on a cluster, requiring only root SSH access to your cluster’s hosts and access to the internet or a local repository containing the installation files for those hosts. The Cloudera Manager installation software consists of:
- A small self-executing Cloudera Manager installation program to install the Cloudera Manager Server and other packages in preparation for host installation.
- The Cloudera Manager wizard, which automates CDH and managed service installation and configuration on the cluster hosts. Cloudera Manager provides two methods for installing CDH and managed services: traditional packages (RPMs or Debian packages) or parcels. Parcels simplify the installation process and, more importantly, allow you to download, distribute, and activate new minor versions of CDH and managed services from within Cloudera Manager.
The following illustrates a sample installation:
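As a rough sketch, a typical run of the self-executing installer looks like the following. The download URL and version shown are illustrative only; check Cloudera's archive for the release that matches your cluster.

```shell
# Download the small self-executing Cloudera Manager installer
# (URL and version are illustrative -- use the one for your release)
wget https://archive.cloudera.com/cm6/6.3.1/cloudera-manager-installer.bin

# Make it executable and run it as root; it installs the
# Cloudera Manager Server and supporting packages, after which
# the wizard takes over CDH installation on the cluster hosts
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin
```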
2. Activity Guide II: Cloudera Manager Console
Once you have gone through the Cloudera Manager installation process, you are ready to access and use the Cloudera Manager Admin Console.
Cloudera Manager Admin Console is the web-based UI that you use to configure, manage, and monitor CDH.
If there are no services configured when you log into the Cloudera Manager Admin Console, the Cloudera Manager installation wizard displays. If services have been configured, the Cloudera Manager top navigation bar and Homepage display. In addition to a link to the Home page, the Cloudera Manager Admin Console top navigation bar provides quick access to clusters, hosts, diagnostics, audits, charts, and administration settings.
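Before opening the console, it can help to confirm the server is up from the command line; a minimal check, assuming the default port, might look like this:

```shell
# Check that the Cloudera Manager Server service is running
sudo systemctl status cloudera-scm-server

# The Admin Console listens on port 7180 by default; confirm it
# responds, then open http://<cm-server-host>:7180 in a browser
curl -sI http://localhost:7180 | head -n 1
```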
3. Activity Guide III: Hive & Impala Flow & Logs
In this activity guide, you will learn about the process flow and logs of Hive & Impala.
A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new categories of users and open up Hadoop to new types of use cases. Where practical, it makes use of existing Apache Hive infrastructure that many Hadoop users already have in place to perform long-running, batch-oriented SQL queries.
In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as the Metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and compression codecs.
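The shared Metastore can be seen in action with a short exercise: create a table through Hive, then query it from Impala. The hostnames and table name below are illustrative; note that Impala caches catalog metadata, so it must be told to refresh before it sees a table created outside it.

```shell
# Create a table through Hive (Beeline connects to HiveServer2)
beeline -u jdbc:hive2://localhost:10000 \
  -e "CREATE TABLE demo_events (id INT, name STRING) STORED AS PARQUET;"

# Impala shares the same Metastore, but caches its catalog;
# INVALIDATE METADATA forces it to pick up the new definition
impala-shell -q "INVALIDATE METADATA demo_events;"
impala-shell -q "SELECT COUNT(*) FROM demo_events;"
```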
4. Activity Guide IV: Spark Architecture & Process Flow
The next task is to learn and understand the concepts of Spark architecture. In this activity guide you will learn about Spark components, the process flow of getting started with a Spark job, and how to troubleshoot a Spark job.
Apache Spark is an open-source, general-purpose distributed computing engine used for processing and analyzing large amounts of data. Like Hadoop MapReduce, it distributes data across the cluster and processes that data in parallel.
Apache Spark is considered a powerful complement to Hadoop, big data’s original technology of choice. Spark is a more accessible, powerful, and capable big data tool for tackling various big data challenges.
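To make the process flow concrete, the following sketch submits the bundled SparkPi example to a YARN-managed cluster and then pulls the job's aggregated logs for troubleshooting. The jar path, resource sizes, and application ID are illustrative and will differ on your cluster.

```shell
# Submit the bundled SparkPi example to YARN in cluster mode
# (paths and resource sizes are illustrative)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 2 \
  --executor-memory 2g \
  /opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples*.jar 100

# Troubleshoot a finished or failed job by pulling its
# aggregated logs from YARN (application ID is illustrative)
yarn logs -applicationId application_1234567890123_0001
```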
5. Activity Guide V: Data Ingestion Using Sqoop & Kafka
The next topic is an introduction to Sqoop & Kafka, two tools used for data ingestion from external sources.
- Sqoop: A common ingestion tool used to import data into Hadoop from any RDBMS. Sqoop provides an extensible Java-based framework that can be used to develop new Sqoop connectors for importing data into Hadoop. Sqoop runs on the MapReduce framework on Hadoop and can also be used to export data from Hadoop to relational databases.
- Kafka: A highly scalable messaging system that efficiently stores messages on disk partitions within a Kafka topic. Producers publish messages to Kafka topics, and Kafka consumers read them at their own pace.
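A minimal sketch of both tools follows; the connection string, credentials, table, broker address, and topic name are all illustrative placeholders for your own environment.

```shell
# Sqoop: import a MySQL table into HDFS over 4 parallel mappers
# (connection string, credentials, and table are illustrative)
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hive/warehouse/orders \
  --num-mappers 4

# Kafka: create a topic, publish to it, then read it back
kafka-topics --bootstrap-server broker1:9092 --create \
  --topic events --partitions 3 --replication-factor 2
kafka-console-producer --bootstrap-server broker1:9092 --topic events
kafka-console-consumer --bootstrap-server broker1:9092 --topic events \
  --from-beginning
```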
6. Activity Guide VI: Oozie & How It Works in Scheduling Jobs
The next activity guide gives you an understanding of Oozie and job scheduling.
CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects, includes a framework called Apache Oozie that can be used to design complex job workflows and coordinate them to occur at regular intervals. In this how-to, you’ll review a simple Oozie coordinator job, and learn how to schedule a recurring job in Hadoop.
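As a sketch of how such a coordinator job is submitted, the job.properties below points Oozie at a coordinator definition stored in HDFS; all hostnames, paths, dates, and the job ID are illustrative.

```shell
# job.properties tells Oozie where the coordinator app lives in HDFS
# and over what window it should fire (all values illustrative)
cat > job.properties <<'EOF'
nameNode=hdfs://nn.example.com:8020
jobTracker=rm.example.com:8032
oozie.coord.application.path=${nameNode}/user/admin/apps/daily-etl
start=2024-01-01T00:00Z
end=2024-12-31T00:00Z
EOF

# Submit and start the coordinator, then check its status by job ID
oozie job -oozie http://oozie.example.com:11000/oozie \
  -config job.properties -run
oozie job -oozie http://oozie.example.com:11000/oozie \
  -info 0000001-240101000000000-oozie-oozi-C
```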
7. Activity Guide VII: Cluster Maintenance: Directory Snapshots
In this activity guide, you will get to know Hadoop clusters and directory snapshots, and perform the steps for adding and removing cluster nodes.
Hadoop clusters require a moderate amount of day-to-day care and feeding in order to remain healthy and in optimal working condition. Maintenance tasks are usually performed in response to events: expanding the cluster, dealing with failures or errant jobs, managing logs, or upgrading software in a production environment.
Cloudera Manager supports both HBase and HDFS snapshots:
- HBase snapshots allow you to create point-in-time backups of tables without making data copies, and with minimal impact on RegionServers. HBase snapshots are supported for clusters running CDH 4.2 or later.
- HDFS snapshots allow you to create point-in-time backups of directories or the entire filesystem without actually cloning the data. These snapshots appear on the filesystem as read-only directories that can be accessed just like any other ordinary directories. HDFS snapshots are supported for clusters running CDH 5 or later. CDH 4 does not support snapshots for HDFS.
Cloudera Manager lets you create, delete, and restore snapshots manually, and also lets you define snapshot policies that specify the directories or tables to be snapshotted, the interval at which snapshots should be taken, and the number of snapshots to keep for each interval.
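The same HDFS snapshot operations can also be driven from the command line. A minimal sketch, with illustrative paths and snapshot names:

```shell
# Allow snapshots on a directory (run as the HDFS superuser),
# then take a named point-in-time snapshot of it
hdfs dfsadmin -allowSnapshot /user/hive/warehouse
hdfs dfs -createSnapshot /user/hive/warehouse before-upgrade

# Snapshots appear as a read-only .snapshot directory
hdfs dfs -ls /user/hive/warehouse/.snapshot

# Recover a deleted file by copying it back out of the snapshot
# (file name is illustrative)
hdfs dfs -cp /user/hive/warehouse/.snapshot/before-upgrade/orders.parquet \
  /user/hive/warehouse/

# Remove the snapshot when it is no longer needed
hdfs dfs -deleteSnapshot /user/hive/warehouse before-upgrade
```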
You can get all of these step-by-step activity guides, including live interactive (theory) sessions, when you register for our Big Data Hadoop Administration Training.
If you register for our course, you’ll also get:
- Live instructor-led online interactive sessions
- FREE unlimited retakes for the next 3 years
- FREE on-the-job support for the next 3 years
- Training material (presentations + videos) with hands-on lab exercises
- Recordings of the live interactive sessions, with lifetime access
- 100% money-back guarantee (if you attend the sessions, practice, and don’t get results, we’ll give a full REFUND; check our Refund Policy)
Have queries? Contact us at firstname.lastname@example.org, or if you would like to speak with us, mail us your phone number with country code and a convenient time to call.
If you are looking for commonly asked Big Data Hadoop Administration interview questions, just click below to get them in your inbox, or join our private Facebook group dedicated to Big Data Hadoop members only.