Getting started with HBase browser on Cloudera


Step 1: Click on Hue on the welcome screen. This takes you to the Hue web UI for Hadoop on the Cloudera VM.


Step 2: To load sample data into the HBase data store, click on the “Step 2: Examples” tab.


Step 3: Click on “HBase Browser” below. This starts initializing a few example tables in the HBase database. Wait until the initialization is done.


Step 4: Click on the Data Browsers tab and select HBase; this takes you to the HBase Browser home page.


  • You can see that a few sample tables have been initialized in the HBase database.


Step 5: Now let’s construct the following table of a company’s employees.


  1. Click the New Table button in the top right corner to add a new table to the HBase database.
  2. Fill in the table details: the table name is Employee, and we need two column family names: Personal Data and Professional Data. Click “add an additional column family” to add more column family names. Once done, click Submit.
  3. The Employee table is now created.
  4. To add rows and data to the Employee table, click the Employee link.
  5. To add a new row, click the “New Row” button in the lower right corner of the screen.
  6. Fill in the details of the new row as shown in the image below. To add more fields, click the “Add Field” button. After filling in all the required details, click the “Submit” button to insert the row.
  7. You can see the row has been inserted.
  8. Likewise, we can add a few more rows, and the final table will look like this.
  9. To search, or to write a query, type it in the search box as shown below. To look up employee 1’s details, type “emp1” and hit Enter or click the search icon.
  10. To get employee 1’s personal details, the query is “emp1[Personal Data: ]”.
  11. To get employee 1’s personal and professional details, the query is “emp1[Personal Data: , Professional Data: ]”.
  12. To get employee 1’s name, the query is “emp1[Personal Data: Name]”.

 

Hope you enjoyed learning your first HBase lesson.

 

Word Count example on Cloudera Eclipse


Step 1: Open Eclipse on the Cloudera/CentOS desktop.


Step 2: Create a Java MapReduce project.

Go to File > New > Project > Java Project > Next.

Enter “WordCount” as the project name and click “Finish”:


Step 3: Add the Hadoop libraries to the project.

  • Right-click the WordCount project and select “Properties”, then click “Java Build Path”.
  • Click “Add External JARs…”, then navigate to File System > usr > lib > hadoop.


  • Select all the JARs and click OK:


  • We need more external libraries. Click “Add External JARs…” again, then select all the JAR files in the “client” folder and click the “OK” button.


Step 4: Create the Java Mapper and Reducer program.

Right-click the “src” folder of the WordCount project:


New >> Class >> enter “WordCount” in the Name textbox and click Finish.


Now write the code below in the WordCount.java file. Program reference: http://wiki.apache.org/hadoop/WordCount

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

 

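Before running the job on the cluster, it helps to see what it computes: the mapper emits a (word, 1) pair for every token, the shuffle groups the pairs by word, and the reducer sums the ones. Here is a minimal local sketch of the same counting logic using plain Java collections (no Hadoop needed; the sample input is made up for illustration):

```java
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {
    public static void main(String[] args) {
        // Sample text standing in for inputfile.txt
        String input = "Hello World Bye World\nHello Hadoop Goodbye Hadoop";

        // "Map" phase: tokenize; "reduce" phase: sum a 1 for each occurrence.
        TreeMap<String, Integer> counts = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(input);
        while (tokenizer.hasMoreTokens()) {
            counts.merge(tokenizer.nextToken(), 1, Integer::sum);
        }

        // TextOutputFormat writes one "word<TAB>count" line per key.
        for (java.util.Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
        // Prints: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2
    }
}
```

The real job performs exactly this computation, except that the tokenizing and the summing run on different machines, with Hadoop's shuffle phase doing the grouping in between.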

Step 5: Export the project as a JAR.

Right-click the WordCount project and select Export >> Java >> JAR file >> Next.

In the JAR file textbox, enter /home/cloudera/WordCount.jar

Click Finish >> OK.


Step 6: View the exported JAR file.

Click Applications >> System Tools >> Terminal.

Type the following:

cd /home/cloudera

ls


Step 7: Create the input file for the MapReduce program.

In the terminal, type:

vi /home/cloudera/inputfile.txt

This opens inputfile.txt in an editor. Press “i” to switch to insert mode and type some text into the file, like below:

 


To save the file, press Esc, then type :wq and hit Enter.

To view the contents of the file you created, type in the terminal:

cat /home/cloudera/inputfile.txt


Step 8: Move the input file to the Hadoop file system to prepare for running the program.

  • To look at the Hadoop file system, type in the terminal:

hdfs dfs -ls /

Here / indicates the root directory of the Hadoop file system (HDFS).


  • Create an input directory in HDFS using the command:

hdfs dfs -mkdir /input


 

  • Copy inputfile.txt from the Linux file system to HDFS using the command below:

hdfs dfs -put /home/cloudera/inputfile.txt /input/


  • To view the file now in HDFS, type:

hdfs dfs -cat /input/inputfile.txt


Step 9: Run the MapReduce program on Hadoop.

Type the following command to execute the WordCount program:

hadoop jar /home/cloudera/WordCount.jar WordCount /input/inputfile.txt /output_1

Note: Each time you run the above command, you must give a different name for the output directory; in the example above it is output_1, so the next run would use output_2 or some other name. Hadoop will not overwrite an existing output directory.


The output below shows that the program executed successfully:


Step 10: Finally, view the output of the executed job.

  • Type the following in the terminal to list the output_1 directory:

hdfs dfs -ls /output_1


 

  • Last and final: view the output file itself. Type the following in the terminal:

hdfs dfs -cat /output_1/part-r-00000

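Each line of part-r-00000 is a key and a value separated by a tab, which is the default layout written by TextOutputFormat. As a rough sketch, here is one way to parse such output back into a map with plain Java, assuming the file has first been copied to the local file system (for example with hdfs dfs -get); a hard-coded sample string stands in for the real file contents:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class ReadWordCounts {
    // Parse "word<TAB>count" lines, the default TextOutputFormat layout.
    static Map<String, Long> parse(BufferedReader reader) throws IOException {
        Map<String, Long> counts = new LinkedHashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t");
            counts.put(parts[0], Long.parseLong(parts[1]));
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Sample contents standing in for a real part-r-00000 file.
        String sample = "Hadoop\t2\nHello\t2\nWorld\t1";
        Map<String, Long> counts = parse(new BufferedReader(new StringReader(sample)));
        System.out.println(counts); // {Hadoop=2, Hello=2, World=1}
    }
}
```

To read the real file, swap the StringReader for a FileReader pointing at the copied part-r-00000.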

 


			

Cloudera Installation on VMware


Step 1: Download the VMware Player software from https://www.vmware.com/products/player/playerpro-evaluation.html

Step 2: Download the Cloudera QuickStart VM from http://www.cloudera.com/downloads/quickstart_vms/5-5.html

On this page, select the Cloudera version and platform, then click Download, as shown below:


Step 3: Extract the downloaded archive and keep it somewhere you can access it in the next step via VMware Player.

Step 4: Open VMware Player and click Open a Virtual Machine on the right side of the window.


Step 5: Go to the folder where the Cloudera software was extracted, select the Cloudera QuickStart VMware image, and click Open. The Cloudera VM is now added to the library, as below:


Step 6: To start the virtual machine for the first time, right-click it and select Power On, click the Power On button, or click Play Virtual Machine. The Cloudera platform now boots inside the virtual machine; this step may take a few minutes before it is ready to use.

Username: cloudera

Password: cloudera

 


 

Step 7: Cloudera is now installed, and the welcome screen looks like this:


Step 8: To close Cloudera, click the Pause button and then close the VMware Player application. This saves your work so you can resume where you left off the next time you start Cloudera; it does not shut down Cloudera or the CentOS guest on which it is installed.

Word Count example using Pig Script


Step 1:

A word count Pig script. Save the script text into a file whose name ends in ‘.pig’, such as ‘wordcount.pig’.

Step 2:

Be careful with newlines when cutting and pasting; you may need to run the following at the Unix prompt: [unix]> dos2unix infile outfile (or first try dos2unix --help).

Run the script as either:

>> pig -x local wordcount.pig

OR

>> pig -x mapreduce wordcount.pig

Step 3:

Note: Pig by default only saves log files if there is an error.

Thus, to save the standard output from DESCRIBE and DUMP statements, run it as, for example:

>> pig -x local wordcount.pig > mylog_output

or run it interactively and cut and paste from the screen.

Step 4:

Don’t forget to set up the input files in HDFS if you use MapReduce mode:

>>  hdfs dfs -mkdir /user/cloudera/pigin

Step 5:

>>  hdfs dfs -copyFromLocal /home/cloudera/testfile* /user/cloudera/pigin

Step 6:

Don’t forget to set up an output folder in hdfs if you use mapreduce

>> hdfs dfs -mkdir /user/cloudera/pigoutnew

 

Step 7: Save the code below in the wordcount.pig file.

 

wordfile = LOAD '/user/cloudera/pigin/testfile*' USING PigStorage('\n') AS (linesin:chararray);

wordfile_flat = FOREACH wordfile GENERATE FLATTEN(TOKENIZE(linesin)) AS wordin;

wordfile_grpd = GROUP wordfile_flat BY wordin;

word_counts = FOREACH wordfile_grpd GENERATE group, COUNT(wordfile_flat.wordin);

 

Step 8:

If you are running pig -x mapreduce, don’t forget to use HDFS commands to rm the output files and rmdir the output directory before re-running.

Step 9:

>> STORE word_counts INTO '/user/cloudera/pigoutnew/word_counts_pig';

Step 10:

And don’t forget to copy the results from HDFS into your local Unix file system (your paths may differ!):

>> hdfs dfs -copyToLocal /user/cloudera/pigoutnew/word_counts_pig /home/cloudera/word_counts_pig

 

Sum of Multiples of 3 and 5


Problem Statement: 

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.

Find the sum of all the multiples of 3 or 5 below N.

Input Format
First line contains T that denotes the number of test cases. This is followed by T lines, each containing an integer, N.

Output Format
For each test case, print an integer that denotes the sum of all the multiples of 3 or 5 below N.

Constraints
1 ≤ T ≤ 10^5
1 ≤ N ≤ 10^9

Sample Input

2
10
100

Sample Output

23
2318

Program:

import java.io.*;
import java.util.*;



@SuppressWarnings("unused")
public class Question1 {

    public static void main(String[] args) {
        /* Read input from STDIN. Print output to STDOUT. */
        @SuppressWarnings("resource")
        Scanner in = new Scanner(System.in);
        int t1 = in.nextInt();
        int[] N = new int[t1];
        for (int i = 0; i < t1; i++) {
            N[i] = in.nextInt();
        }

        for (int i = 0; i < t1; i++) {
            long sum = 0; // long: the sum can exceed the int range for large N
            for (int j = 1; j < N[i]; j++) {
                if ((j % 3 == 0) || (j % 5 == 0)) {
                    sum += j;
                }
            }
            System.out.println(sum);
        }
    }
}
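The inner loop above is O(N) per test case, so with N up to 10^9 and up to 10^5 test cases it can be very slow. A standard shortcut (not part of the original solution) uses inclusion–exclusion: add the sums of the multiples of 3 and of 5, then subtract the multiples of 15, which would otherwise be counted twice. Each arithmetic series is summed in O(1):

```java
public class MultiplesSum {
    // Sum of all multiples of k strictly below n: k * (1 + 2 + ... + m), where m = (n - 1) / k.
    static long sumBelow(long n, long k) {
        long m = (n - 1) / k;
        return k * m * (m + 1) / 2;
    }

    static long solve(long n) {
        // Inclusion-exclusion: multiples of 15 appear among both the 3s and the 5s.
        return sumBelow(n, 3) + sumBelow(n, 5) - sumBelow(n, 15);
    }

    public static void main(String[] args) {
        System.out.println(solve(10));  // 23, matching the sample output
        System.out.println(solve(100)); // 2318
    }
}
```

With this formula, each test case takes constant time regardless of how large N is.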

Setup PySpark on the Cloudera VM


From the top left menu, Open a terminal: Applications => System Tools => Terminal

Type:

sudo easy_install ipython==1.2.1

Hit Enter; the administrator password is cloudera.

Launch pyspark with IPython

Every time you need to open the pyspark shell, open a terminal and type:

PYSPARK_DRIVER_PYTHON=ipython pyspark

Hit Enter; after the startup logs, you should see the PySpark console.

Check version

To make sure that PySpark started correctly, print out the version by typing in the PySpark IPython terminal:

sc.version

Verify that the output is:

u'1.3.0'

Installation of Cloudera in VirtualBox


Requirements: You must have the Cloudera Quickstart VM imported into VirtualBox. The Cloudera Quickstart VM should be powered off during this setup.
1. Open VirtualBox, select the cloudera-quickstart-vm and click the Settings button.


2. Select the Network button from the Settings window.


3. Click the triangle next to the word Advanced to show the advanced options.


4. Click the Port Forwarding button.


5. Click the + button.


6. In the text boxes, enter:
Name: 22
Protocol: TCP
Host IP:
Host Port: 2222
Guest IP:
Guest Port: 22
Then click the OK button, and click OK again on the Settings window to save the settings.


7. Download and install Cyberduck from https://cyberduck.io/?l=en. Power on the Cloudera Quickstart VM and wait for it to boot up completely before proceeding.
8. Open Cyberduck and click the Open Connection Button.


9. Enter the following connection settings:
Connection Type: SFTP (SSH File Transfer Protocol)
Server: localhost
Port: 2222
Username: cloudera
Password: cloudera
Then click the Connect button.


10. Once you have connected, you should see all of the files in the cloudera home directory. To download a file to your host computer, find the file, right-click the filename, and choose the Download option.
