4394639915_28dbfd85cc_z

Hadoop Command Line Cheatsheet

Useful commands when using Hadoop on the command line

Filesystem

Full reference can be found in Hadoop Documentation.

ls

List the contents of provided directory.

put

Put the local file to provided HDFS location

get

Copy the file to the local file system

cat/text

Outputs the contents of HDFS file to standard output. Text command will also read compressed files and output uncompressed data.

Common usecase is that you want to check contents of the file, but not output the whole file. Pipe the contents to head.

cp/mv

cp is short for copy, copy file from source to destination.

mv is short for move, move file from source to destination.

chmod

Change the permissions of the file/directory. Uses standard Unix file permissions.

getmerge

Takes a source directory and concatenates all the content and outputs to a local file. Very useful as commonly Hadoop jobs will output multiple output files depending on the number of mappers/reducers you have.

rm

Deletes a file from HDFS. The -r means perform recursively. You will need to do this for directories.

By default the files will be moved to trash that will eventually be cleaned up. This means the space will not be immediately freed up. If you need the space immediately you can use -skipTrash, note this will mean you can reverse the delete.

du

Displays the sizes of directories/files. Does this recursively, so extremely useful for find out how much space you have. The -h option makes the sizes human readable.  The -s option summarises all the files, instead of giving you individual file sizes.

One thing to note is that the size reported is un-replicated.  If your replication factor is 3, the actual disk usage will be 3 times this size.

count

I commonly use this command to find out how much quota I have available on a specific directory (you need to add the -q options for this).

To work out how much quota you have used, SPACE_QUOTA - REMAINING_SPACE_QUOTA, e.g. 54975581388800 - 5277747062870 or 54.97TB - 5.27TB = 49.69TB left.

Note this figures are the replicated numbers.

Admin Report

Useful command for finding out total usage on the cluster. Even without superuser access you can see current total capacity and usage.

Hadoop Jobs

Launching Hadoop Jobs

Commonly you will have a fat jar file containing all the code for your map reduce job. Launch via:

A Scalding job is launched using:

If you need to kill a map-reduce job, use:

You can find the job id in the resource manager or in the log of the job launch. This can be used to kill any map-reduce job (Standard Hadoop, Scalding, Hive, etc.) but not Impala or Spark jobs for instance.