MapR provides compression for files stored in the cluster. Compression is applied automatically to uncompressed files unless you turn compression off. The advantages of compression are:
- Compressed data uses less bandwidth on the network than uncompressed data.
- Compressed data uses less disk space.
MapR supports three different compression algorithms:
- lz4 (default)
- lzf
- zlib
Compression algorithms can be evaluated for compression ratio (higher compression means less disk space used), compression speed, and decompression speed. The following table compares the three supported algorithms. The figures are based on a single thread running on a Core 2 Duo at 3 GHz.
|Compression Type|Compression Ratio|Compression Speed|Decompression Speed|
|---|---|---|---|
|lz4|2.084|330 MB/s|915 MB/s|
|lzf|2.076|197 MB/s|465 MB/s|
|zlib|3.095|14 MB/s|210 MB/s|
Note that compression speed depends on various factors including:
- block size (the smaller the block size, the faster the compression speed)
- single-thread vs. multi-thread system
- single-core vs. multi-core system
- the type of codec used
Compression is set at the directory level. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. Sub-directories on which compression has not been explicitly set inherit the compression settings of the directory that contains them.
If you change a directory's compression settings after writing a file, the file will keep the old compression settings---that is, if you write a file in an uncompressed directory and then turn compression on, the file does not automatically end up compressed, and vice versa. Further writes to the file will use the file's existing compression setting.
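The persistence of per-file compression settings can be sketched as follows. This is a hedged example: it reuses the NFS mount path from later in this document, and the `raw` subdirectory name is a placeholder.

```shell
# Write a file into a directory where compression is currently off
# (paths are examples; the volume is assumed NFS-mounted as shown):
cp data.csv /mapr/my.cluster.com/projects/test/raw/

# Turn compression on for the directory afterwards:
hadoop mfs -setcompression on /projects/test/raw

# data.csv keeps its original (uncompressed) setting; only files
# written after the change pick up the new directory setting.
```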
|Only the owner of a directory can change its compression settings or other attributes. Write permission is not sufficient.|
By default, MapR does not compress files whose filename extensions indicate they are already compressed. The default list of filename extensions is as follows:
The list can be viewed with the config load command. Example:
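A minimal sketch of viewing the configuration, assuming the standard `maprcli` invocation; the exact property name that holds the extension list is an assumption here:

```shell
# Dump the cluster configuration as JSON and look for the
# no-compression extension list (property name assumed):
maprcli config load -json | grep -i nocompression
```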
You can turn compression on or off for a given directory in two ways:
- Set the value of the Compression attribute in the `.dfs_attributes` file at the top level of the directory.
- Use the command `hadoop mfs -setcompression on|off|lzf|lz4|zlib <dir>`.
If you choose -setcompression on without specifying an algorithm, lz4 is used by default. This algorithm has improved compression speeds for MapR's block size of 64 KB.
Suppose the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test. You can turn off compression by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting Compression=false. To accomplish the same thing from the hadoop shell, use the following command:
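Based on the command syntax given above, the equivalent shell command would look like this; whether the command takes the cluster-internal path (`/projects/test`) rather than the NFS path is an assumption:

```shell
# Turn off compression for the test volume's directory:
hadoop mfs -setcompression off /projects/test
```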
You can view the compression settings for directories using the hadoop mfs -ls command. For example,
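A hedged sketch of inspecting the settings; the path is an example, and the exact output layout is not reproduced here:

```shell
# List a directory with MapR-specific attributes; the compression
# setting appears as a one-letter symbol (such as U) in the listing:
hadoop mfs -ls /projects/test
```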
The symbols for the various compression settings are explained here:
|Symbol|Meaning|
|---|---|
|U|Uncompressed, or previously compressed by another algorithm|
By default, MapReduce uses compression during the Shuffle phase. You can use the `-Dmapreduce.maprfs.use.compression` switch to turn compression off during the Shuffle phase of a MapReduce job. For example:
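A sketch of passing the switch to a job, assuming the standard Hadoop examples jar; the jar path, job name, and input/output directories are placeholders:

```shell
# Run a job with Shuffle-phase compression disabled
# (jar path and job/class names are example placeholders):
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  -Dmapreduce.maprfs.use.compression=false \
  /in /out
```

The `-D` option must appear before the job's own arguments so that the generic options parser picks it up.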