XGBoost is a library designed for boosted tree algorithms. It has become a popular machine learning framework among data science practitioners, especially on Kaggle, a platform for data prediction competitions where researchers post their data and statisticians and data miners compete to produce the best models. For structured learning problems on Kaggle, it can be difficult to reach the top 10 without including XGBoost. Typically, data scientists train XGBoost models on a single multi-threaded machine. Very few people have deployed XGBoost in a distributed environment and achieved good performance. Gradient Boosted Regression Trees (GBRT), the class of models that XGBoost implements, are also applied in numerous production use cases. For example, GBRT is commonly combined with Logistic Regression or Factorization Machines to solve CTR (click-through rate) prediction problems.
In this blog post, I’ll provide step-by-step instructions for setting up XGBoost on a 3-node Hadoop cluster (Ubuntu EC2 instances). We achieved good performance by running XGBoost over the Message Passing Interface (MPI) with MapR-FS, and we recommend setting up POSIX clients for XGBoost training tasks.
It is easy to set up distributed XGBoost with the NFS interface of the MapR File System. One can use MPI or Apache YARN to manage and run the distributed XGBoost library. With the high availability feature that MapR-FS provides, even when some nodes go down, the data will stay intact; we can still have XGBoost running on other nodes and achieve good performance. Note that this setup requires MapR 5.0 with an M5 or M7 license.
In addition to running XGBoost on MapR cluster nodes, I recommend that you run XGBoost on MapR POSIX clients. Dedicating several nodes to training the XGBoost model increases training speed. Also, running XGBoost on MPI has little effect on YARN resource management on the MapR cluster nodes. Lastly, POSIX clients provide stable access to MapR-FS.
I’ll skip the steps to install MapR on the 3-node AWS EC2 Ubuntu instances, since they are well documented. Also, you’ll need to install clustershell on one node and make sure passwordless SSH between the nodes is set up for the user.
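Before running any clush command, it can be worth confirming that passwordless SSH really works from the node where clustershell is installed. The following is a minimal Python sketch, not part of the original setup: the function names are hypothetical, and it assumes the standard OpenSSH client, whose BatchMode option makes ssh fail instead of prompting for a password.

```python
import subprocess


def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]


def unreachable_hosts(hosts):
    """Return the hosts that do not accept passwordless SSH from this node."""
    failed = []
    for host in hosts:
        # A non-zero exit code means the login (or the connection) failed.
        result = subprocess.run(ssh_check_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed
```

Calling unreachable_hosts with the addresses of the three nodes should return an empty list when the keys are set up correctly.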
First, install the necessary tools on each node:
clush -a 'sudo apt-get update && sudo apt-get install -y build-essential libcurl4-openssl-dev mpich2 git'
Then, clone wormhole and build it with distributed capability:
clush -a 'cd ~/github && git clone https://github.com/dmlc/wormhole.git'
clush -a 'cd ~/github/wormhole/ && make deps -j4 && make -j4 USE_S3=1'
The wormhole build does not work as is, so we build XGBoost separately:
clush -a 'rm -r -f ~/github/wormhole/repo/dmlc-core && cd ~/github/wormhole/repo && git clone https://github.com/dmlc/dmlc-core.git'
clush -a 'rm -r -f ~/github/wormhole/repo/Xgboost && cd ~/github/wormhole/repo && git clone https://github.com/dmlc/Xgboost.git'
cd ~/github/wormhole/repo/dmlc-core/make   # do this on each of the 3 nodes separately
vi config.mk   # in the config file, set USE_S3 = 1
clush -a 'cp ~/github/wormhole/repo/dmlc-core/make/config.mk ~/github/wormhole/repo/dmlc-core/.'
clush -a 'cd ~/github/wormhole/repo/dmlc-core && make'
clush -a 'cd ~/github/wormhole/repo/Xgboost && make dmlc=../dmlc-core'
clush -a 'cp ~/github/wormhole/repo/Xgboost/Xgboost ~/github/wormhole/bin/Xgboost.dmlc'
To run MPI, we need to set up a hostfile that lists each node by either its IP address or its hostname (obtained with hostname -f):
Example (a file named hosts):
10.0.0.186
10.0.0.185
10.0.0.184
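The hostfile can also be generated instead of typed by hand. Here is a minimal Python sketch, assuming the one-host-per-line layout of the example above; the helper name write_hostfile is hypothetical and is not part of wormhole or MPI.

```python
from pathlib import Path


def write_hostfile(hosts, path="hosts"):
    """Write one host (IP address or hostname) per line and return the list written."""
    # Drop blank entries and surrounding whitespace.
    cleaned = [h.strip() for h in hosts if h.strip()]
    # Reject duplicates so that no node is listed twice by accident.
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("duplicate host in hostfile")
    Path(path).write_text("\n".join(cleaned) + "\n")
    return cleaned
```

For the cluster in this post, write_hostfile(["10.0.0.186", "10.0.0.185", "10.0.0.184"]) would produce the hosts file shown in the example.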
To run XGBoost with MPI, use the following command (hosts is the hostfile we set up, and -n sets the number of workers):
~/github/wormhole/tracker/dmlc_mpi.py -n 3 -H hosts ~/github/wormhole/bin/Xgboost.dmlc ./softmax.conf
softmax.conf is a configuration file for running XGBoost. You can also wrap this command in Python, combining subprocess calls with a parameter tuning package, to tune the parameters of the XGBoost model. You can also add the model to an ensemble or stacking process.
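As a rough illustration of such a wrapper, the sketch below runs a small grid search by launching the MPI command once per parameter combination. It assumes that the XGBoost command-line binary accepts name=value overrides after the configuration file and that the tracker passes them through; the function names and the search space are hypothetical, while the paths are the ones used in this post.

```python
import itertools
import os
import subprocess

TRACKER = os.path.expanduser("~/github/wormhole/tracker/dmlc_mpi.py")
BINARY = os.path.expanduser("~/github/wormhole/bin/Xgboost.dmlc")


def build_command(conf, overrides, workers=3, hostfile="hosts"):
    """Return the MPI launch command with name=value overrides appended."""
    cmd = [TRACKER, "-n", str(workers), "-H", hostfile, BINARY, conf]
    cmd += [f"{name}={value}" for name, value in sorted(overrides.items())]
    return cmd


def grid(space):
    """Yield one dict per combination of the candidate values in `space`."""
    names = sorted(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))


def run_grid(conf, space):
    """Train once per combination and return (params, exit code) pairs."""
    results = []
    for params in grid(space):
        completed = subprocess.run(build_command(conf, params))
        results.append((params, completed.returncode))
    return results
```

A call such as run_grid("./softmax.conf", {"eta": [0.1, 0.3], "max_depth": [4, 6]}) would launch four training runs. Comparing the runs would still require parsing the evaluation metric from each run's output, which this sketch leaves out.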
To run XGBoost on YARN, go to:
“/wormhole/repo/dmlc-core/yarn/src/org/apache/hadoop/yarn/dmlc” and modify ApplicationMaster.java and Client.java to change the various YARN parameters. I found it very difficult to tune these settings so that memory is fully utilized.
In the test environment, we set the same random seed in every configuration. 3-node XGBoost running on MPI+MapR-FS took half the time to converge (early stopping round = 3) compared to a single node. 3-node YARN+HDFS performance was considerably worse than MPI. The reason is that YARN requires memory to be pre-assigned to containers through configuration, and it is difficult to know in advance how much memory the training task needs. I had to cache the data to make sure the application was not killed by the YARN manager. Achieving the best performance on YARN would require further tuning of the YARN parameters. We also observed that MPI+MapR-FS can handle much larger data sets; I haven’t been able to configure YARN to run at that scale. Also, XGBoost saves the data matrix into a binary buffer, and read and write performance for it is better on MapR-FS than on HDFS.
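To make the convergence criterion concrete: with an early stopping round of 3, training halts once the evaluation metric has failed to improve for 3 consecutive rounds. The following Python sketch illustrates that rule under the assumption that a lower metric (an error rate) is better; the function name is hypothetical and the logic is a generic illustration, not XGBoost's internal implementation.

```python
def stopping_round(errors, patience=3):
    """Return the 0-based round at which training stops, or None if it never does.

    Training stops once `patience` consecutive rounds fail to improve on the
    best (lowest) error seen so far.
    """
    best = float("inf")
    stale = 0  # consecutive rounds without improvement
    for i, err in enumerate(errors):
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None
```

Under this rule, "time to converge" is the wall-clock time until the stopping round is reached, which is why the same random seed and the same patience value are needed for a fair comparison between setups.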
On a MapR POSIX client, the setup is the same; one just needs to mount a local directory to open up NFS access to MapR-FS.
In this blog post, I discussed how to set up XGBoost on a 3-node Hadoop cluster with MapR-FS. In a future post, I’ll discuss some use cases for this distributed machine learning framework.
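Before launching a training job on a client, it can help to confirm that the mount is actually in place. Here is a small Python sketch; the function name is hypothetical, and the default path /mapr is an assumption about where MapR-FS is mounted, so substitute the directory you actually used.

```python
import os


def mapr_fs_available(mount_point="/mapr"):
    """Return True if `mount_point` is a mounted, readable file system."""
    # os.path.ismount is False for missing paths and for plain directories.
    return os.path.ismount(mount_point) and os.access(mount_point, os.R_OK)
```

A wrapper script could call this check first and refuse to start training if it returns False, instead of failing later when XGBoost tries to read its input.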
If you have any additional questions about setting up XGBoost on MapR-FS, please ask them in the comments section below.