Hadoop-based Bootstrapped SVM

What is Bootstrapping?

Bootstrapping in machine learning is the process of randomly sampling a large number of smaller datasets from an input dataset and then independently training a model on each of them. Once training is done, the test data is evaluated against all the models and the results are aggregated. The aggregation can be a voting scheme, a sum, a weighted sum, or a more complex method such as training another classifier over the results. This technique is an ensemble method. It reduces the variability of the results, which increases confidence in the predictions, and it can also increase the accuracy of the classifiers.
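The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the service's implementation: the samples are drawn with replacement, the aggregation is a simple majority vote, and the toy one-dimensional learner train_threshold stands in for the SVM. All function names here are hypothetical.

```python
import random
from collections import Counter


def bootstrap_samples(data, num_samples, sample_size, seed=0):
    """Draw num_samples datasets of sample_size points each, with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(data) for _ in range(sample_size)]
            for _ in range(num_samples)]


def majority_vote(labels):
    """Return the label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]


def train_threshold(sample):
    """Toy stand-in for an SVM: a 1-D threshold between the two class means.

    Assumes the +1 class has larger x values than the -1 class.
    """
    pos = [x for x, y in sample if y > 0]
    neg = [x for x, y in sample if y < 0]
    if pos and neg:
        cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    else:
        # Only one class was sampled: fall back to the mean of the sample.
        cut = sum(x for x, _ in sample) / len(sample)
    return lambda x: 1 if x > cut else -1


def train_ensemble(data, train, num_samples, sample_size, seed=0):
    """Train one independent model per bootstrap sample."""
    samples = bootstrap_samples(data, num_samples, sample_size, seed)
    return [train(sample) for sample in samples]


def predict(models, x):
    """Evaluate x against every model and aggregate by majority vote."""
    return majority_vote([model(x) for model in models])


if __name__ == "__main__":
    data = [(float(x), -1) for x in range(10)]
    data += [(float(x), 1) for x in range(20, 30)]
    models = train_ensemble(data, train_threshold, num_samples=5, sample_size=8)
    print(predict(models, -5.0), predict(models, 35.0))
```

Using an odd number of models avoids tied votes in the two-class case.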

How much does it cost?

The service is offered at the same cost as running Hadoop in the Amazon cloud. For more information see here.

The Procedure

Let’s jump right into what you need to do. Here are the steps:

What you need:

  1. The Java runtime (JRE) installed on your Linux box.
  2. The unzip utility installed on Linux (sudo apt-get install unzip).
  3. An Amazon EC2 account set up with all the required credentials (account ID, keys, certificates, etc.). If you have ever launched a machine on the AWS EC2 service, you already have all of these.
  4. Training and testing data in 1305 file format. Presently we do not support cross-validation, though that is coming soon.

Download this package and unzip it in your favorite directory. To set up, start, and terminate a Hadoop cluster you need to edit the environment-variables file in the package. At this point we suggest you look through the file. There are 10 variables you need to edit. These are explained below. You may SKIP these explanations if you are an intermediate/advanced Amazon Cloud user.

  1. JAVA_HOME: The directory where your JDK or JRE is installed. This is the parent of the bin directory that contains the java executable.
  2. HADOOP_INSTALLATION_DIR: The parent directory under which the Hadoop scripts will be copied, in a folder called hadoop. You may use your home directory.
  3. EC2_INSTALLATION_DIR: The parent directory under which the Amazon EC2 tools will be copied, in a folder called ec2-api-tools. You may use your home directory.

The rest of the variables are Amazon-specific and can be found on the security credentials page of your AWS account. If you have ever launched an EC2 instance, you will already have all of these.

  1. EC2_PRIVATE_KEY: The path to the private key created for your X.509 certificate. You can find these here under the X.509 Certificates tab. You may create a new one or use your existing one. Please provide the full path to the file containing the key.
  2. EC2_CERT: The full path to the X.509 certificate that goes with the private key in EC2_PRIVATE_KEY.
  3. AWS_ACCOUNT_ID: The 12-digit account number, shown in the format XXXX-XXXX-XXXX at the top right of your security credentials page. Please DO NOT include the hyphens.
  4. AWS_ACCESS_ID: Found on your security credentials page under the tab “Access Keys”. It is the “Access Key ID”.
  5. AWS_SECRET_ACCESS_KEY: Found in the same place as AWS_ACCESS_ID. It is the “Secret Access Key”. Click Show to view and copy it.

If you have ever launched an EC2 instance, you used a key pair that had a name and a file associated with it. You specified the name of the key pair during the launch and used the file to ssh into the machine. The next 2 variables are exactly these. To find them, log into the Amazon Web Services Console and click on “Key Pairs” on the side.

  1. NAME_OF_PRIVATE_KEY: The “Key Pair Name” that you want to use to log into the Hadoop master node.
  2. PATH_TO_PRIVATE_KEY: The path to the local file where you have stored the key. This is necessary because the Hadoop master node needs to communicate with the slaves, so we copy the key to the master node. Rest assured our AMIs follow the highest standards in network security.
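Before running the setup, it can help to check that all 10 variables have been filled in. The following Python sketch does that under a loud assumption: the layout of the environment-variables file is not documented here, so the sketch assumes shell-style lines of the form NAME=value, optionally prefixed with export. The function names are hypothetical and not part of the package.

```python
REQUIRED_VARIABLES = [
    "JAVA_HOME",
    "HADOOP_INSTALLATION_DIR",
    "EC2_INSTALLATION_DIR",
    "EC2_PRIVATE_KEY",
    "EC2_CERT",
    "AWS_ACCOUNT_ID",
    "AWS_ACCESS_ID",
    "AWS_SECRET_ACCESS_KEY",
    "NAME_OF_PRIVATE_KEY",
    "PATH_TO_PRIVATE_KEY",
]


def parse_env_text(text):
    """Parse NAME=value lines (assumed format) into a dictionary."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def find_problems(values):
    """Return a list of human-readable problems; empty if none were found."""
    problems = [name + " is not set"
                for name in REQUIRED_VARIABLES if not values.get(name)]
    account = values.get("AWS_ACCOUNT_ID", "")
    if account and not (account.isdigit() and len(account) == 12):
        problems.append("AWS_ACCOUNT_ID should be 12 digits with no hyphens")
    return problems


if __name__ == "__main__":
    with open("environment-variables") as handle:
        for problem in find_problems(parse_env_text(handle.read())):
            print(problem)
```

The check only looks at whether values are present and at the account number format described above; it does not verify that the paths or credentials are valid.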

Setup Hadoop

The Hadoop setup does the following:

  1. Downloads and extracts the Hadoop package.
  2. Downloads and extracts the Amazon EC2 tools package.
  3. Updates the relevant files in the Hadoop package and exports the required variables.

Run the command

./setup_hadoop

from the command line once you have finished editing environment-variables. That’s it; you are now ready to launch Hadoop clusters.

Launch Hadoop Cluster

First, you will need to sign up to use our AMI. This is just the way Amazon Web Services works. It is a one-click procedure and essentially informs you about the terms and conditions of the service. Please click here to go and enable your EC2 account to use our service.

To launch a cluster you need to specify the name of the cluster and the number of slave nodes. Run the following command:

./start_cluster <name of cluster> <number of slaves>

This will launch (<number of slaves> + 1) machines, starting with the master. The script waits until the master is reachable through ssh, issues the commands to launch the slaves, and then exits. Thus, you can ssh into the master as soon as the script finishes, but the cluster will not be ready until the slaves have booted. You can easily check this on the AWS console.

Note: If this is the first time you launch a cluster named <name of cluster>, 2 new security groups will be created, viz.

<name of cluster> and <name of cluster>-master

These are added to your account, and the permissions required for the cluster to function properly are created. The next time you launch a cluster with the same name, new groups will NOT be created.

Run Hadoop SVM Jobs

Upload the training and test files to the master node. Remember that the user name for our AMIs is “a1305”. You can use the same method as before to ssh and scp to the master. Then ssh into the master. Once there, look at the file “/home/a1305/SampleSvmConfiguration”. You need to fill in the fields in this file. The names are quite self-explanatory, but here is a quick overview:

  1. JobName: Should uniquely identify a train-test sequence.
  2. InputFileName: Location of input training file.
  3. OutputFolderName: The folder into which the sampled files are copied. This must already exist and be empty.
  4. NumberOfSplits: The number of samples to extract from the training file.
  5. SampleRatio: The percentage of points to sample from the training file.
  6. Sigma: Bandwidth of the Gaussian kernel to use in the SVM. Soon we will have an auto-tuning SVM that will determine the best value for you.
  7. SvmCParamater: The C parameter of the SVM. If you leave this blank, we estimate a good value for it. Feel free to enter a value if you know your data well.
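To give a feel for what Sigma controls, here is a small Python sketch of a Gaussian kernel. It assumes the common convention K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); the exact parameterization used by the service is not stated here, so treat this only as an illustration. The function name is hypothetical.

```python
import math


def gaussian_kernel(x, y, sigma):
    """Gaussian (RBF) kernel under the assumed convention
    K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


if __name__ == "__main__":
    point_a = [0.0, 0.0]
    point_b = [1.0, 1.0]
    # A larger bandwidth makes distant points look more similar.
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, gaussian_kernel(point_a, point_b, sigma))
```

Under this convention, a small Sigma makes the kernel value drop quickly with distance, so each training point influences only its close neighbours, while a large Sigma gives a smoother decision boundary.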

Now run the following command:

./SetupSvmTraining SampleSvmConfiguration

This creates the necessary files and copies them to HDFS (the Hadoop Distributed File System). It also creates a file “configure.txt” in the present working directory. Finally, it outputs the Hadoop command you need to run to start the training. Copy this command and run it from the same directory where the “configure.txt” file is.

Your bootstrapped SVM has now started, and you will be able to see its progress on the console. You can also view the progress in your browser at the address http://<ip address of master>:50030.

Once the process is complete, you will want to test the models. Run the following command:

./SetupSvmTesting <Job Name From Previous Step> <local path to test data file>

This command simply outputs the Hadoop command you need to run and creates a configure.txt file for this phase. Again, run the Hadoop command from the same directory as the “configure.txt” file. Testing will begin.

Once the process has completed, the aggregated results will be in HDFS. You can list the results directory with the following command:

/usr/local/hadoop/bin/hadoop dfs -ls /user/a1305/results/<Job Name>

The result files follow the Hadoop naming scheme and look like “part-00000”, “part-00001”. The number of parts is determined by the size of your test file. To copy all the files from HDFS to a local directory, use the following command:

/usr/local/hadoop/bin/hadoop dfs -copyToLocal /user/a1305/results/<Job Name>/* .

This will copy all the files in the results folder to your current directory. The format of the result files is:

<0 based line number><tab><prediction>

If the prediction is greater than 0, the procedure predicted +1; if it is less than 0, the procedure predicted -1. Presently we support only the voting scheme of aggregation.
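Once the part files are in a local directory, the predictions can be collected and turned into labels with a short script. The Python sketch below relies only on the file naming and line format described above. The function names are hypothetical, and because the text does not say how a prediction of exactly 0 should be read, the sketch returns None in that case rather than guessing.

```python
import glob
import os


def read_predictions(results_dir):
    """Read every part-* file and map 0-based line number to prediction."""
    scores = {}
    for path in sorted(glob.glob(os.path.join(results_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue
                number, score = line.split("\t")
                scores[int(number)] = float(score)
    return scores


def to_label(score):
    """Map a prediction to +1 or -1; a value of exactly 0 is left undefined."""
    if score > 0:
        return 1
    if score < 0:
        return -1
    return None


if __name__ == "__main__":
    predictions = read_predictions(".")
    for line_number in sorted(predictions):
        print(line_number, to_label(predictions[line_number]))
```

Sorting by line number restores the order of the test file, since the lines may be spread across several part files.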

Finally, if you want access to your models, they are in “/user/a1305/models/<Job Name>” on HDFS and can be copied in the same way as the results.

Suggestions and bug reports

We would like to know your opinion about our product. Please send bug reports and suggestions to support@analytics1305.com.