Bootstrapping in machine learning is the process of taking an input dataset, randomly sampling a large number of smaller datasets from it, and then independently training a model on each of these samples. Once training is done, the test data is evaluated against all the models and the results are aggregated. The aggregation can be a voting scheme, a sum, a weighted sum, or a more complex method such as training another classifier over the results. This approach belongs to the family of ensemble methods. It is an excellent way to reduce the variability of the results, which increases confidence in the predictions and can improve the accuracy of the classifiers.
The service is offered at the same cost as running Hadoop in the Amazon cloud. For more information see here.
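The procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code the service runs: it assumes the samples are drawn with replacement, it uses a toy nearest-centroid learner on a single numeric feature in place of an SVM, and every function name in it is hypothetical.

```python
import random


def bootstrap_sample(data, size, rng):
    """Draw `size` rows from `data` uniformly at random (assumed: with replacement)."""
    return [rng.choice(data) for _ in range(size)]


def train_centroid_model(sample):
    """Toy stand-in for a real learner such as an SVM.

    Each row is (feature, label). The "model" is the mean feature per label.
    """
    sums, counts = {}, {}
    for x, label in sample:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, x):
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def train_ensemble(data, n_models, sample_size, seed=0):
    """Train one independent model per bootstrap sample."""
    rng = random.Random(seed)
    return [
        train_centroid_model(bootstrap_sample(data, sample_size, rng))
        for _ in range(n_models)
    ]


def vote(models, x):
    """Aggregate by voting: sum the +1/-1 predictions and take the sign.

    A tie (sum of 0) is reported as 0; using an odd number of models avoids it.
    """
    total = sum(predict(model, x) for model in models)
    return (total > 0) - (total < 0)


if __name__ == "__main__":
    training = [(-2.0, -1), (-1.5, -1), (-1.0, -1), (1.0, 1), (1.5, 1), (2.0, 1)]
    models = train_ensemble(training, n_models=11, sample_size=4)
    print(vote(models, 1.8), vote(models, -1.8))
```

Each model sees a different random sample, so individual models disagree on borderline points; the vote smooths out those disagreements, which is where the reduction in variability comes from.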
Let’s jump right into what you need to do. Here are the steps:
What you need:
Download this package and unzip it in your favorite directory. To set up, start, and terminate a Hadoop cluster you need to edit the environment-variables file in the package. At this point we suggest you eyeball the file. There are 10 variables you need to edit. These are explained below. You may SKIP these explanations if you are an intermediate/advanced Amazon Cloud user.
The rest of the variables are Amazon specific and can be found under the security credentials page of your AWS account. If you have ever launched an EC2 instance you will already have all of these.
If you have ever launched an EC2 instance, you used a key pair, which has a name and a file associated with it. You specified the name of the key pair during the launch and used the file to ssh into the machine. The next 2 variables are exactly these. To find them, log into the Amazon Web Services Console and click on “Key Pairs” on the side.
The Hadoop setup does the following:
Run the command
./setup_hadoop
from the command line once you have finished editing environment-variables. That’s it; you are now ready to launch Hadoop clusters.
First, you will need to sign up to use our AMI. This is just the way Amazon Web Services works. It is a one-click procedure and essentially informs you about the terms and conditions of the service. Please click here to go and enable your EC2 account to use our service.
To launch a cluster you need to specify the name of the cluster and the number of slave nodes. To launch the cluster, run the following command:
./start_cluster <name of cluster> <number of slaves>
This will launch (<number of slaves> + 1) machines, starting with the master. The script waits until the master is reachable through ssh, issues the commands to launch the slaves, and then exits. You can therefore ssh into the master as soon as the script finishes, but the cluster will not be ready until the slaves have booted. You can easily check this on the AWS console.
Note: If this is the first time you launch a cluster named <name of cluster>, 2 new security groups will be created, namely:
<name of cluster> and <name of cluster>-master
These are added to your account, and the permissions required for the cluster to function properly are created. The next time you launch a cluster with the same name, new groups will NOT be created.
Upload the training and test files to the master node. Remember that the user name for our AMIs is “a1305”. You can use the same method as before to ssh and scp to the master. Then ssh into the master. Once there, look at the file “/home/a1305/SampleSvmConfiguration”. You need to fill in the fields in this file. The names are quite self-explanatory, but here is a quick overview:
Now run the following command:
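If you would rather check from your own machine whether a node is accepting ssh connections yet, a small poll of TCP port 22 is one way to do it. The sketch below is an assumption about how such a check could look, not what the start_cluster script actually does; the function names, the retry count, and the delay are hypothetical. Note that an open ssh port only shows that the machine has booted, not that its Hadoop daemons are running.

```python
import socket
import time


def port_is_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


def wait_for_port(host, port=22, attempts=30, delay=10.0):
    """Poll host:port up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if port_is_open(host, port):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_for_port with the public address of each slave, as shown on the AWS console, tells you when every machine has come up.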
./SetupSvmTraining SampleSvmConfiguration
This creates the necessary files and copies them to HDFS (the Hadoop Distributed File System). It also creates a file “configure.txt” in the present working directory. Finally, it outputs the hadoop command you need to run to start the training. Copy this command and run it from the directory that contains the “configure.txt” file.
Your bootstrapped SVM has now started, and you will be able to see its progress on the console. You can also view the progress in your browser at http://<ip address of master>:50030.
Once the process is complete, you will want to test the models. Run the following command:
./SetupSvmTesting <Job Name From Previous Step> <local path to test data file>
This command simply outputs the hadoop command you need to run and creates a configure.txt file for this phase. Again, run the hadoop command from the directory that contains the “configure.txt” file. Testing will begin.
Once the process has completed, the aggregated results will be in HDFS. You can view the HDFS directory with the following command:
/usr/local/hadoop/bin/hadoop dfs -ls /user/a1305/results/<Job Name>
The result files follow the Hadoop naming scheme and look like “part-00000”, “part-00001”. The number of parts is determined by the size of your test file. To copy all the files from HDFS to a local directory, use the following command:
/usr/local/hadoop/bin/hadoop dfs -copyToLocal /user/a1305/results/<Job Name>/* .
This will copy all the files in the results folder to your current directory. The format of the result files is:
<0 based line number><tab><prediction>
If the prediction is greater than 0, the predicted label is +1; if it is less than 0, the predicted label is -1. At present we support only the voting scheme of aggregation.
Finally, if you want access to your models, they are in “/user/a1305/models/<Job Name>” on HDFS and can be copied in the same way as explained above for the results.
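Once the part files are in your current directory, turning them into labels takes only a short script. The following is a minimal sketch that assumes each line has exactly the tab-separated format shown above and that the prediction is numeric; the function names are hypothetical, and a prediction of exactly 0 (a case not described here) is reported as 0 rather than guessed.

```python
import glob


def to_label(prediction):
    """Map an aggregated prediction to +1 or -1 by its sign (0 stays 0)."""
    if prediction > 0:
        return 1
    if prediction < 0:
        return -1
    return 0


def parse_predictions(lines):
    """Turn '<0 based line number><tab><prediction>' lines into {line_number: label}."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        line_number, prediction = line.split("\t")
        labels[int(line_number)] = to_label(float(prediction))
    return labels


if __name__ == "__main__":
    labels = {}
    # The part files copied from HDFS, e.g. part-00000, part-00001, ...
    for path in sorted(glob.glob("part-*")):
        with open(path) as handle:
            labels.update(parse_predictions(handle))
    for line_number in sorted(labels):
        print(line_number, labels[line_number])
```

Sorting by line number restores the order of your test file, since the parts themselves are not guaranteed to be read in that order.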
We would like to know your opinion of our product. Please send bug reports and suggestions to support@analytics1305.com.