Blog [R]: This is maama's second adda, dedicated exclusively to articles on the programming language R! <br> You can find the original maama's adda <a href="http://shashidhar26.blogspot.com/"><b>here</b></a><br />
<b>Replicating PROC FASTCLUS in R using kmeans</b><br />
Shashia, 2015-05-26 (updated 2021-02-26)<br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A lot of businesses have bought into the idea of making decisions driven by data, and R is one of the foremost statistical tools helping executives take those ‘data-driven’ decisions. A flip side of being data-driven in your approach is that you get accustomed to looking at a certain type of data in a particular format, and freak out at even a slight deviation from this standard. Hence, even if far better methods are available to solve a problem, data scientists must usually subscribe to what is widely accepted in the industry.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
It is in this context that we explore the classical analytics problem of classification. There are many ‘state of the art’ machine learning algorithms in this space, such as trees, random forests, Bayesian networks, neural networks and fuzzy logic, and R has an implementation for each of these. However, even before we venture into these classification techniques (which need labeled data), we would want to find labels in the data using segmentation techniques. To this end, the technique that still finds wide use in the industry is ‘k-means’ clustering, along with its proprietary implementation in SAS: PROC FASTCLUS. In this post, we discuss how to replicate the FASTCLUS procedure in R. This post covers the following topics:
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
- Kmeans clustering in R <br />
- Replicating SAS PROC FASTCLUS in R <br />
- Statistics for variables, Total STD, Within STD, R-Square and RSQ/(1-RSQ) ratios in R kmeans <br />
- Visualization of k-means cluster results <br />
</span></div>
<br />
<div align="center" class="MsoNormal" style="text-align: justify;">
<span style="color: #222222; font-size: 12.0pt; line-height: 115%; mso-bidi-font-size: 11.0pt;">
<b>
K Means clustering in R
</b>
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
R implements the k-means solution in the function <i>kmeans</i>. At the most basic level, k-means is a minimization problem: it tries to partition ‘n’ observations into ‘k’ clusters such that the <i>‘within-cluster-sum-of-squares’</i> is as small as possible. It might not be the most efficient way to cluster data when you know nothing about the data, but if you have an idea that there are fixed patterns or a finite number of segments in the data, you can use k-means to validate this intuition. Usually, a k-means solution is run after a more generalized hierarchical clustering technique. You can read more about the details of the algorithm, its drawbacks and overall efficiency <a href="http://en.wikipedia.org/wiki/K-means_clustering">here</a>.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If you want to run k-means clustering on the famous Fisher’s Iris data on R, you just have to use the command:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">kmeans(x <span style="color: #333333;">=</span> iris[<span style="color: #6600ee; font-weight: bold;">-5</span>],centers <span style="color: #333333;">=</span> <span style="color: #6600ee; font-weight: bold;">3</span>,iter.max <span style="color: #333333;">=</span> <span style="color: #6600ee; font-weight: bold;">100</span>)
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
and right away, you’ll have the following output on the console:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1YqxKDlrxuv0oA5oSsMAKIqzueNFK4gAesn5oDR1oNdXjfFbcWZN8G4oSDxsP8VjnILClvM7wHOD8eW8u0jctm5TRvomMX9noanqqEfsEQ0GtwOMwz9E9LdVmdH4I9HDZ6hJSIugvm1pB/s1600/kmeans_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1YqxKDlrxuv0oA5oSsMAKIqzueNFK4gAesn5oDR1oNdXjfFbcWZN8G4oSDxsP8VjnILClvM7wHOD8eW8u0jctm5TRvomMX9noanqqEfsEQ0GtwOMwz9E9LdVmdH4I9HDZ6hJSIugvm1pB/s640/kmeans_1.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Note: You might get a different solution because k-means is an optimizing algorithm whose result depends heavily on the initial seeds. Run the command two or three times and you should see the same solution appear. You could also call <i>‘set.seed(XX)’</i> before <i>kmeans</i> to get the same result on every run, choosing a seed that gives the solution with 50, 62 and 38 observations in the three clusters.
</span></div>
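<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A minimal sketch of making the run reproducible is given here. The seed value 42 and <i>nstart = 25</i> are arbitrary choices of mine, not values from the original run; <i>nstart</i> asks <i>kmeans</i> to try several random starts and keep the one with the lowest total within-cluster sum of squares.
</span></div>

```r
# Fix the random seed so that the random starting centers are reproducible.
set.seed(42)  # 42 is an arbitrary, assumed value

# nstart = 25 (assumed) tries 25 random starts and keeps the best solution.
fit <- kmeans(x = iris[-5], centers = 3, iter.max = 100, nstart = 25)

fit$size          # cluster sizes; the order of the labels may differ by seed
fit$tot.withinss  # total within-cluster sum of squares (the quantity minimised)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
With several starts, the cluster sizes should settle on 38, 50 and 62 in some order, since the cluster numbering itself is arbitrary.
</span></div>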
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The ‘cluster means’ give the mean of each variable within each cluster. The clustering vector tells us the cluster to which every observation belongs. The ratio <i>(between_SS/total_SS = 88.4%)</i> tells us that 88.4% of the total sum of squares lies between the clusters and only 100-88.4 = 11.6% lies within them. This tells us that the clusters are fairly tightly packed, which is the objective of k-means clustering. For a fixed number of clusters, the higher the between_SS/total_SS ratio (also known as the overall R-SQUARE), the better our cluster solution.
</span></div>
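<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The ratio printed on the console can also be recomputed from the components of the result object. The sketch below uses the built-in <i>iris</i> data; the seed and <i>nstart</i> values are my own assumptions, added only to make the run repeatable.
</span></div>

```r
set.seed(1)  # assumed seed, only for repeatability
fit <- kmeans(iris[-5], centers = 3, iter.max = 100, nstart = 25)

# kmeans stores the decomposition totss = betweenss + tot.withinss.
r_square     <- fit$betweenss / fit$totss      # overall R-SQUARE
within_share <- fit$tot.withinss / fit$totss   # share left inside the clusters

round(100 * c(between = r_square, within = within_share), 1)
```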
<br />
<div align="center" class="MsoNormal" style="text-align: justify;">
<span style="color: #222222; font-size: 12.0pt; line-height: 115%; mso-bidi-font-size: 11.0pt;">
<b>
Replicating SAS PROC FASTCLUS in R
</b>
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Due to its dominance in business circles, the ODS output of SAS is something that most people are accustomed to looking at. If you ran PROC FASTCLUS in SAS on the same famous IRIS data, this is how you would do it:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">proc fastclus <span style="color: #008800; font-weight: bold;">data</span><span style="color: #333333;">=</span>sashelp.iris maxc<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">3</span> maxiter<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">10</span> <span style="color: #008800; font-weight: bold;">out</span><span style="color: #333333;">=</span>clus;
var SepalLength SepalWidth PetalLength PetalWidth;
run;
</pre>
</div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
And this is what the output would look like:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidgMa-KtZUJg4K_L24sHq_32Pf0QJkwQzgorBS6jTveOg-wcjp9OEAN5PkapL8yAofTDsC3K4gbvMM05AkcYWJ5GKi2XTcdXcIvUPfB_9LauM9qUE9uOsxjA5lixL5S_8_kSQffjMGmR8g/s1600/kmeans_02.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidgMa-KtZUJg4K_L24sHq_32Pf0QJkwQzgorBS6jTveOg-wcjp9OEAN5PkapL8yAofTDsC3K4gbvMM05AkcYWJ5GKi2XTcdXcIvUPfB_9LauM9qUE9uOsxjA5lixL5S_8_kSQffjMGmR8g/s400/kmeans_02.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1Ep5wZHc473QVoD_c0_nO9TVwKtmvLfmRFPwXR4yRl5hy0wCGy3GKdPr_RzMoYs-GMGpydSCAZKNBSP2ohXMh3F4LpFo967gLVs6RsnTDSgaoHZjt1W35KzK824C45qmFjyOklqqSFRO0/s1600/kmeans_03.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1Ep5wZHc473QVoD_c0_nO9TVwKtmvLfmRFPwXR4yRl5hy0wCGy3GKdPr_RzMoYs-GMGpydSCAZKNBSP2ohXMh3F4LpFo967gLVs6RsnTDSgaoHZjt1W35KzK824C45qmFjyOklqqSFRO0/s400/kmeans_03.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcBXkvyk6a-4bls86KMZiGkNWrdLWHH103j6x-HvJFO5k8RxuaEM1bz68JbQfy4ocSGwomCWMaTECVeqnJPaC9LqpPYnr7lsLYhkUdQAQbEYv5TDykYz11CzzSKZ5h_uFWZT36mHzLX-dz/s1600/kmeans_04.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="285" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjcBXkvyk6a-4bls86KMZiGkNWrdLWHH103j6x-HvJFO5k8RxuaEM1bz68JbQfy4ocSGwomCWMaTECVeqnJPaC9LqpPYnr7lsLYhkUdQAQbEYv5TDykYz11CzzSKZ5h_uFWZT36mHzLX-dz/s400/kmeans_04.png" width="400" /></a></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The first thing you notice is that there is a lot more output than in R, and it might be a bit overwhelming! But if you look at it closely, the crux of the output is still that the final solution of 3 clusters has 38, 50 and 62 observations and the overall R_SQUARE value is 88.4%, both of which were already reported in R. However, the statistics for variables, the pseudo-F, CCC, cluster means/std deviations, etc. are additional outputs which SAS presents in a nice format and which businesses are used to looking at.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Although it is not possible to replicate the SAS result 100% because the initial seeds chosen by SAS and R would vary considerably, there are some statistics like the <i>RSQ/(1-RSQ)</i> ratio per variable and the <i>pseudo-F</i>, which would definitely enhance the R output, and help to take a call on what variables are performing better in terms of separating the observations. I have tried to search for ready-made packages that could help but there seems to be none as of now. However, there is some help in the materials presented
<a href="https://books.google.com/books?id=0SHMAAAAQBAJ&pg=PA18&lpg=PA18&dq=total+STD,+within+STD+and+RSQ&source=bl&ots=KRAKWl7Kdx&sig=QNtLUlzBCl753VR9Yl5CR1Goi6A&hl=en&sa=X&ei=7sFdVcDPAabIsQSjnYGgBw&ved=0CCUQ6AEwAQ#v=onepage&q=total%20STD%2C%20within%20STD%20and%20RSQ&f=false">here</a>
and <a href="http://stats.stackexchange.com/questions/79097/validity-index-pseudo-f-for-k-means-clustering">here</a> which are the basis for the code below:
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
In order to replicate the exact output of SAS FASTCLUS, we first export the SAS’s IRIS data (which has all the variables <i>SepalLength, SepalWidth,PetalLength and PetalWidth</i> in mm as compared to IRIS data in R <i>datasets</i> package which has these in cm) and then import the same into R and then run the k-means:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">sas_iris<span style="color: #333333;"><-</span>read.csv(file <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">'sas_iris.csv'</span>,header <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">T</span>,sep <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">','</span>,fill <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">T</span>)
sas_clus<span style="color: #333333;"><-</span>kmeans(x <span style="color: #333333;">=</span> sas_iris[<span style="color: #6600ee; font-weight: bold;">-1</span>],centers <span style="color: #333333;">=</span> <span style="color: #6600ee; font-weight: bold;">3</span>,iter.max <span style="color: #333333;">=</span> <span style="color: #6600ee; font-weight: bold;">100</span>)
sas_clus
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This would produce the following output, similar to what we’ve seen before:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPN8_aUPQQ_q0_2q6hsTlICYuA8ViWVnhKQesCJi4sczYO2gk4fYmSI60q5S22VHSQcxj3khjeiu-zpu2C7Dvv8v60P5XW2L0E8MjX-9CIgW5hhYEdYpyzJMqMVU1sgXeBCWckxYAhtsjf/s1600/kmeans_05.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="322" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjPN8_aUPQQ_q0_2q6hsTlICYuA8ViWVnhKQesCJi4sczYO2gk4fYmSI60q5S22VHSQcxj3khjeiu-zpu2C7Dvv8v60P5XW2L0E8MjX-9CIgW5hhYEdYpyzJMqMVU1sgXeBCWckxYAhtsjf/s640/kmeans_05.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
We see that although the absolute values for the within_SS have changed (due to change in variable scale), the overall R-SQ value still remains at 88.4% with the observations grouped into 3 clusters of 62, 38 and 50 as before.
</span></div>
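<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If the SAS export is not at hand, the effect of the change in scale can be sketched with the built-in <i>iris</i> data by multiplying the centimetre measurements by 10. This is only an approximation of the SAS file (the column names differ, for example <i>Sepal.Length</i> instead of <i>SepalLength</i>), and the seed and <i>nstart</i> values are my own assumptions.
</span></div>

```r
iris_cm <- iris[1:4]       # built-in data, measured in cm
iris_mm <- iris_cm * 10    # assumed stand-in for the SAS data in mm

set.seed(1)
fit_cm <- kmeans(iris_cm, centers = 3, iter.max = 100, nstart = 25)
set.seed(1)
fit_mm <- kmeans(iris_mm, centers = 3, iter.max = 100, nstart = 25)

# Sums of squares grow with the square of the scale factor (10^2 = 100) ...
ss_factor <- fit_mm$tot.withinss / fit_cm$tot.withinss

# ... while the overall R-SQ, being a ratio of sums of squares, is unchanged.
rsq_cm <- fit_cm$betweenss / fit_cm$totss
rsq_mm <- fit_mm$betweenss / fit_mm$totss
c(ss_factor = ss_factor, rsq_cm = rsq_cm, rsq_mm = rsq_mm)
```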
<br />
<div align="center" class="MsoNormal" style="text-align: justify;">
<span style="color: #222222; font-size: 12.0pt; line-height: 115%; mso-bidi-font-size: 11.0pt;">
<b>
Statistics for variables and RSQ/(1-RSQ) ratio
</b>
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
We now know that the variables <i>SepalLength, SepalWidth, PetalLength and PetalWidth</i> can together create a great deal of separation between the species. But is there a way to know statistically which of these variables gives the highest degree of separation and which gives the least? This is where the table with statistics for variables comes in, something that the R output misses out. According to the post <a href="https://www.linkedin.com/grp/post/77616-5808873957218070530">here</a>, a good way to find that out would be to run a simple linear regression of the variable against the classified cluster and take the adjusted R-Square as a proxy for the strength of the variable:
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Something like this:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">sas_iris<span style="color: #333333;">$</span>classified<span style="color: #333333;"><-</span>sas_clus<span style="color: #333333;">$</span>cluster
summary(lm(formula <span style="color: #333333;">=</span> SepalLength<span style="color: #333333;">~</span>classified,data <span style="color: #333333;">=</span> sas_iris))
summary(lm(formula <span style="color: #333333;">=</span> SepalWidth<span style="color: #333333;">~</span>classified,data <span style="color: #333333;">=</span> sas_iris))
summary(lm(formula <span style="color: #333333;">=</span> PetalLength<span style="color: #333333;">~</span>classified,data <span style="color: #333333;">=</span> sas_iris))
summary(lm(formula <span style="color: #333333;">=</span> PetalWidth<span style="color: #333333;">~</span>classified,data <span style="color: #333333;">=</span> sas_iris))
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
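Although this gives an R-Square, the cluster label enters the regression as an arbitrary integer, so the fit depends on how the clusters happen to be numbered, and the values vary too much to make any inference about the strength of a variable.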
</span></div>
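<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
One way to take the numbering out of the picture is to treat the cluster label as a factor, so that the regression fits one mean per cluster. The sketch below does this on the built-in <i>iris</i> data; the helper name <i>r2_by_cluster</i>, the seed and <i>nstart</i> are my own choices and not part of the original post.
</span></div>

```r
set.seed(1)  # assumed seed
cl <- kmeans(iris[1:4], centers = 3, iter.max = 100, nstart = 25)$cluster

# Hypothetical helper: R-squared of a one-way fit of a variable on the
# cluster label, with the label treated as a category, not a number.
r2_by_cluster <- function(v, cluster) {
  summary(lm(v ~ factor(cluster)))$r.squared
}

r2_factor <- sapply(iris[1:4], r2_by_cluster, cluster = cl)

# Renumbering the clusters leaves the result unchanged.
relabelled    <- c(3, 1, 2)[cl]
r2_relabelled <- sapply(iris[1:4], r2_by_cluster, cluster = relabelled)

round(rbind(r2_factor, r2_relabelled), 4)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
With a factor on the right-hand side, this R-squared is simply one minus the within-cluster sum of squares divided by the total sum of squares for that variable.
</span></div>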
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A close replica of the SAS output table can be generated by computing the Total Standard Deviation (Total STD) and the within-cluster Standard Deviation (Within STD), and then using these in the formula
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Variable RSquare = 1- (within STD/Total STD)^2
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
and the <i>RSQ/(1-RSQ)</i> ratio as well. Although the SAS output cannot be reproduced exactly up to 4 decimal places, because the exact formulae used by SAS are not readily available on the internet, the ones used here come close to the actual numbers and help us decide the strength of the individual variables.
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">sas_iris<span style="color: #333333;">$</span>classified<span style="color: #333333;"><-</span>sas_clus<span style="color: #333333;">$</span>cluster
variable.stats<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(v,classified){
tot<span style="color: #333333;"><-</span>sd(v)
<span style="color: #888888;"># Pooled within-cluster SD: sqrt(SSW / (n - k)), with k = number of clusters</span>
wth<span style="color: #333333;"><-</span>sqrt(sum(tapply(v,classified,FUN <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">function</span> (x) {sum((x<span style="color: #333333;">-</span>mean(x))<span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">2</span>)}))<span style="color: #333333;">/</span>(length(v)<span style="color: #333333;">-</span>length(unique(classified))))
RSq<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">-</span>(wth<span style="color: #333333;">/</span>tot)<span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">2</span>
Ratio<span style="color: #333333;"><-</span>RSq<span style="color: #333333;">/</span>(<span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">-</span>RSq)
a<span style="color: #333333;"><-</span>c(tot,wth,RSq,Ratio)
a
}
vapply(X <span style="color: #333333;">=</span> sas_iris[,<span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">:</span><span style="color: #6600ee; font-weight: bold;">5</span>],FUN <span style="color: #333333;">=</span> variable.stats, FUN.VALUE <span style="color: #333333;">=</span> c(Tot.STD<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">0</span>,Within.STD<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">0</span>,RSQ<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">0</span>,RSQRatio<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">0</span>),
classified<span style="color: #333333;">=</span>sas_iris<span style="color: #333333;">$</span>classified)
</pre>
</div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
And this will give the results in a tabular format, similar to SAS:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4uDVEWabTW2HBuwvXZ4DDdq3nQcPwbSzZ5Q4B7R2-E6wBTFDyOHLI3zB6-sHxNy2UmV7z_03-gcCt73xI6t2w4k-huGUN_mUTd5EmahTl6szbToem8ce7mly0U2iOpm28dZH8NAqZ0UkA/s1600/kmeans_06.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj4uDVEWabTW2HBuwvXZ4DDdq3nQcPwbSzZ5Q4B7R2-E6wBTFDyOHLI3zB6-sHxNy2UmV7z_03-gcCt73xI6t2w4k-huGUN_mUTd5EmahTl6szbToem8ce7mly0U2iOpm28dZH8NAqZ0UkA/s640/kmeans_06.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
By looking at this, we can now make out that the variable <i>‘PetalLength’</i> produces the highest degree of separation, while <i>‘SepalWidth’</i> has the least. So, if we were to iteratively drop variables from the clustering, we would do away with the <i>‘SepalWidth’</i> variable first and so on.
</span></div>
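<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The same idea can be pooled over all the variables to get a single overall row. The sketch below is an assumption on my part about how such a row could be computed, not a documented SAS formula: it spreads the sums of squares over all variables and puts the degrees-of-freedom factor (n-k)/(n-1) back into the R-Square, so that it equals between_SS/total_SS. The built-in <i>iris</i> data multiplied by 10 stands in for the SAS file.
</span></div>

```r
set.seed(1)                 # assumed seed
x   <- iris[1:4] * 10       # assumed stand-in for the SAS data in mm
fit <- kmeans(x, centers = 3, iter.max = 100, nstart = 25)

n <- nrow(x)            # observations
k <- length(fit$size)   # clusters
p <- ncol(x)            # variables

# Standard deviations pooled over all p variables.
total_std  <- sqrt(fit$totss / (p * (n - 1)))
within_std <- sqrt(fit$tot.withinss / (p * (n - k)))

# R-Square as 1 - SSW/SST, written in terms of the two standard deviations.
rsq <- 1 - ((n - k) / (n - 1)) * (within_std / total_std)^2

round(c(Tot.STD = total_std, Within.STD = within_std,
        RSQ = rsq, RSQRatio = rsq / (1 - rsq)), 4)
```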
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The cluster means are already stored in the k-means result object. But if we want to generate both the means and the standard deviations of each cluster, we can do it programmatically:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># Cluster means and standard deviations </span>
<span style="color: #888888;"># SD </span>
sapply(X <span style="color: #333333;">=</span> sas_iris[,<span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">:</span><span style="color: #6600ee; font-weight: bold;">5</span>],FUN <span style="color: #333333;">=</span> tapply,sas_iris<span style="color: #333333;">$</span>classified,sd)
<span style="color: #888888;"># Mean</span>
sapply(X <span style="color: #333333;">=</span> sas_iris[,<span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">:</span><span style="color: #6600ee; font-weight: bold;">5</span>],FUN <span style="color: #333333;">=</span> tapply,sas_iris<span style="color: #333333;">$</span>classified,mean)
<span style="color: #888888;"># Mean is same as cluster centers</span>
sas_clus<span style="color: #333333;">$</span>centers
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
to produce the result like this:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigVW183fqTrgLhWDuzwBeL_rZj8JEqMEy5BOjFzlEJItjS80MITusvhzTeTPg7AyuLB6btTwrJlrOnVnkymHxA_jEZXQzd6LAEmtJ6PmhqYq6VWamM7BwOuPSPqfsfeksAnSUWWXAxOCzz/s1600/kmeans_07.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEigVW183fqTrgLhWDuzwBeL_rZj8JEqMEy5BOjFzlEJItjS80MITusvhzTeTPg7AyuLB6btTwrJlrOnVnkymHxA_jEZXQzd6LAEmtJ6PmhqYq6VWamM7BwOuPSPqfsfeksAnSUWWXAxOCzz/s400/kmeans_07.png" width="400" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The Pseudo-F statistic can also be generated programmatically by using the formula:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">pseudo_F <span style="color: #333333;"><-</span> (sas_clus<span style="color: #333333;">$</span>betweenss<span style="color: #333333;">/</span>(length(sas_clus<span style="color: #333333;">$</span>size)<span style="color: #6600ee; font-weight: bold;">-1</span>))<span style="color: #333333;">/</span>(sas_clus<span style="color: #333333;">$</span>tot.withinss<span style="color: #333333;">/</span>(sum(sas_clus<span style="color: #333333;">$</span>size)<span style="color: #333333;">-</span>length(sas_clus<span style="color: #333333;">$</span>size)))
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
and the output would be:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">></span> pseudo_F
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">561.6278</span>
</pre>
</div>
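<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The formula can be wrapped in a small function so that it works on any <i>kmeans</i> result. The function name <i>pseudo_f</i>, the seed and <i>nstart</i> are my own choices. Because the statistic is a ratio of sums of squares, the built-in <i>iris</i> data in cm gives the same value as the data in mm, and the same number can also be written in terms of the overall R-Square.
</span></div>

```r
# Hypothetical helper: pseudo-F = (BSS / (k - 1)) / (WSS / (n - k)).
pseudo_f <- function(fit) {
  k <- length(fit$size)  # number of clusters
  n <- sum(fit$size)     # number of observations
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

set.seed(1)  # assumed seed
fit <- kmeans(iris[1:4], centers = 3, iter.max = 100, nstart = 25)
f_direct <- pseudo_f(fit)

# Equivalent form using the overall R-Square.
k  <- length(fit$size)
n  <- sum(fit$size)
r2 <- fit$betweenss / fit$totss
f_from_r2 <- (r2 / (k - 1)) / ((1 - r2) / (n - k))

c(direct = f_direct, from_r2 = f_from_r2)
```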
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As for the cubic clustering criterion (CCC) and the approximate overall R-Squared, the formulae behind the values displayed by SAS seem to be a closely guarded secret; they are not readily available on the internet and hence not exactly reproducible in R. However, to get a CCC value, we could use the package NbClust:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">library(NbClust)
NbClust(data <span style="color: #333333;">=</span> sas_iris[,<span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">:</span><span style="color: #6600ee; font-weight: bold;">5</span>],min.nc <span style="color: #333333;">=</span> <span style="color: #6600ee; font-weight: bold;">3</span>,max.nc <span style="color: #333333;">=</span> <span style="color: #6600ee; font-weight: bold;">3</span>,method <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">'kmeans'</span>,index <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"ccc"</span>)
</pre>
</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">$</span>All.index
nc.Ward index.CCC
<span style="color: #6600ee; font-weight: bold;">3.00000</span> <span style="color: #6600ee; font-weight: bold;">37.67012</span>
</pre>
</div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The CCC comes out as 37.67, which is completely different from what the SAS output reports. However, there is a lot of debate about which one is correct. Refer to the link here for more details. For now, let us just make peace with whatever is available in NbClust and move forward.
</span></div>
<br />
<div align="center" class="MsoNormal" style="text-align: justify;">
<span style="color: #222222; font-size: 12.0pt; line-height: 115%; mso-bidi-font-size: 11.0pt;">
<b>
Visualization of k-means cluster results
</b>
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
An area where R has a definite edge over SAS is the visualization of the results. Although the ODS has improved the plotting options on SAS, R is way ahead when it comes to creating colorful plots. So, once we have a cluster solution, we can use the powerful visualization features in R to create pretty plots. The reference used for this section of the post can be found <a href="http://stats.stackexchange.com/questions/31083/how-to-produce-a-pretty-plot-of-the-results-of-k-means-cluster-analysis">here </a>
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Using the code as mentioned in the above article, we can create pretty plots for the IRIS data results like this:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6JKpV-PP555-xD3vl-rkV9plopVgkJxeI7yWYsxYsr2Fq5N4SwJjVjWTiV2TNRKadHVOHuYDJ6k2temfoijLY3YTwjp2Xt_sEry63rLdVjEPlLLs1aHAKb-uNFayj9M4sDrq52TFhf3rj/s1600/kmeans_08.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi6JKpV-PP555-xD3vl-rkV9plopVgkJxeI7yWYsxYsr2Fq5N4SwJjVjWTiV2TNRKadHVOHuYDJ6k2temfoijLY3YTwjp2Xt_sEry63rLdVjEPlLLs1aHAKb-uNFayj9M4sDrq52TFhf3rj/s400/kmeans_08.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ2oMAEnRh37h1IOtJ31NmvOqi6soPeCOrc8oyGRMQwWJ7GwZeVq6YDH5blRkehuWVB_wps15fMveiA3f3DK6j0yLv9G_WP87lNSJ7F3ytmUhckZVqc4qslkzxrv3ap3-vMUZD8uKQBIIR/s1600/kmeans_09.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgZ2oMAEnRh37h1IOtJ31NmvOqi6soPeCOrc8oyGRMQwWJ7GwZeVq6YDH5blRkehuWVB_wps15fMveiA3f3DK6j0yLv9G_WP87lNSJ7F3ytmUhckZVqc4qslkzxrv3ap3-vMUZD8uKQBIIR/s400/kmeans_09.png" width="400" /></a></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As we can see, these plots have used the principal components decomposition to generate a 2-d plot for the 3 clusters. <br />
A pair-wise plot can be created to confirm the strength of each variable like this:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">pairs(x <span style="color: #333333;">=</span> sas_iris[,<span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">:</span><span style="color: #6600ee; font-weight: bold;">5</span>], col<span style="color: #333333;">=</span>c(<span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span><span style="color: #6600ee; font-weight: bold;">3</span>)[sas_iris<span style="color: #333333;">$</span>classified])
</pre>
</div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioi31yr8PfsZXzhFTZXK3suIoFccSTUonxEpz4Vt_Tso-QxbGvOUipfLSE_iXoR6FeOOIR0-YeNR6l6EgW0_H_yYlN4FaW0d3cNyzv0oCNuw-eRxHX_84kucnQfp6yli2covYBceaLvaFb/s1600/Kmeans_10.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEioi31yr8PfsZXzhFTZXK3suIoFccSTUonxEpz4Vt_Tso-QxbGvOUipfLSE_iXoR6FeOOIR0-YeNR6l6EgW0_H_yYlN4FaW0d3cNyzv0oCNuw-eRxHX_84kucnQfp6yli2covYBceaLvaFb/s640/Kmeans_10.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The output confirms that <i>PetalLength</i> has a very high separating power: the species ‘Setosa’ (colored green) has petal lengths between 10 and 20 mm, while Versicolor and Virginica (colored black and red respectively) have lengths from 30 to 70 mm.
</span></div>
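<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A two-dimensional view based on principal components, like the cluster plots shown earlier in this section, can also be sketched in base R. This is not the code from the referenced article; it is a minimal version that assumes <i>prcomp</i> on the unscaled built-in <i>iris</i> data, with the seed and <i>nstart</i> chosen by me.
</span></div>

```r
set.seed(1)  # assumed seed
x   <- iris[1:4]
fit <- kmeans(x, centers = 3, iter.max = 100, nstart = 25)

# Principal components of the (centered, unscaled) measurements.
pc        <- prcomp(x, center = TRUE, scale. = FALSE)
var_share <- pc$sdev^2 / sum(pc$sdev^2)   # variance explained per component

# Observations on the first two components, colored by cluster.
plot(pc$x[, 1:2], col = fit$cluster, pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * var_share[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * var_share[2]),
     main = "k-means clusters on the first two principal components")

# Project the cluster centers into the same space and mark them.
centers_pc <- predict(pc, newdata = fit$centers)
points(centers_pc[, 1:2], pch = 8, cex = 2)
```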
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Notes: <br />
1. Although the results turn out to be similar in both software packages in this example, there might be cases where it is impossible to match the results, as the internal implementations are very different. In some cases, the cluster sizes will differ considerably even when you run multiple iterations. <br />
2. A major factor that influences the results is the scaling of variables. It is always recommended to have variables on the same scale in order to arrive at optimal results.
</span></div>
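<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The scaling note can be illustrated with a minimal sketch. It assumes R's built-in <i>iris</i> data (measured in cm) rather than the sas_iris data frame used in this post, and the object names are only illustrative. It standardises each variable with scale() and compares the cluster assignments with and without scaling:
</span></div>

```r
# Built-in iris data; an assumption standing in for sas_iris from the post
x_raw    <- iris[, 1:4]
x_scaled <- scale(x_raw)   # centre each column to mean 0 and scale to sd 1

set.seed(42)               # kmeans() starts from random centres
fit_raw    <- kmeans(x_raw,    centers = 3, nstart = 25)
fit_scaled <- kmeans(x_scaled, centers = 3, nstart = 25)

# Cross-tabulate the two solutions to see how many rows change cluster
print(table(raw = fit_raw$cluster, scaled = fit_scaled$cluster))
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If the cross-tabulation is not a clean one-to-one mapping between the two sets of cluster labels, the scaling has changed the segmentation.
</span></div>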
<br /></div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNormal">
<span style="background-color: white; color: #333333; font-family: Cambria, serif; line-height: 115%;"><b><span style="font-size: large;">
Binary Logistic regression: Fast Concordance
</span></b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Posted by Shashia on 18 February 2014.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This is a follow-up to an earlier article on concordance in binary logistic regression. You can <a href="http://shashiasrblog.blogspot.in/2014/01/binary-logistic-regression-on-r.html">find the original article here</a>. In that post, I compared two or three different ways of computing concordance, discordance and ties while running a binary logistic regression model on R, and concluded that OptimisedConc was an accurate, yet fast way to get to concordance in R. In this post we cover the following topics:
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
- Function for Fast and accurate Concordance in logit models using R <br />
- Comparison of the fastConc function against other methods
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
My analyst friend wrote to me and complained that even the optimized function was not so optimized when it came to large datasets! The data frame that he used had more than a million observations and the function kept failing due to memory issues. It immediately occurred to me that the culprits were the huge matrices created in the function. It creates 3 matrices (initialized with zeroes), each of size (number of ones) * (number of zeros). So, if you had half a million ones and half a million zeroes in the dataset, you would need three matrices of (0.5M * 0.5M) cells each, even before the actual calculations in the ‘for’ loop began.
</span></div>
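<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To put a rough number on that, here is a back-of-the-envelope calculation. It assumes the matrices hold double-precision values (8 bytes per cell), which is what matrix(0, ...) would allocate; the variable names are only illustrative:
</span></div>

```r
n_ones         <- 5e5   # half a million ones
n_zeros        <- 5e5   # half a million zeros
bytes_per_cell <- 8     # assumption: double-precision cells

bytes_one_matrix <- n_ones * n_zeros * bytes_per_cell
tb_one_matrix    <- bytes_one_matrix / 1e12   # 2 terabytes per matrix
tb_all_three     <- 3 * tb_one_matrix         # 6 terabytes for the three matrices

cat(tb_one_matrix, "TB per matrix,", tb_all_three, "TB in total\n")
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Under that assumption no ordinary workstation could hold even one of the three matrices, which is consistent with the failures my friend saw.
</span></div>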
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As we sat and discussed it, we knew that if we were to use this function on real data, the matrix allocations and the dual for-loops had to somehow be optimized. And being the geek that he is, my friend suggested an approach to reduce the number of ‘for’ loops from two to one. The function below, which I have called fastConc, uses a single ‘for’ loop and R's native subsetting and vectorized comparison inside the loop to count the concordant and discordant pairs. It is one of the fastest functions that gives exact concordance values, and on the performance side it holds its own against the github code, which gives only approximate concordance values:
</span></div>
<br />
<!-- HTML generated using hilite.me --><div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #888888">###########################################################</span>
<span style="color: #888888"># Function fastConc : for concordance, discordance, ties</span>
<span style="color: #888888"># The function returns Concordance, discordance, and ties</span>
<span style="color: #888888"># by taking a glm binomial model result as input.</span>
<span style="color: #888888"># It uses optimisation through subsetting</span>
<span style="color: #888888">###########################################################</span>
fastConc<span style="color: #333333"><-</span><span style="color: #008800; font-weight: bold">function</span>(model){
<span style="color: #888888"># Get all actual observations and their fitted values into a frame</span>
fitted<span style="color: #333333"><-</span>data.frame(cbind(model<span style="color: #333333">$</span>y,model<span style="color: #333333">$</span>fitted.values))
colnames(fitted)<span style="color: #333333"><-</span>c(<span style="background-color: #fff0f0">'respvar'</span>,<span style="background-color: #fff0f0">'score'</span>)
<span style="color: #888888"># Subset only ones</span>
ones<span style="color: #333333"><-</span>fitted[fitted[,<span style="color: #6600EE; font-weight: bold">1</span>]<span style="color: #333333">==</span><span style="color: #6600EE; font-weight: bold">1</span>,]
<span style="color: #888888"># Subset only zeros</span>
zeros<span style="color: #333333"><-</span>fitted[fitted[,<span style="color: #6600EE; font-weight: bold">1</span>]<span style="color: #333333">==</span><span style="color: #6600EE; font-weight: bold">0</span>,]
<span style="color: #888888"># Initialise all the values</span>
pairs_tested<span style="color: #333333"><-</span>nrow(ones)<span style="color: #333333">*</span>nrow(zeros)
conc<span style="color: #333333"><-</span><span style="color: #6600EE; font-weight: bold">0</span>
disc<span style="color: #333333"><-</span><span style="color: #6600EE; font-weight: bold">0</span>
<span style="color: #888888"># Get the values in a for-loop</span>
<span style="color: #008800; font-weight: bold">for</span>(i <span style="color: #008800; font-weight: bold">in</span> seq_len(nrow(ones)))
{
conc<span style="color: #333333"><-</span>conc <span style="color: #333333">+</span> sum(ones[i,<span style="background-color: #fff0f0">"score"</span>]<span style="color: #333333">></span>zeros[,<span style="background-color: #fff0f0">"score"</span>])
disc<span style="color: #333333"><-</span>disc <span style="color: #333333">+</span> sum(ones[i,<span style="background-color: #fff0f0">"score"</span>]<span style="color: #333333"><</span>zeros[,<span style="background-color: #fff0f0">"score"</span>])
}
<span style="color: #888888"># Calculate concordance, discordance and ties</span>
concordance<span style="color: #333333"><-</span>conc<span style="color: #333333">/</span>pairs_tested
discordance<span style="color: #333333"><-</span>disc<span style="color: #333333">/</span>pairs_tested
ties_perc<span style="color: #333333"><-</span>(<span style="color: #6600EE; font-weight: bold">1</span><span style="color: #333333">-</span>concordance<span style="color: #333333">-</span>discordance)
<span style="color: #008800; font-weight: bold">return</span>(list(<span style="background-color: #fff0f0">"Concordance"</span><span style="color: #333333">=</span>concordance,
<span style="background-color: #fff0f0">"Discordance"</span><span style="color: #333333">=</span>discordance,
<span style="background-color: #fff0f0">"Tied"</span><span style="color: #333333">=</span>ties_perc,
<span style="background-color: #fff0f0">"Pairs"</span><span style="color: #333333">=</span>pairs_tested))
}
</pre></div>
<br/>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The output of the function is identical to that of the OptimisedConc function: it returns the Concordance, Discordance, Ties, etc. as ratios rather than percentages, which can be easily changed.
</span></div>
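<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
For readers who want to drop the remaining loop as well, the same counts can be obtained from sorted scores. The sketch below is not the fastConc function from this post: conc_sorted is a hypothetical helper that takes the 0/1 response and the fitted scores directly (for a glm fit these would be model$y and model$fitted.values), and it assumes a version of R in which findInterval() accepts the left.open argument. The data at the bottom is simulated purely for illustration:
</span></div>

```r
# Hypothetical helper: exact concordance counts without an explicit loop
conc_sorted <- function(y, score) {
  ones  <- score[y == 1]
  zeros <- sort(score[y == 0])
  pairs <- as.numeric(length(ones)) * length(zeros)
  # Zeros scoring strictly below each 'one' are concordant pairs
  conc <- sum(as.numeric(findInterval(ones, zeros, left.open = TRUE)))
  # Zeros scoring strictly above each 'one' are discordant pairs
  disc <- sum(as.numeric(length(zeros) - findInterval(ones, zeros)))
  list(Concordance = conc / pairs,
       Discordance = disc / pairs,
       Tied        = (pairs - conc - disc) / pairs,
       Pairs       = pairs)
}

# Simulated response and scores; rounding forces some tied scores
set.seed(1)
y     <- rbinom(200, 1, 0.3)
score <- round(runif(200), 2)
res   <- conc_sorted(y, score)
print(res)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Sorting the zeros once and locating each score among them avoids both the pairwise matrices and the per-row loop; the counts are converted to numeric before summing so that very large pair counts do not overflow R's integer type.
</span></div>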
<br />
<div class="MsoNormal">
<span style="font-family: Cambria, serif; font-size: 12.0pt; line-height: 115%;"><b>Performance of the function</b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Intuitively, the function fastConc() should do better on memory than OptimisedConc simply because it stores the concordance and discordance values in count variables rather than in big matrices. So, how do all these functions match up on time? To check, I used a dataset with 20,000 observations which had 2,000 ones and 18,000 zeros (a very low response model, you might say). That gives a total of 18,000 * 2,000 = <b>36,000,000</b> pairs which need to be tested. And these are the results of the functions:
</span></div>
<br />
<!-- HTML generated using hilite.me --><div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #333333">></span> system.time(bruteforce(logit_mod))
user system elapsed
<span style="color: #6600EE; font-weight: bold">4291.10</span> <span style="color: #6600EE; font-weight: bold">6.12</span> <span style="color: #6600EE; font-weight: bold">4479.85</span>
<span style="color: #333333">></span> system.time(OptimisedConc(logit_mod))
user system elapsed
<span style="color: #6600EE; font-weight: bold">221.98</span> <span style="color: #6600EE; font-weight: bold">0.45</span> <span style="color: #6600EE; font-weight: bold">223.69</span>
<span style="color: #333333">></span> system.time(fastConc(logit_mod))
user system elapsed
<span style="color: #6600EE; font-weight: bold">0.69</span> <span style="color: #6600EE; font-weight: bold">0.00</span> <span style="color: #6600EE; font-weight: bold">0.69</span>
</pre></div>
<br/>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As can be seen, bruteforce() took more than <b>an hour</b> to give me the concordance results! I had almost given up when the system.time() function finally returned the value. OptimisedConc does a lot better in terms of time, at about <b>4 minutes</b>, but it is pathetic in terms of memory utilization! fastConc() gives me the same result <b>within a second</b>, thanks to the native functions being used, and it consumes negligible memory.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
So, the verdict is clear. There will be lots of situations like this, where multiple things seem to work and produce the same output. However, it is always best to choose the one which uses native functions for its implementation rather than data-heavy or user-defined functions. If you ran logistic regression in other tools like SAS, you would not even worry about the functions, because they have already been implemented using ready-made native functions, and hence they tend to be really optimised! As for concordance in R, fastConc() now becomes my go-to function every time I run glm() code because of its sheer efficiency. If you have had any situation where you’ve used non-native functions to accomplish a task, let me know in the comments. I’ll be back with more posts soon. Till then, take care!
</span></div>
<br /></div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNormal">
<span style="background-color: white; color: #333333; font-family: Cambria, serif; line-height: 115%;"><b><span style="font-size: large;">
Excel style VLookup and RangeLookup in R
</span></b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Posted by Shashia on 30 January 2014.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A friend of mine, also an R enthusiast, came to me with this task that he was doing as part of a larger activity. The task seemed quite simple – assigning a bin value to each row of a dataset based on the information in a lookup table which contained information on the bins:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEippaltgwE0pbVUoqDLdGcoiYs-k8iw7uP04Ztkt2KjuZCKA_3BRKHlo5wwPE4LJH9AWNGaCMtkw7D8joZsGyK7dDX0c1546BS8_w9XDoGVBNbINEYAWFRq-9a7iHauwkx1L39Idks7UCdZ/s1600/data.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEippaltgwE0pbVUoqDLdGcoiYs-k8iw7uP04Ztkt2KjuZCKA_3BRKHlo5wwPE4LJH9AWNGaCMtkw7D8joZsGyK7dDX0c1546BS8_w9XDoGVBNbINEYAWFRq-9a7iHauwkx1L39Idks7UCdZ/s320/data.png" height="320" width="192" /></a>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNnSM9b-6E4Bcs5i8S3W8AB0B8DLuRUshWYNraujNR3DCDBDnL7jJaTF1KfvQwC5aAjaYzZNLX0iWRZumotljDbktadS51g70toI-lwGoyXCxiL1RFBU8XaKrwiR_2EYCcY1qyIKePjnzV/s1600/lookup.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjNnSM9b-6E4Bcs5i8S3W8AB0B8DLuRUshWYNraujNR3DCDBDnL7jJaTF1KfvQwC5aAjaYzZNLX0iWRZumotljDbktadS51g70toI-lwGoyXCxiL1RFBU8XaKrwiR_2EYCcY1qyIKePjnzV/s320/lookup.png" height="220" width="320" /></a></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The data table (a large table of about 20,000+ rows) contains a variable called ‘indep1’, the values of which range from -30 to 280. The information on the bins is contained in the lookup table, and the bin numbers increase with ‘min_value’/‘max_value’. The required output would be something like this:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7jMVF8_1YqimJ6gup9qT48fpOOxkXPHrVuwKi51_Gh1gtQxHLV2MalXG7733OkillPdcyapq2M2V-5DU_z5b8w8sir8C9secDy8zTLF4vcuobbv7k3HC0xV_2084GAojTmm9gB38TnwkB/s1600/output.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg7jMVF8_1YqimJ6gup9qT48fpOOxkXPHrVuwKi51_Gh1gtQxHLV2MalXG7733OkillPdcyapq2M2V-5DU_z5b8w8sir8C9secDy8zTLF4vcuobbv7k3HC0xV_2084GAojTmm9gB38TnwkB/s320/output.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The output is the data table with an extra column indicating which bin each value of indep1 belongs to. Just to take care of the details, in case the value of the variable is at the border (say row number 4 in this case), it should go into the higher bin (bin 9 instead of bin 8).
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
It seemed like a simple but interesting puzzle to solve on R. This post covers the following topics:<br />
- Excel style Vlookup in R <br />
- Range lookup in R similar to Vlookup in Excel <br />
- Comparison among all the lookup functions
</span></div>
<br />
<div class="MsoNormal">
<b><span style="font-family: Cambria, serif; font-size: 12.0pt; line-height: 115%;">Lookup on R</span></b></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Due to paucity of time, the first solution which the friend had tried was the non-algorithmic <a href="http://www.webopedia.com/TERM/B/brute_force.html">brute-force</a> approach of iterating through all the bins (1 to 10) for all the 20,000 rows of data and assigning the bin numbers to the data table. Something like this:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">### Brute force method</span>
full_iterate_way<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(data,lookup){
data<span style="color: #333333;">$</span>bin_num<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">0</span>
<span style="color: #008800; font-weight: bold;">for</span>(j <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>nrow(lookup)){
minval <span style="color: #333333;"><-</span> lookup[j, <span style="background-color: #fff0f0;">"min_value"</span>]
maxval <span style="color: #333333;"><-</span> lookup[j, <span style="background-color: #fff0f0;">"max_value"</span>]
label <span style="color: #333333;"><-</span> lookup[j, <span style="background-color: #fff0f0;">"bin_num"</span>]
<span style="color: #008800; font-weight: bold;">for</span>(k <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>nrow(data)){
<span style="color: #008800; font-weight: bold;">if</span>(data[k, <span style="background-color: #fff0f0;">"indep1"</span>] <span style="color: #333333;">>=</span> minval <span style="color: #333333;">&</span> data[k, <span style="background-color: #fff0f0;">"indep1"</span>] <span style="color: #333333;"><</span> maxval){
data[k, <span style="background-color: #fff0f0;">"bin_num"</span>] <span style="color: #333333;"><-</span> label
}
}
}
data
}
data_full<span style="color: #333333;"><-</span>full_iterate_way(data<span style="color: #333333;">=</span>data_table,lookup<span style="color: #333333;">=</span>lookup_table)
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The function iterates through all the 20,000 rows of the data table 10 times to assign the bin value to each row. It does what is needed. However, if you were a programmer who looked at the code, you would immediately have apprehensions about its performance because of the two nested ‘for’ loops. And thus began the programmer’s quest to find faster alternatives which would do the same.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If you had the data in Excel, you would immediately know that this can be achieved using the <a href="http://office.microsoft.com/en-in/excel-help/vlookup-function-HA102752820.aspx">RANGE LOOKUP </a> property of the powerful VLOOKUP function, as the lookup table is anyway in the increasing order of bins. What’s more, if you need a column other than bin_num (say bin_weight) on the data_table, it would be a matter of just changing argument 3 in VLOOKUP to get the desired column. So, the first improvement to the brute force would be to replicate the VLOOKUP (range lookup instead of the exact lookup) on R. Something like this:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">rngLookup<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(value, dataFrame,column){
retVal<span style="color: #333333;"><-</span>dataFrame[value<span style="color: #333333;"><</span>dataFrame[,<span style="background-color: #fff0f0;">"min_value"</span>],column][<span style="color: #6600ee; font-weight: bold;">1</span>]<span style="color: #6600ee; font-weight: bold;">-1</span>
<span style="color: #008800; font-weight: bold;">if</span>(is.na(retVal)){retVal<span style="color: #333333;"><-</span>nrow(dataFrame)}
retVal
}
lookup_way<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(data,lookup){
<span style="color: #008800; font-weight: bold;">for</span>(i <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>nrow(data)){
data<span style="color: #333333;">$</span>bin_num[i]<span style="color: #333333;"><-</span>rngLookup(data[i,<span style="color: #6600ee; font-weight: bold;">2</span>],dataFrame<span style="color: #333333;">=</span>lookup,column<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">2</span>)
}
data
}
data_lookedup<span style="color: #333333;"><-</span>lookup_way(data_table,lookup<span style="color: #333333;">=</span>lookup_table)
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Since we have a function to do the lookup, we can call it for every row of the data frame, eliminating one ‘for’ loop. ‘data_lookedup’ would now contain the same information as ‘data_full’. Just for the record, replacing the ‘<’ sign in the lookup function with the ‘==’ sign can give you the exact VLOOKUP function of Excel in R.
</span></div>
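<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
For the exact-match case, base R's match() gives a vectorized alternative that needs no loop at all. This is a minimal sketch with made-up data; the price_list table and its column names are hypothetical and are not part of the example above:
</span></div>

```r
# Hypothetical lookup table: one row per key
price_list <- data.frame(item  = c("apple", "banana", "cherry"),
                         price = c(10, 20, 30))

# Keys to look up; "mango" has no entry in the table
wanted <- c("banana", "apple", "mango", "cherry")

# match() returns the row position of each key (NA when there is no match),
# and that position is then used to pull the wanted column
looked_up <- price_list$price[match(wanted, price_list$item)]
print(looked_up)   # 20 10 NA 30
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The NA for the missing key plays the role of the #N/A that an exact VLOOKUP returns in Excel.
</span></div>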
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Although the performance improved slightly after using the lookup function, it is still not an optimal way of going about things in R. This is mainly because we are still stuck with the programming paradigm of looping instead of using the powerful vectorization and subsetting capabilities that R offers. So, we explore further and arrive at the next code to do the same thing - the ubiquitous and powerful SQL:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">library(sqldf)
sql_way<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(data,lookup){
data<span style="color: #333333;"><-</span>sqldf(<span style="background-color: #fff0f0;">"select A.*,B.bin_num from</span>
<span style="background-color: #fff0f0;"> data A left join lookup B </span>
<span style="background-color: #fff0f0;"> ON (A.indep1 >= B.min_value and A.indep1 < B.max_value)"</span>)
data
}
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The library <a href="http://cran.r-project.org/web/packages/sqldf/index.html">sqldf </a> allows SQL code to run directly on R data frames, and this is one of the most elegant and efficient solutions I have come across for the range lookup on R. The improvements to the performance are very substantial, as can be seen in the summary in the performance section below. However, the only thing which made me explore further for an even better alternative was that I was not convinced that a language like R, which is acclaimed to be one of the best for statistical analysis, did not have a native function to achieve this simple task. And then I stumbled upon this:
</span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
<a href="http://stat.ethz.ch/R-manual/R-patched/library/base/html/findInterval.html">findInterval</a>
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
How could I miss something as simple and intuitive as this in the first place? It was sort of the perfect answer to the question we asked and it was as native as R itself! And so, here is the simple code which will do what we had been trying to achieve all along. It is the one I would prefer to use when the bin numbers start from ‘1’ and go up to ‘10’ as in the example above:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">find_interval<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(data,lookup){
data<span style="color: #333333;">$</span>label<span style="color: #333333;"><-</span>findInterval(x<span style="color: #333333;">=</span>data<span style="color: #333333;">$</span>indep1,vec<span style="color: #333333;">=</span>lookup<span style="color: #333333;">$</span>min_value)
data
}
data_interval<span style="color: #333333;"><-</span>find_interval(data<span style="color: #333333;">=</span>data_table,lookup_table)
</pre>
</div>
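<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To see how findInterval treats values that sit exactly on a bin border, here is a small self-contained sketch. The break points are made up for illustration (the actual lookup table is only shown as an image above); only the general shape, ten bins with increasing lower bounds, is taken from the example:
</span></div>

```r
# Hypothetical lower bounds of 10 bins, in increasing order
min_value <- c(-30, 0, 30, 60, 90, 120, 150, 180, 210, 230)

# Test values, including two that sit exactly on a border (210 and 230)
x <- c(-30, 15, 210, 229.9, 230, 280)

# findInterval() counts how many break points are <= each value,
# so a value on a border falls into the higher bin
bins <- findInterval(x, min_value)
print(bins)   # 1 2 9 9 10 10
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This matches the border rule stated earlier: a value equal to a bin's lower bound is assigned to that bin, not the one below it.
</span></div>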
<br />
<div class="MsoNormal">
<span style="font-family: Cambria, serif; font-size: 12.0pt; line-height: 115%;"><b>Comparison of the lookup functions</b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
We now have 4 functions which do the exact same thing, and just by looking at them, we can assume the latter 2 to be more elegant than the earlier ones. However, let us also use the system.time function to see how each one of them performs when run on a data frame of 25000 rows:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghne-PWuPtL8FvhW3d7nfJi16Fbuw48YxY4Hj5P-TDY64xMuDwBfAIA8n8tCXRnOfdI3TextuyAXhbuyoUACP48lZemiqYJeoK_HI_SQGXV4g7F5mqK6UBEr42t6GwMBKTkzWEgM0uzUqy/s1600/performance.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghne-PWuPtL8FvhW3d7nfJi16Fbuw48YxY4Hj5P-TDY64xMuDwBfAIA8n8tCXRnOfdI3TextuyAXhbuyoUACP48lZemiqYJeoK_HI_SQGXV4g7F5mqK6UBEr42t6GwMBKTkzWEgM0uzUqy/s320/performance.png" height="130" width="400" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As expected, the brute force method is the slowest owing to the double for-loops, and the lookup function trims the time only modestly. The SQL and the native findInterval win hands down, bringing the time down by orders of magnitude. The 'dual for-loop' brute force approach took <b>37 seconds</b>, while the lookup reduced it to <b>25 seconds</b>, only a fractional improvement. The SQL got it down to <b>0.2 seconds</b> and findInterval did it in no time at all! The small overhead in the SQL as compared to findInterval is probably because of the Cartesian-product table join it needs to perform.
</span></div>
<br />
<div class="MsoNormal">
<span style="font-family: Cambria, serif; font-size: 12.0pt; line-height: 115%;"><b>Concluding remarks</b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
1. Although all the functions above achieve the same result, there could be slight differences. Some of the functions might produce unexpected results when the variable value is at the extreme end of the lookup table. Say, if the variable value is 280 (the highest value), the brute force approach gives a bin_num of ‘0’ due to the initialization, and the SQL method gives a value of ‘NA’ because the join conditions do not match. However, findInterval has no such problem, because it only compares against the lower bounds, so the 10th bin is anything at or above 230.
</span></div>
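<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The edge case can be reproduced on a toy example. The three-bin lookup table below is hypothetical; the half-open condition (value >= min_value and value < max_value) is the one used by the brute force and SQL versions above, written here in vectorized form:
</span></div>

```r
# Hypothetical 3-bin lookup table
lookup <- data.frame(bin_num   = 1:3,
                     min_value = c(0, 10, 20),
                     max_value = c(10, 20, 30))
x <- c(0, 10, 29, 30)   # 30 is exactly the top of the last bin

# Half-open rule: min_value <= x < max_value (rows are values, columns are bins)
in_bin <- outer(x, lookup$min_value, ">=") & outer(x, lookup$max_value, "<")
half_open <- apply(in_bin, 1, function(r) {
  if (any(r)) lookup$bin_num[which(r)] else NA
})

# findInterval only looks at the lower bounds
by_interval <- findInterval(x, lookup$min_value)

print(half_open)     # 1 2 3 NA
print(by_interval)   # 1 2 3 3
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The value at the very top of the table matches no bin under the half-open rule, while findInterval places it in the last bin.
</span></div>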
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
2. findInterval is not a complete lookup because it can return only numbers from 0/1 up to the number of bins. Suppose in the above example we also wanted the ‘bin_weight’ variable along with the ‘bin_num’ variable for all rows of indep1: findInterval alone would not be able to achieve that, but there would be no such problem if we used the SQL method. Suppose the desired output also included the bin_weight column:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtxkJjzTxrCNfUy6HWL8cqsGs1oZykPP9i3nC7qjNEWYnTPQCjctkPmCGa0UiNO2NjJta_ROZP4hM7dWRwx5LKDRGVsEd3k_GcMejVwkXEQN_Krm5k-e56JTGyjLCHp1GUX9sPBJYtsg8u/s1600/output_2.png" imageanchor="1"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtxkJjzTxrCNfUy6HWL8cqsGs1oZykPP9i3nC7qjNEWYnTPQCjctkPmCGa0UiNO2NjJta_ROZP4hM7dWRwx5LKDRGVsEd3k_GcMejVwkXEQN_Krm5k-e56JTGyjLCHp1GUX9sPBJYtsg8u/s320/output_2.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
We could tweak the find_interval code to achieve this as well:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">library(plyr)
find_interval<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(data,lookup){
data<span style="color: #333333;">$</span>bin_num<span style="color: #333333;"><-</span>findInterval(x<span style="color: #333333;">=</span>data<span style="color: #333333;">$</span>indep1,vec<span style="color: #333333;">=</span>lookup<span style="color: #333333;">$</span>min_value)
data<span style="color: #333333;"><-</span>join(x<span style="color: #333333;">=</span>data,y<span style="color: #333333;">=</span>lookup,by<span style="color: #333333;">=</span><span style="background-color: #fff0f0;">"bin_num"</span>)[,c(<span style="color: #6600ee; font-weight: bold;">1</span>,<span style="color: #6600ee; font-weight: bold;">2</span>,<span style="color: #6600ee; font-weight: bold;">3</span>,<span style="color: #6600ee; font-weight: bold;">7</span>)]
data
}
data_interval<span style="color: #333333;"><-</span>find_interval(data<span style="color: #333333;">=</span>data_table,lookup_table)
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The addition of the join() call in 'find_interval' makes it very similar to the SQL method in terms of performance and functionality, and now either of them can be used in place of the earlier brute force approach.
</span></div>
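<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
When the lookup table is sorted by min_value, the number returned by findInterval is also the row position in the lookup table, so any column can be pulled by plain indexing without a join. The sketch below uses made-up tables with the same column names as above; it is an alternative under that sorting assumption, not the code used for the timings in this post:
</span></div>

```r
# Hypothetical lookup table, sorted by min_value
lookup <- data.frame(bin_num    = 1:3,
                     min_value  = c(0, 10, 20),
                     bin_weight = c(0.2, 0.5, 0.3))
data <- data.frame(id = 1:5, indep1 = c(5, 10, 25, 19, -3))

# Row position of each value in the lookup table
idx <- findInterval(data$indep1, lookup$min_value)
idx[idx == 0] <- NA   # values below the first bin have no row to index

# Pull as many lookup columns as needed with the same index
data$bin_num    <- lookup$bin_num[idx]
data$bin_weight <- lookup$bin_weight[idx]
print(data)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Replacing the zeros with NA matters: indexing with 0 silently drops the element, which would leave the new column shorter than the data frame.
</span></div>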
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Though this seemed like a very simple exercise, I found a lot of ways to do one particular thing in R while exploring it. The best part about R is that we still cannot conclude that this is the best way of getting the desired outcome. Purists may go ahead and suggest a merge using ‘data.table’, which seems to be way faster than the regular merge, the lookup function from the library qdap, or some combination of match(), etc. However, if you find that the code which you have does what you expect it to do without being too heavy on resources, you can continue using the thing which works rather than going for the kill on optimization. If you have used any better ways to achieve the same result, you are most welcome to share them. And if you found this post useful, please let me know that as well. I’ll be back writing more on these simple yet thought-provoking exercises. Have fun!
</span></div>
<br /></div>
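<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNormal">
<span style="background-color: white; color: #333333; font-family: Cambria, serif; line-height: 115%;"><b><span style="font-size: large;">
Binary logistic Regression on R: Concordance and Discordance
</span></b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Posted by Shashia on 15 January 2014 (updated 18 February 2014).
</span></div>
<br />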
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Logistic regression might not be the trendiest technique in the analytics industry anymore, but it is still bread and butter for most analytics folks, especially in the marketing decision sciences. Most propensity models, survival analyses, churn measurements, etc. are exclusively driven by this traditional yet powerful statistical technique.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A lot of material is available online to get started with building logistic regression models and getting the model fit criteria satisfied. If you are totally new to building logistic regression models, an excellent place to start would be the <a href="http://www.ats.ucla.edu/stat/r/dae/logit.htm">UCLA help articles</a> on building these binary logit models. Even before getting to the model building stage, some pre-processing and variable selection procedures must be followed in order to get good results, which would be the subject of a separate post. In this post we will cover some of the important model fit measures like concordance and discordance, and other association measures like Somers' D, Gamma and Kendall’s Tau-a, which compare the predicted responses to the actual responses.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The following questions will be answered during the course of this article:</span><br />
<ul>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Measures for logistic regression: concordance and discordance in R </span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Somers’ D, Gamma, Kendall’s Tau-a statistics in R</span></li>
</ul>
</div>
<br />
<div class="MsoNormal">
<span style="background-color: white; color: #333333; font-family: Cambria, serif; line-height: 115%;"><b><span style="font-size: large;">
Concordance and Discordance in R
</span></b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The most widely used code to run a logit model in R would be the glm() function with the ‘binomial’ family. So, if you wanted to run a logistic regression model on the hypothetical dataset (available on the UCLA website <a href="http://www.ats.ucla.edu/stat/data/binary.csv">here</a>), all you need to do is load the data set in R and run the binary logit using the following code:
</span></div>
<br />
<!-- HTML generated using hilite.me -->
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># Clear workspace objects</span>
rm(list<span style="color: #333333;">=</span>ls())
<span style="color: #888888;"># Load the modelling dataset into workspace</span>
model_data<span style="color: #333333;"><-</span>read.csv(<span style="background-color: #fff0f0;">'binary.csv'</span>,header<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">T</span>,sep<span style="color: #333333;">=</span><span style="background-color: #fff0f0;">','</span>,fill<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">T</span>)
<span style="color: #888888;"># Run a binary logistic regression model</span>
logit_mod<span style="color: #333333;"><-</span>glm(formula<span style="color: #333333;">=</span>admit<span style="color: #333333;">~</span>gre<span style="color: #333333;">+</span>gpa<span style="color: #333333;">+</span>rank,
family<span style="color: #333333;">=</span><span style="background-color: #fff0f0;">'binomial'</span>,data<span style="color: #333333;">=</span>model_data)
<span style="color: #888888;"># Display the summary</span>
summary(logit_mod)
</pre>
</div>
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">
And this is what the model summary would look like:
</span><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidJLf33S5gS_6hJ64Sf3Xn6CJkaxnn7JF0raCdETbzD3UhzwplLigF1qOKB09XMkugAyoj1FlgcBJGFGAQ0LQKHwI07OKt-GHIjNXp1gNwovxUSKzIF7s-pI7VHi6ZM6qWXEtMsETrzk29/s1600/4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="427" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidJLf33S5gS_6hJ64Sf3Xn6CJkaxnn7JF0raCdETbzD3UhzwplLigF1qOKB09XMkugAyoj1FlgcBJGFGAQ0LQKHwI07OKt-GHIjNXp1gNwovxUSKzIF7s-pI7VHi6ZM6qWXEtMsETrzk29/s640/4.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Since all the coefficients are significant and the residual deviance is lower than the null deviance, we can conclude that we have a fair model. But looking at the model result this way, it is really difficult to say how well the model performs. In OLS regression, the R-squared and its more refined version, the adjusted R-squared, would be the ‘one-stop’ metric that immediately tells us whether the model is a good fit or not. And since this is a value between 0 and 1, we could easily change it to a percentage and pass it off as ‘model accuracy’ for beginners and the not-so-math-oriented businesses. Unfortunately, adjusted R-squared does not carry over to logistic regression: we model the log odds rather than the response itself, which makes such a measure very difficult to explain.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This is where concordance steps in to help. Concordance tells us the association between the actual values and the values fitted by the model, in percentage terms. It is defined as the ratio of the number of 1-0 pairs in which the 1 had a higher model score than the 0 to the total number of 1-0 pairs possible. As a rule of thumb, a higher value for concordance (60-70%) means a better fitted model. However, a very large value for concordance (85-95%) could also suggest that the model is over-fitted and needs to be re-aligned to explain the entire population.
</span></div>
<br />
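<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To make the definition concrete, here is a minimal sketch on five made-up observations. The vectors actual and score are hypothetical and only for illustration; they are not taken from the admissions data set. With three 1s and two 0s there are six 1-0 pairs, and each pair is simply compared by its scores:
</span></div>
<br />

```r
# Toy data, made up for illustration: actual responses and hypothetical model scores
actual <- c(1, 1, 1, 0, 0)
score  <- c(0.9, 0.6, 0.3, 0.6, 0.2)

ones  <- score[actual == 1]   # scores of the 1s: 0.9, 0.6, 0.3
zeros <- score[actual == 0]   # scores of the 0s: 0.6, 0.2

# Every possible 1-0 pair: each 1 is paired with each 0
pairs   <- expand.grid(one = ones, zero = zeros)
n_pairs <- nrow(pairs)                      # 3 * 2 = 6 pairs

n_conc <- sum(pairs$one >  pairs$zero)      # the 1 scored higher: concordant
n_disc <- sum(pairs$one <  pairs$zero)      # the 1 scored lower: discordant
n_tied <- sum(pairs$one == pairs$zero)      # equal scores: tied

cat('Concordance:', n_conc / n_pairs, '\n')
cat('Discordance:', n_disc / n_pairs, '\n')
cat('Tied       :', n_tied / n_pairs, '\n')
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Here four of the six pairs are concordant, one is discordant and one is tied, so the concordance is 4/6, or about 66.7%.
</span></div>
<br />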
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A straight-forward, non-optimal, brute-force approach to getting to concordance would be to write the following code after building the model:
</span></div>
<br />
<!-- HTML generated using hilite.me -->
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">###########################################################</span>
<span style="color: #888888;"># Function Bruteforce : for concordance, discordance, ties</span>
<span style="color: #888888;"># The function returns Concordance, discordance, and ties</span>
<span style="color: #888888;"># by taking a glm binomial model result as input.</span>
<span style="color: #888888;"># It uses the brute force method of two for-loops</span>
<span style="color: #888888;">###########################################################</span>
bruteforce<span style="color: #333333;"><-</span><span style="color: #008800; font-weight: bold;">function</span>(model){
<span style="color: #888888;"># Get all actual observations and their fitted values into a frame</span>
fitted<span style="color: #333333;"><-</span>data.frame(cbind(model<span style="color: #333333;">$</span>y,model<span style="color: #333333;">$</span>fitted.values))
colnames(fitted)<span style="color: #333333;"><-</span>c(<span style="background-color: #fff0f0;">'respvar'</span>,<span style="background-color: #fff0f0;">'score'</span>)
<span style="color: #888888;"># Subset only ones</span>
ones<span style="color: #333333;"><-</span>fitted[fitted[,<span style="color: #6600ee; font-weight: bold;">1</span>]<span style="color: #333333;">==</span><span style="color: #6600ee; font-weight: bold;">1</span>,]
<span style="color: #888888;"># Subset only zeros</span>
zeros<span style="color: #333333;"><-</span>fitted[fitted[,<span style="color: #6600ee; font-weight: bold;">1</span>]<span style="color: #333333;">==</span><span style="color: #6600ee; font-weight: bold;">0</span>,]
<span style="color: #888888;"># Initialise all the values</span>
pairs_tested<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">0</span>
conc<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">0</span>
disc<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">0</span>
ties<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">0</span>
<span style="color: #888888;"># Get the values in a for-loop</span>
<span style="color: #008800; font-weight: bold;">for</span>(i <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>nrow(ones))
{
<span style="color: #008800; font-weight: bold;">for</span>(j <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>nrow(zeros))
{
pairs_tested<span style="color: #333333;"><-</span>pairs_tested<span style="color: #6600ee; font-weight: bold;">+1</span>
<span style="color: #008800; font-weight: bold;">if</span>(ones[i,<span style="color: #6600ee; font-weight: bold;">2</span>]<span style="color: #333333;">></span>zeros[j,<span style="color: #6600ee; font-weight: bold;">2</span>]) {conc<span style="color: #333333;"><-</span>conc<span style="color: #6600ee; font-weight: bold;">+1</span>}
<span style="color: #008800; font-weight: bold;">else</span> <span style="color: #008800; font-weight: bold;">if</span>(ones[i,<span style="color: #6600ee; font-weight: bold;">2</span>]<span style="color: #333333;">==</span>zeros[j,<span style="color: #6600ee; font-weight: bold;">2</span>]){ties<span style="color: #333333;"><-</span>ties<span style="color: #6600ee; font-weight: bold;">+1</span>}
<span style="color: #008800; font-weight: bold;">else</span> {disc<span style="color: #333333;"><-</span>disc<span style="color: #6600ee; font-weight: bold;">+1</span>}
}
}
<span style="color: #888888;"># Calculate concordance, discordance and ties</span>
concordance<span style="color: #333333;"><-</span>conc<span style="color: #333333;">/</span>pairs_tested
discordance<span style="color: #333333;"><-</span>disc<span style="color: #333333;">/</span>pairs_tested
ties_perc<span style="color: #333333;"><-</span>ties<span style="color: #333333;">/</span>pairs_tested
<span style="color: #008800; font-weight: bold;">return</span>(list(<span style="background-color: #fff0f0;">"Concordance"</span><span style="color: #333333;">=</span>concordance,
<span style="background-color: #fff0f0;">"Discordance"</span><span style="color: #333333;">=</span>discordance,
<span style="background-color: #fff0f0;">"Tied"</span><span style="color: #333333;">=</span>ties_perc,
<span style="background-color: #fff0f0;">"Pairs"</span><span style="color: #333333;">=</span>pairs_tested))
}
</pre>
</div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
All this code does is iterate through each and every 1-0 pair to see if the model score of the ‘1’ was greater than the model score of the ‘0’. Based on this comparison, it classifies the pair as a concordant pair, a discordant pair or a tied pair. The final values for concordance, discordance and ties are expressed as a proportion of the total number of pairs tested. When this code is run, we see the following output on the console:
</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8lFWXe-7YdABUAziT8GmJHHaVX7vpeSfBxpOGg9ukqBybR-KCTreORQX3wXfdzux3DOwqtd95lAcu0tmFZaGo5SMWnwANJEmvRx2nPEbZTV1SCGIRfoyJvGBtAiJPQZKa6-XeSdANzDa_/s1600/5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8lFWXe-7YdABUAziT8GmJHHaVX7vpeSfBxpOGg9ukqBybR-KCTreORQX3wXfdzux3DOwqtd95lAcu0tmFZaGo5SMWnwANJEmvRx2nPEbZTV1SCGIRfoyJvGBtAiJPQZKa6-XeSdANzDa_/s320/5.png" width="320" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As can be seen, the model has a concordance of 69.2%, which tells us that the model ranks a 1 above a 0 in roughly seven out of ten pairs and is therefore fairly good at telling them apart.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Although the above code gets the job done, it can be a real burden on system resources because of the two for-loops and the complete lack of optimization. The number of pairs to test is the number of ones multiplied by the number of zeros, so as the modelling data set grows, using this function can lead to long waiting times and sometimes to the R process crashing altogether.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Alternatively, the following function, provided by fellow blogger Vaibhav <a href="http://statour.blogspot.in/2012/12/concordance-and-discordance-in-logistic.html">here </a>, gives the same result in less computation time. It still uses two for-loops, but it keeps the scores and the results in matrices instead of a data frame. The code (originally posted at the above link) is:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">###########################################################</span>
<span style="color: #888888;"># Function OptimisedConc : for concordance, discordance, ties</span>
<span style="color: #888888;"># The function returns Concordance, discordance, and ties</span>
<span style="color: #888888;"># by taking a glm binomial model result as input.</span>
<span style="color: #888888;"># Although it still uses two-for loops, it optimises the code</span>
<span style="color: #888888;"># by creating initial zero matrices</span>
<span style="color: #888888;">###########################################################</span>
OptimisedConc<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">function</span>(model)
{
Data <span style="color: #333333;">=</span> cbind(model<span style="color: #333333;">$</span>y, model<span style="color: #333333;">$</span>fitted.values)
ones <span style="color: #333333;">=</span> Data[Data[,<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #333333;">==</span> <span style="color: #6600ee; font-weight: bold;">1</span>,]
zeros <span style="color: #333333;">=</span> Data[Data[,<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #333333;">==</span> <span style="color: #6600ee; font-weight: bold;">0</span>,]
conc<span style="color: #333333;">=</span>matrix(<span style="color: #6600ee; font-weight: bold;">0</span>, dim(zeros)[<span style="color: #6600ee; font-weight: bold;">1</span>], dim(ones)[<span style="color: #6600ee; font-weight: bold;">1</span>])
disc<span style="color: #333333;">=</span>matrix(<span style="color: #6600ee; font-weight: bold;">0</span>, dim(zeros)[<span style="color: #6600ee; font-weight: bold;">1</span>], dim(ones)[<span style="color: #6600ee; font-weight: bold;">1</span>])
ties<span style="color: #333333;">=</span>matrix(<span style="color: #6600ee; font-weight: bold;">0</span>, dim(zeros)[<span style="color: #6600ee; font-weight: bold;">1</span>], dim(ones)[<span style="color: #6600ee; font-weight: bold;">1</span>])
<span style="color: #008800; font-weight: bold;">for</span> (j <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>dim(zeros)[<span style="color: #6600ee; font-weight: bold;">1</span>])
{
<span style="color: #008800; font-weight: bold;">for</span> (i <span style="color: #008800; font-weight: bold;">in</span> <span style="color: #6600ee; font-weight: bold;">1</span><span style="color: #333333;">:</span>dim(ones)[<span style="color: #6600ee; font-weight: bold;">1</span>])
{
<span style="color: #008800; font-weight: bold;">if</span> (ones[i,<span style="color: #6600ee; font-weight: bold;">2</span>]<span style="color: #333333;">></span>zeros[j,<span style="color: #6600ee; font-weight: bold;">2</span>])
{conc[j,i]<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">1</span>}
<span style="color: #008800; font-weight: bold;">else</span> <span style="color: #008800; font-weight: bold;">if</span> (ones[i,<span style="color: #6600ee; font-weight: bold;">2</span>]<span style="color: #333333;"><</span>zeros[j,<span style="color: #6600ee; font-weight: bold;">2</span>])
{disc[j,i]<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">1</span>}
<span style="color: #008800; font-weight: bold;">else</span> <span style="color: #008800; font-weight: bold;">if</span> (ones[i,<span style="color: #6600ee; font-weight: bold;">2</span>]<span style="color: #333333;">==</span>zeros[j,<span style="color: #6600ee; font-weight: bold;">2</span>])
{ties[j,i]<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">1</span>}
}
}
Pairs<span style="color: #333333;">=</span>dim(zeros)[<span style="color: #6600ee; font-weight: bold;">1</span>]<span style="color: #333333;">*</span>dim(ones)[<span style="color: #6600ee; font-weight: bold;">1</span>]
PercentConcordance<span style="color: #333333;">=</span>(sum(conc)<span style="color: #333333;">/</span>Pairs)<span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">100</span>
PercentDiscordance<span style="color: #333333;">=</span>(sum(disc)<span style="color: #333333;">/</span>Pairs)<span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">100</span>
PercentTied<span style="color: #333333;">=</span>(sum(ties)<span style="color: #333333;">/</span>Pairs)<span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">100</span>
<span style="color: #008800; font-weight: bold;">return</span>(list(<span style="background-color: #fff0f0;">"Percent Concordance"</span><span style="color: #333333;">=</span>PercentConcordance,<span style="background-color: #fff0f0;">"Percent Discordance"</span><span style="color: #333333;">=</span>PercentDiscordance,<span style="background-color: #fff0f0;">"Percent Tied"</span><span style="color: #333333;">=</span>PercentTied,<span style="background-color: #fff0f0;">"Pairs"</span><span style="color: #333333;">=</span>Pairs))
}
</pre>
</div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This code does the same thing as above, but it works on matrices initialized with zeros. The measures for concordance, etc. are the same as in the brute-force approach, except that they are reported as percentages rather than proportions. The run time should be much lower than with the earlier code, most likely because indexing a matrix element is far cheaper in R than indexing a data frame element. Now, just for the sake of comparison, let us see the savings by looking at the time taken to execute the two functions. We use the system.time() function to measure the time:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgmWWJ9a1075GMWB7e0l_5KGYvIVVEHE7UT4iN6SdEn4EWPnKrqBRIqyP7PnidxhMtj4p2nbv9sYDtEcC5ALQYk7-6Yy68BbVoZ1WbJcAiUEvCc0a-W0WloUUplSqlXXbHmOe4K6_ps_UZ/s1600/6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="110" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhgmWWJ9a1075GMWB7e0l_5KGYvIVVEHE7UT4iN6SdEn4EWPnKrqBRIqyP7PnidxhMtj4p2nbv9sYDtEcC5ALQYk7-6Yy68BbVoZ1WbJcAiUEvCc0a-W0WloUUplSqlXXbHmOe4K6_ps_UZ/s320/6.png" width="320" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
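The second function does the same thing as the first in only about 10% of the time! Note that it is not truly vectorized, since both for-loops are still there; the saving comes from working on matrices instead of a data frame.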
</span></div>
<br />
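<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
For comparison, a version with no explicit loops at all can be sketched with the base R function outer(). This is only an illustration under stated assumptions: the name vectorised_conc is hypothetical, it is not the function from the linked post, and the data below is simulated rather than the admissions data set.
</span></div>
<br />

```r
# Sketch only: 'vectorised_conc' is a hypothetical name, not part of any package
vectorised_conc <- function(actual, score) {
  ones  <- score[actual == 1]
  zeros <- score[actual == 0]
  # Matrix of score differences: one row per 1, one column per 0
  diffs <- outer(ones, zeros, "-")
  pairs <- length(diffs)                   # number of 1-0 pairs
  list("Concordance" = sum(diffs > 0) / pairs,
       "Discordance" = sum(diffs < 0) / pairs,
       "Tied"        = sum(diffs == 0) / pairs,
       "Pairs"       = pairs)
}

# Simulated data, made up for illustration
set.seed(42)
x   <- rnorm(500)
y   <- rbinom(500, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = 'binomial')

# Same inputs the functions above read from the model object
res <- vectorised_conc(fit$y, fit$fitted.values)
print(res)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The trade-off is memory: outer() builds the full ones-by-zeros matrix in one go, so on a large data set memory rather than time is likely to become the limit.
</span></div>
<br />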
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Of course, other functions can be written that approximate the value of concordance instead of calculating it exactly from all the possible 1-0 pairs. One of the most frequently returned results when you search for concordance is the following link on <a href="https://gist.github.com/inkhorn/2151594"> GitHub </a>. This code performs even better than the optimized function above, but the catch is that it is not exact. It approximates the 1-0 pairs on the assumption that the data usually has about as many ones as zeros. If you calculate the concordance of the above model using this function, this is what you get:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSVBESCgn-Pfgf_mJBWgWr6D2UpwmBRfc1S6d8Iv-gnMOEh_HToz5OnjfkCMSVFPpHz8f3n9zLFYeDnGJpNYha97G_r2r7c72F4dA3Gz1bht1XxD614KvIqEG42PZfM5SsTsOWlUSQ6BFn/s1600/7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSVBESCgn-Pfgf_mJBWgWr6D2UpwmBRfc1S6d8Iv-gnMOEh_HToz5OnjfkCMSVFPpHz8f3n9zLFYeDnGJpNYha97G_r2r7c72F4dA3Gz1bht1XxD614KvIqEG42PZfM5SsTsOWlUSQ6BFn/s320/7.png" width="320" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The code reports a higher value for concordance (70.8%) than the actual value (69.2%). The estimate could become badly wrong if the data were sorted so that all the top-scoring ones sat at the top of the data set, in which case concordance would reach an unusually high value. The main advantage of this code is that it is very quick, so it can be used to get an approximate idea of the range in which the actual concordance lies. And it does not even take a second to do that! My vote would still be for the OptimisedConc function.
</span></div>
<br />
<div class="MsoNormal">
<span style="background-color: white; color: #333333; font-family: Cambria, serif; line-height: 115%;"><b><span style="font-size: large;">
Somers D, Gamma, Kendall’s Tau-a statistics in R
</span></b></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Once the total number of pairs and the numbers of concordant, tied and discordant pairs are obtained, calculating the above statistics is straightforward. Gamma (better known as Goodman and Kruskal’s gamma) is a measure of association in a doubly ordered contingency table. See <a href="http://www.stat-d.si/mz/mz8.1/goktas.pdf">here for more info </a>. It can be calculated as:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiKnx6UrDiOyC_mkFK3_8bYDPCp7kyxmqileEs35N-U8xXVm5PWijcVcBV0wdZJJJpN8B95910Fq96_mNpzIkonFRhxcdqeFvPmhWpS8dx_ooKZf1kaVlYalebQQ4hpBSHvIQP3UxjI7m_/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiKnx6UrDiOyC_mkFK3_8bYDPCp7kyxmqileEs35N-U8xXVm5PWijcVcBV0wdZJJJpN8B95910Fq96_mNpzIkonFRhxcdqeFvPmhWpS8dx_ooKZf1kaVlYalebQQ4hpBSHvIQP3UxjI7m_/s320/1.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
where P is the number of concordant pairs, Q is the number of discordant pairs and ‘T’ is the number of tied pairs. Gamma measures how much more often the model orders a 1-0 pair correctly (a concordant pair) than incorrectly (a discordant pair).
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Somers’ D is very similar to gamma, but unlike gamma it also takes the tied pairs into account by keeping them in the denominator. So, if there are tied pairs in the model, Somers’ D is smaller than gamma. It can be calculated as
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhz1GosGoSYFB8bjAUo3Yo4sjH-NuWtdFeC_iWChTFjZPen9NatHvrBZXDLGp-M44QgLqTDNLiA9HvTQ5HWhVONi-ERPzKcGmNavVc4r6zI5vNymBW23mH7i2a_PjL5SyLtNLiY-rwx2IE4/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhz1GosGoSYFB8bjAUo3Yo4sjH-NuWtdFeC_iWChTFjZPen9NatHvrBZXDLGp-M44QgLqTDNLiA9HvTQ5HWhVONi-ERPzKcGmNavVc4r6zI5vNymBW23mH7i2a_PjL5SyLtNLiY-rwx2IE4/s320/2.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Both gamma and Somers’ D can range from -1 to 1. For a model that ranks 1s above 0s more often than not, they lie between zero and one, and a higher value indicates better distinguishing ability for the model.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Kendall’s tau-a is one more measure of association in the model. It can be computed using the following formula:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHEphE3XUOF0tI1TtY770Sl3Ams9CThH2zOZG-VQBfMgmK9Mm8vJ8ULvfLMsIe4r-iGeDUc0T4bN41VRGzdXGagpTPNQ3SPl7wRTG2Zo4Eau-xQL01M-e5xpo1ZXXtA4p3mA4FEMMz886D/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHEphE3XUOF0tI1TtY770Sl3Ams9CThH2zOZG-VQBfMgmK9Mm8vJ8ULvfLMsIe4r-iGeDUc0T4bN41VRGzdXGagpTPNQ3SPl7wRTG2Zo4Eau-xQL01M-e5xpo1ZXXtA4p3mA4FEMMz886D/s320/3.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
where N is the total number of observations in the model. It also ranges from -1 to 1, and lies between 0 and 1 for a model with more concordant than discordant pairs. However, for any given model, Kendall’s tau-a would be much smaller than gamma or Somers’ D, because tau-a takes all possible pairs of observations, N(N-1)/2, as the denominator, while the others take only the 1-0 pairs.
</span></div>
<br />
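<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
As a quick sanity check, the three statistics can be worked out by hand from a set of pair counts. The counts below are hypothetical and chosen only to keep the arithmetic easy (10 ones and 10 zeros, hence 100 pairs), and the sketch uses the standard definitions: gamma divides by the concordant plus discordant pairs, Somers’ D by all 1-0 pairs, and tau-a by all pairs of observations.
</span></div>
<br />

```r
# Hypothetical counts, made up for illustration
conc <- 70     # concordant 1-0 pairs (P)
disc <- 25     # discordant 1-0 pairs (Q)
tied <- 5      # tied 1-0 pairs (T)
N    <- 20     # total observations: 10 ones + 10 zeros

pairs <- conc + disc + tied                  # 100 = 10 * 10 possible 1-0 pairs

gamma_stat <- (conc - disc) / (conc + disc)  # ties dropped from the denominator
somers_d   <- (conc - disc) / pairs          # ties kept in the denominator
tau_a      <- (conc - disc) / (N * (N - 1) / 2)  # all pairs of observations

cat('Gamma   :', round(gamma_stat, 4), '\n')
cat('Somers D:', round(somers_d, 4), '\n')
cat('Tau-a   :', round(tau_a, 4), '\n')
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
With these counts gamma is 45/95 (about 0.474), Somers’ D is 45/100 (0.45) and tau-a is 45/190 (about 0.237), which shows the ordering described above: tau-a is the smallest, and Somers’ D falls below gamma as soon as there are ties.
</span></div>
<br />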
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Once we know these definitions, we can modify the above function OptimisedConc to return these values as well, by replacing its final lines (from the percentage calculations through the return statement) with the following:
</span></div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">PercentConcordance<span style="color: #333333;">=</span>(sum(conc)<span style="color: #333333;">/</span>Pairs)<span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">100</span>
PercentDiscordance<span style="color: #333333;">=</span>(sum(disc)<span style="color: #333333;">/</span>Pairs)<span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">100</span>
PercentTied<span style="color: #333333;">=</span>(sum(ties)<span style="color: #333333;">/</span>Pairs)<span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">100</span>
N<span style="color: #333333;"><-</span>length(model<span style="color: #333333;">$</span>y)
<span style="color: #888888;"># Gamma drops tied pairs from the denominator; Somers' D keeps them</span>
gamma<span style="color: #333333;"><-</span>(sum(conc)<span style="color: #333333;">-</span>sum(disc))<span style="color: #333333;">/</span>(Pairs<span style="color: #333333;">-</span>sum(ties))
Somers_D<span style="color: #333333;"><-</span>(sum(conc)<span style="color: #333333;">-</span>sum(disc))<span style="color: #333333;">/</span>Pairs
k_tau_a<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">*</span>(sum(conc)<span style="color: #333333;">-</span>sum(disc))<span style="color: #333333;">/</span>(N<span style="color: #333333;">*</span>(N<span style="color: #6600ee; font-weight: bold;">-1</span>))
<span style="color: #008800; font-weight: bold;">return</span>(list(<span style="background-color: #fff0f0;">"Percent Concordance"</span><span style="color: #333333;">=</span>PercentConcordance,
<span style="background-color: #fff0f0;">"Percent Discordance"</span><span style="color: #333333;">=</span>PercentDiscordance,
<span style="background-color: #fff0f0;">"Percent Tied"</span><span style="color: #333333;">=</span>PercentTied,
<span style="background-color: #fff0f0;">"Pairs"</span><span style="color: #333333;">=</span>Pairs,
<span style="background-color: #fff0f0;">"Gamma"</span><span style="color: #333333;">=</span>gamma,
<span style="background-color: #fff0f0;">"Somers D"</span><span style="color: #333333;">=</span>Somers_D,
<span style="background-color: #fff0f0;">"Kendall's Tau A"</span><span style="color: #333333;">=</span>k_tau_a))
</pre>
</div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
And the call to the function would return:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisThAyMaqiUuYCsqrFLH-UPElrL71Qaw4Cyvn-T9-NKEL-tj-Xi7or3oKK1EoqPLGfELGXvq0fPsM-2e-025nk_h2zooGNaxGgvYe5e-8FR7jPQ2FrP0bWHHHIMVkCJYQiRtCO8rP8BEmv/s1600/8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEisThAyMaqiUuYCsqrFLH-UPElrL71Qaw4Cyvn-T9-NKEL-tj-Xi7or3oKK1EoqPLGfELGXvq0fPsM-2e-025nk_h2zooGNaxGgvYe5e-8FR7jPQ2FrP0bWHHHIMVkCJYQiRtCO8rP8BEmv/s320/8.png" width="291" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This post covered one of the practical considerations to be taken into account while running predictive models using R. In the upcoming posts, I plan to cover some of the ways the above outputs can be beautified using html and some of the other practical considerations while modeling on R. If you liked this post/found it useful, you can give me a thumbs up using comment/likes. I’ll be back with more on these areas of predictive modeling soon. Till then, happy modeling :)
</span></div>
<br />
<div class="MsoNormal">
<span style="font-family: "Cambria","serif"; font-size: 12.0pt; line-height: 115%; mso-ascii-theme-font: major-latin; mso-hansi-theme-font: major-latin;"><b>Update: 18 Feb 2014</b><o:p></o:p></span></div>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A follow-up to this article has been published today. Although OptimisedConc works well to save time, it is very poor in terms of memory utilization. Hence, a better function named 'fastConc' has been written, which makes use of R's native functionality. <br/>
You can find the new article and the function <a href="http://shashiasrblog.blogspot.in/2014/02/binary-logistic-regression-fast.html">on this link.</a>
</span></div>
<br /></div>
Shashiahttp://www.blogger.com/profile/01602809065610957096noreply@blogger.com7tag:blogger.com,1999:blog-4849113628830840898.post-36018858001288636972013-10-16T20:57:00.000+05:302013-10-18T12:29:33.378+05:30VBA front end for R<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If you work in the analytics industry, I am sure you would have had this debate sometime or the other – <i>pros and cons of R</i>. While everyone agrees that R is quite powerful and has great graphics, most of us, especially those who have worked on GUI based tools like SASEG, etc agree that the text output of R can be pretty verbose. A colleague of mine ran a linear model and immediately exclaimed ‘it looks so bland!’
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This set me exploring ways to beautify R. I found some interesting packages which help in formatting the output: you can check out prettyR and the HTML converter packages, which can do wonders to the plain text output in R. However, my requirements were a little customized. We used Excel in most of our day-to-day activities, and VBA is quite powerful in parsing and formatting results. So, why not use Excel and VBA to create a beautiful front end to run R? It could be a macro-enabled tool which reads input from an Excel sheet, runs the regression code using RScript and displays the formatted output in Excel. Well, it turns out that I was able to do all that and even more. This post explains the findings of my endeavor:
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The following questions will be answered during the course of this article:<br />
- How to run an RScript through MS Excel using VBA?<br />
- How to run an RScript through command prompt? [in WINDOWS]<br />
- How to pass arguments to an RScript through command line/external code? [in WINDOWS]<br />
- How to read plain text files in MS Excel using VBA? [obviously WINDOWS :) ]<br />
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Just so that we know that our commands are executed correctly, let us write the following simple R code and save it in our directory ‘C:\R_code’ as ‘hello.R’:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># Contents of 'C:\R_code\hello.R'</span>
<span style="color: #888888;"># Prints output to console</span>
cat(<span style="background-color: #fff0f0;">'Hello World'</span>)
var1<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">5</span><span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">3</span>
var2<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">7</span><span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">3</span>
cat(<span style="background-color: #fff0f0;">'\nThe result of adding'</span>,var1,<span style="background-color: #fff0f0;">'to'</span>,var2,<span style="background-color: #fff0f0;">'is'</span>,var1<span style="color: #333333;">+</span>var2)
</pre>
</div>
<br />
<div style="text-align: justify;">
<b>Running RScript through command prompt:</b>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To be able to run R code through the command prompt or other applications, the directory containing ‘R.exe’ and ‘RScript.exe’ must be listed in your PATH environment variable. You can add it to the system PATH easily if you have admin rights to your system. Check <a href="http://www.nextofwindows.com/how-to-addedit-environment-variables-in-windows-7/">this link</a> to know how to do it on Windows 7. However, if you don’t have admin rights, don’t worry: you can add the directory to the user-level PATH variable instead. Here are the steps on how to do this:
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
1. Suppose you have the ‘R.exe’ and ‘RScript.exe’ installed in the following directory: ‘C:\Program Files\R\R-2.15.3\bin\x64’. Copy this path to your clipboard.<br />
2. Go to ‘Computer’ -- > Properties<br />
3. On the left pane, click on ‘Advanced system settings’ <br />
4. On the ‘System properties’ dialog that opens up, navigate to the ‘Advanced’ tab and click on ‘Environment variables…’<br />
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0-050CJFwnsE0QVn-iCYgnZrf89uwVyznGWR7JMvezfh4QUejsZYHbEe6sYt4d1Ywn36upO3QM3JBRjPQn_dsMiyVZYkLYVUL0qcQlE1331xCNfjCX74QOWR9gpyZ4VHdDI7TZKTPLub_/s1600/Env_variables.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0-050CJFwnsE0QVn-iCYgnZrf89uwVyznGWR7JMvezfh4QUejsZYHbEe6sYt4d1Ywn36upO3QM3JBRjPQn_dsMiyVZYkLYVUL0qcQlE1331xCNfjCX74QOWR9gpyZ4VHdDI7TZKTPLub_/s320/Env_variables.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
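5. In the ‘User variables’ section, click on ‘New…’. (If a user variable named PATH already exists, select it and click on ‘Edit…’ instead, then append the directory to its existing value.)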
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYLgBSjTGNFlEH2d63p-ONseWBgultoMR6fxZeU0PZxiUzHNu_Crvz3qhRUhBIkvdhqBxMyj3c1T8QEC9ROwqGYBKeFAc3BojCiQk2PBe0dEOpuvLe0LVyLRoWzZpMtWlofYMgHBhNm_UM/s1600/new.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYLgBSjTGNFlEH2d63p-ONseWBgultoMR6fxZeU0PZxiUzHNu_Crvz3qhRUhBIkvdhqBxMyj3c1T8QEC9ROwqGYBKeFAc3BojCiQk2PBe0dEOpuvLe0LVyLRoWzZpMtWlofYMgHBhNm_UM/s320/new.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
6. In the field ‘Variable Name:’, type PATH<br />
7. In the field ‘Variable Value:’, paste the clipboard value, i.e. ‘C:\Program Files\R\R-2.15.3\bin\x64’. Add a semicolon ‘;’ after that. <br />
8. Click on ‘OK’ as many times as needed to dismiss all the dialog boxes.<br />
9. Open a new command prompt (one that was already open will not see the updated PATH), type ‘Rscript’ and hit ENTER. You will see the following:<br />
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4zSNf8gipRpSgKSkNPSFYT_J7t0AmjNGG_-sjJ84HRtAKIf0UgoEL0JD4JRj1xSRcasvYr1F2sWd5i5YmJwfTlai1FTEnvnIOgtxvhKjOzSkmAp46tFLsNHjZiDHhFtTIR9GtKLsTmXx_/s1600/RscriptCmd.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4zSNf8gipRpSgKSkNPSFYT_J7t0AmjNGG_-sjJ84HRtAKIf0UgoEL0JD4JRj1xSRcasvYr1F2sWd5i5YmJwfTlai1FTEnvnIOgtxvhKjOzSkmAp46tFLsNHjZiDHhFtTIR9GtKLsTmXx_/s320/RscriptCmd.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Now that you have Rscript on your path, you can run R code from any directory on your system, and from applications like MS-Excel through VBA. Just repeat step 9, passing any *.R file with its full path as the argument, and it will execute as expected:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAImkVOVVj54B1Y9El3Om-FU1O7GnU8lRbJ3rThiY01w3Eyh2DxuBjs9KneNW7VbJb_Gy4ea0zBagLnsBtjn8rjrKxA9qlRKpiBIEtWJ76c0iKCZXE7xiWD3_akPVVETum9x6SK6sUyldg/s1600/RScriptexecute.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgAImkVOVVj54B1Y9El3Om-FU1O7GnU8lRbJ3rThiY01w3Eyh2DxuBjs9KneNW7VbJb_Gy4ea0zBagLnsBtjn8rjrKxA9qlRKpiBIEtWJ76c0iKCZXE7xiWD3_akPVVETum9x6SK6sUyldg/s320/RScriptexecute.png" /></a></div>
<br />
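<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If you are not sure which directory to copy in step 1, R itself can tell you. The following is a minimal sketch, meant to be run from an R session. It uses only the base functions R.home and Sys.which; the variable names are illustrative, and on Windows the reported directory is typically the architecture-specific folder (such as ‘bin\x64’), though that may vary by installation:
</span></div>

```r
# Sketch: find the directory that holds Rscript for the running R installation.
rscript_dir <- R.home("bin")

# The executable carries an .exe suffix on Windows only
exe_name <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
rscript_path <- file.path(rscript_dir, exe_name)

cat("Directory to add to PATH:", normalizePath(rscript_dir, mustWork = FALSE), "\n")
cat("Rscript found there:", file.exists(rscript_path), "\n")

# Sys.which() returns an empty string when the program is not on the PATH
on_path <- nzchar(Sys.which("Rscript"))
cat("Rscript already on PATH:", on_path, "\n")
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If the last line reports FALSE, the PATH change has not taken effect for the process that launched R.
</span></div>
<br />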
<div style="text-align: justify;">
<b>Running RScript through VBA:</b>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To run this code in MS-Excel using VBA, you need to open a macro-enabled workbook (*.xlsm). To create one, just create a new workbook, click on ‘Save As..’ and save it as ‘Excel Macro-Enabled Workbook (*.xlsm)’. Once you have a macro-enabled workbook open, press the shortcut key combination ‘ALT + F11’ to open up the VBA editor. Then right click in the ‘Project Explorer’, insert a new module (which will be Module1 by default) and type the following VBA code:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">Sub</span> <span style="color: #0066bb; font-weight: bold;">RunRscript</span>()
<span style="color: #888888;">'runs an external R code through Shell</span>
<span style="color: #888888;">'The location of the RScript is 'C:\R_code'</span>
<span style="color: #888888;">'The script name is 'hello.R'</span>
<span style="color: #008800; font-weight: bold;">Dim</span> shell <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Object</span>
<span style="color: #008800; font-weight: bold;">Set</span> shell <span style="color: #333333;">=</span> VBA.CreateObject(<span style="background-color: #fff0f0;">"WScript.Shell"</span>)
<span style="color: #008800; font-weight: bold;">Dim</span> waitTillComplete <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Boolean</span>: waitTillComplete <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">True</span>
<span style="color: #008800; font-weight: bold;">Dim</span> style <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Integer</span>: style <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">Dim</span> errorCode <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Integer</span>
<span style="color: #008800; font-weight: bold;">Dim</span> path <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">String</span>
path <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"RScript C:\R_code\hello.R"</span>
errorCode <span style="color: #333333;">=</span> shell.Run(path, style, waitTillComplete)
<span style="color: #008800; font-weight: bold;">End</span> <span style="color: #008800; font-weight: bold;">Sub</span>
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
If you look at the VBA code carefully, it creates a Windows Script Host shell object (WScript.Shell) and invokes the R command through it. The advantage of using WScript.Shell is that you can get VBA to wait till the execution is finished (the third argument to Run), and Run then returns the exit code of the command. To get more information on how to run a macro or use the VBA editor, you can refer to the many online tutorials that are easily available. A good place to start would be the MSDN tutorial which you can find <a href="http://msdn.microsoft.com/en-us/library/ee814737(v=office.14).aspx">here</a>.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
When you run this macro, you can see that a command window opens up, executes something and closes. But how do you know if the code has actually executed? A good way is to redirect the console output of the R code to a file. You can do this with the <a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/sink.html">sink</a> function in R. Here is the modified R code which accomplishes this:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># Re-directs the console output to a file 'hello.txt'</span>
<span style="color: #888888;"># The file is created in the directory 'C:\R_code'</span>
sink(<span style="background-color: #fff0f0;">'C:/R_code/hello.txt'</span>,append<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">FALSE</span>,type<span style="color: #333333;">=</span><span style="background-color: #fff0f0;">"output"</span>)
cat(<span style="background-color: #fff0f0;">'Hello World'</span>)
var1<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">5</span><span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">3</span>
var2<span style="color: #333333;"><-</span><span style="color: #6600ee; font-weight: bold;">7</span><span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">3</span>
cat(<span style="background-color: #fff0f0;">'\nThe result of adding'</span>,var1,<span style="background-color: #fff0f0;">'to'</span>,var2,<span style="background-color: #fff0f0;">'is'</span>,var1<span style="color: #333333;">+</span>var2)
sink(<span style="color: #008800; font-weight: bold;">NULL</span>)
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Once you’ve run the VBA macro, browse to C:\R_code and check whether ‘hello.txt’ has been created. If you can find the file there, then congratulations! You have successfully used VBA to execute an R script.
</span></div>
<br />
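<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
One thing to watch with sink is that an error in the middle of the script leaves the output redirected. A minimal sketch of one way to guard against that is shown below. The helper name with_sink is hypothetical (it is not part of base R), and a temporary file stands in for ‘C:/R_code/hello.txt’ so the example can run anywhere:
</span></div>

```r
# Hypothetical helper: evaluates `code` with console output redirected to
# `path`, and restores the console on exit even if `code` raises an error.
with_sink <- function(path, code) {
  sink(path, append = FALSE, type = "output")
  on.exit(sink(), add = TRUE)  # undo the redirection when the function exits
  force(code)                  # `code` is evaluated lazily, i.e. only here
  invisible(path)
}

# Stand-in for 'C:/R_code/hello.txt'
out_file <- file.path(tempdir(), "hello.txt")

with_sink(out_file, {
  cat("Hello World")
  var1 <- 5^3
  var2 <- 7^3
  cat("\nThe result of adding", var1, "to", var2, "is", var1 + var2)
})

# Read the file back to confirm what was written
written <- readLines(out_file, warn = FALSE)
print(written)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The on.exit call is the design choice that matters here: it runs whether the block finishes normally or fails halfway.
</span></div>
<br />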
<div style="text-align: justify;">
<b>Passing arguments to an RScript through command line/VBA:</b>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Most of the work which we do requires us to pass inputs/parameters to a tool at runtime. In the code above, let’s say we wanted ‘var1’ and ‘var2’ to be passed at runtime instead of being hardcoded the way they are right now. Let us create a simple Excel tool which accepts two numbers and adds them. The front end would look like this:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhV9nNzPYrjHSbziNUkZIxs_x6JQ9qKSPsd7VpMjWABnkc8GRF8r8e1GIk1I3IxjESeusNhOFNSUNoid62APPlFmyHh0dsOEhzU8iT9prQ8j6JjksXQLZd3C7gQB15bcOFzMBxn31e70MO7/s1600/Adder_front_end.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhV9nNzPYrjHSbziNUkZIxs_x6JQ9qKSPsd7VpMjWABnkc8GRF8r8e1GIk1I3IxjESeusNhOFNSUNoid62APPlFmyHh0dsOEhzU8iT9prQ8j6JjksXQLZd3C7gQB15bcOFzMBxn31e70MO7/s320/Adder_front_end.png" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Now, the only change in the VBA code would be to read the inputs from cells D5 and F5 and pass them on to the RScript. The modified code would look like:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">Sub</span> <span style="color: #0066bb; font-weight: bold;">RunRscript</span>()
<span style="color: #888888;">'runs an external R code through Shell</span>
<span style="color: #888888;">'The location of the RScript is 'C:\R_code'</span>
<span style="color: #888888;">'The script name is 'hello.R'</span>
<span style="color: #008800; font-weight: bold;">Dim</span> shell <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Object</span>
<span style="color: #008800; font-weight: bold;">Set</span> shell <span style="color: #333333;">=</span> VBA.CreateObject(<span style="background-color: #fff0f0;">"WScript.Shell"</span>)
<span style="color: #008800; font-weight: bold;">Dim</span> waitTillComplete <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Boolean</span>: waitTillComplete <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">True</span>
<span style="color: #008800; font-weight: bold;">Dim</span> style <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Integer</span>: style <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">Dim</span> errorCode <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Integer</span>
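<span style="color: #888888;">'Each variable needs its own type; 'Dim var1, var2 As Double' would make var1 a Variant</span>
<span style="color: #008800; font-weight: bold;">Dim</span> var1 <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Double</span>, var2 <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Double</span>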
var1 <span style="color: #333333;">=</span> Sheet1.Range(<span style="background-color: #fff0f0;">"D5"</span>).Value
var2 <span style="color: #333333;">=</span> Sheet1.Range(<span style="background-color: #fff0f0;">"F5"</span>).Value
<span style="color: #008800; font-weight: bold;">Dim</span> path <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">String</span>
path <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"RScript C:\R_code\hello.R "</span> <span style="color: #333333;">&</span> var1 <span style="color: #333333;">&</span> <span style="background-color: #fff0f0;">" "</span> <span style="color: #333333;">&</span> var2
errorCode <span style="color: #333333;">=</span> shell.Run(path, style, waitTillComplete)
<span style="color: #008800; font-weight: bold;">End</span> <span style="color: #008800; font-weight: bold;">Sub</span>
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Now, the VBA code is ready to pass two extra parameters to the Rscript and get it executed. But the change on the input side means we will also have to change the R code to accept the input parameters and process them. This can be accomplished using the commandArgs function in R, which reads the arguments and returns them as a character vector. The code changes as below:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># Accepts two numbers and adds them</span>
<span style="color: #888888;"># Re-directs the console output to a file 'hello.txt'</span>
<span style="color: #888888;"># The file is created in the directory 'C:\R_code'</span>
args<span style="color: #333333;"><-</span>commandArgs(trailingOnly<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">TRUE</span>)
<span style="color: #888888;"># cat(paste(args,collapse="\n"))</span>
sink(<span style="background-color: #fff0f0;">'C:/R_code/hello.txt'</span>,append<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">FALSE</span>,type<span style="color: #333333;">=</span><span style="background-color: #fff0f0;">"output"</span>)
cat(<span style="background-color: #fff0f0;">'Hello World'</span>)
var1<span style="color: #333333;"><-</span>as.numeric(args[<span style="color: #6600ee; font-weight: bold;">1</span>])
var2<span style="color: #333333;"><-</span>as.numeric(args[<span style="color: #6600ee; font-weight: bold;">2</span>])
cat(<span style="background-color: #fff0f0;">'\nThe result of adding'</span>,var1,<span style="background-color: #fff0f0;">'to'</span>,var2,<span style="background-color: #fff0f0;">'is'</span>,var1<span style="color: #333333;">+</span>var2)
sink(<span style="color: #008800; font-weight: bold;">NULL</span>)
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Note the use of the ‘trailingOnly’ option (set to TRUE) in the commandArgs function. This makes the args vector store only those arguments which are passed by the USER. In addition to the USER arguments, RScript passes some system arguments by default. If you are interested in reading those (like the location of the R file, etc.), then you would set the trailingOnly argument to FALSE.
</span></div>
<br />
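<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Since the arguments arrive as text, it is worth checking them before doing any arithmetic. Below is a minimal sketch of such a check. The helper name parse_two_numbers is hypothetical, and the args vector is simulated here so the snippet can be run interactively; in the real script it would come from commandArgs:
</span></div>

```r
# Hypothetical helper: converts the character vector returned by
# commandArgs(trailingOnly = TRUE) into two numbers, failing with a
# readable message when the input is not usable.
parse_two_numbers <- function(args) {
  if (length(args) != 2) {
    stop("Expected exactly 2 arguments, got ", length(args))
  }
  nums <- suppressWarnings(as.numeric(args))  # non-numeric text becomes NA
  if (anyNA(nums)) {
    stop("Arguments must be numeric: ", paste(args, collapse = " "))
  }
  nums
}

# In the real script: args <- commandArgs(trailingOnly = TRUE)
args <- c("125", "343")  # simulated command-line input
nums <- parse_two_numbers(args)
cat("The result of adding", nums[1], "to", nums[2], "is", sum(nums), "\n")
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Because stop() makes Rscript exit with a non-zero status, the calling VBA code can inspect the value returned by shell.Run to detect a failed run.
</span></div>
<br />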
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
You have now learnt how to invoke R from Excel and how to pass data between R and Excel. You can build on these two functionalities to develop some cool stuff which uses Excel as the front end and R as the backend. With packages<a href="http://www.r-bloggers.com/read-excel-files-from-r/"> like ‘xlsx’</a>, which can create data frames from Excel sheets, you can go on to build applications like these:
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgfNS0nGtyOQlznyDMnkneJKGveTh4KCUz_ji4n9q19o1PcE9CAdDIk5G4YuuCq9cAl7JSz3hfDT3H-4CJsawQBbMl1ZQnQoqCUv0X565MCNF4dENo09mbZbv5hBV_zdSwvM9rpU9GGno2/s1600/Regression.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjgfNS0nGtyOQlznyDMnkneJKGveTh4KCUz_ji4n9q19o1PcE9CAdDIk5G4YuuCq9cAl7JSz3hfDT3H-4CJsawQBbMl1ZQnQoqCUv0X565MCNF4dENo09mbZbv5hBV_zdSwvM9rpU9GGno2/s320/Regression.png" /><br /><span style="background-color: #f3f3f3;">Regression analysis tool can read input data from Excel and build OLS on R</span></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdslY0VbKzpcMf30W4yCPriN31H96HBkwLwFjFamIyRZxmV9QnPUhhG-e6_VBGzhXBMDg5kdmZJF0xwVwZ8gi_wj_GwL2EUEsjmsuQiD_SxEP8JtU7r1c8-HgwMW9MHMisOvnqRbCKTDze/s1600/Cluster.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdslY0VbKzpcMf30W4yCPriN31H96HBkwLwFjFamIyRZxmV9QnPUhhG-e6_VBGzhXBMDg5kdmZJF0xwVwZ8gi_wj_GwL2EUEsjmsuQiD_SxEP8JtU7r1c8-HgwMW9MHMisOvnqRbCKTDze/s320/Cluster.png" /><br /><span style="background-color: #cccccc;">K-Means Cluster tool</span></a></div>
<br />
<div style="text-align: justify;">
<b>Reading text/picture files in MS Excel using VBA:</b>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Although this part does not contain any R code, I am posting it for the sake of completeness. This way, you will have one complete tool to play with. Once you have the output of R in a text file/picture file, you can read it back into Excel using VBA and display the nicely formatted result in Excel. This part will be particularly useful if you want to create a tool that reads data from Excel, does some statistical analysis using R in the backend and then displays the summary of the analysis. Here is the VBA code you can use to parse through a text file:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">'Reads 'hello.txt' line by line and stores the result in Sheet2,</span>
<span style="color: #888888;">'starting from range A1, in consecutive rows</span>
<span style="color: #008800; font-weight: bold;">Sub</span> <span style="color: #0066bb; font-weight: bold;">ReadTextFile</span>()
<span style="color: #008800; font-weight: bold;">Dim</span> sFile <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">String</span>
sFile <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"C:\R_code\hello.txt"</span>
<span style="color: #008800; font-weight: bold;">Dim</span> rowNum <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">Long</span>
rowNum <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">Dim</span> ReadData <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">String</span>
Open sFile <span style="color: #008800; font-weight: bold;">For</span> Input <span style="color: black; font-weight: bold;">As</span> #<span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">Do</span> Until EOF(<span style="color: #0000dd; font-weight: bold;">1</span>)
Line Input #<span style="color: #0000dd; font-weight: bold;">1</span>, ReadData <span style="color: #888888;">'reads a whole line; Input # would split it at commas</span>
<span style="color: #008800; font-weight: bold;">If</span> Len(ReadData) <span style="color: #333333;">&gt;</span> <span style="color: #0000dd; font-weight: bold;">0</span> <span style="color: #008800; font-weight: bold;">Then</span>
Sheet2.Cells(rowNum, <span style="color: #0000dd; font-weight: bold;">1</span>).Value <span style="color: #333333;">=</span> ReadData
rowNum <span style="color: #333333;">=</span> rowNum <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">End</span> <span style="color: #008800; font-weight: bold;">If</span>
<span style="color: #008800; font-weight: bold;">Loop</span>
Close #<span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #888888;">'close the opened file</span>
<span style="color: #008800; font-weight: bold;">End</span> <span style="color: #008800; font-weight: bold;">Sub</span>
</pre>
</div>
<br />
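<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
On the R side, it helps to write the results one record per line so that a line-by-line reader in VBA can pick them up. The following is a minimal sketch under a few assumptions: the built-in cars dataset is used as stand-in data, a temporary file stands in for ‘C:/R_code/hello.txt’, and tab is chosen as the delimiter so that each line can be split into columns later:
</span></div>

```r
# Fit a simple linear model on the built-in 'cars' data (stand-in data)
fit <- lm(dist ~ speed, data = cars)

# Coefficient table: one row per term, with estimate, std. error, t and p values
coefs <- round(coef(summary(fit)), 4)

# Stand-in for 'C:/R_code/hello.txt'
out_file <- file.path(tempdir(), "hello.txt")

# Tab-delimited text, one record per line; col.names = NA writes an empty
# header cell above the row names so the columns stay aligned
write.table(coefs, file = out_file, sep = "\t", quote = FALSE, col.names = NA)

# Show what a line-by-line reader would see
result_lines <- readLines(out_file)
print(result_lines)
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Each line of the file then maps to one row in the Excel sheet, and splitting it on the tab character gives the individual cells.
</span></div>
<br />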
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
And the code below can be used to insert pictures into Excel using VBA:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;">'Inserts a picture located in C:\R_code into Sheet2 at position A1 onwards</span>
Sheet2.Activate <span style="color: #888888;">'Select only works on the active sheet</span>
Sheet2.Range(<span style="background-color: #fff0f0;">"$A$1"</span>).Select
<span style="color: #008800; font-weight: bold;">Dim</span> sFile <span style="color: black; font-weight: bold;">As</span> <span style="color: #333399; font-weight: bold;">String</span>
sFile <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"C:\R_code\mypicture1.jpg"</span>
ActiveSheet.Pictures.Insert(sFile) _
    .Select
Selection.ShapeRange.Height <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">324</span>
Selection.ShapeRange.Width <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">396</span>
    <span style="color: #008800; font-weight: bold;">With</span> Selection.ShapeRange.Line
        .Visible <span style="color: #333333;">=</span> msoTrue
        .ForeColor.ObjectThemeColor <span style="color: #333333;">=</span> msoThemeColorText1
        .ForeColor.TintAndShade <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>
        .ForeColor.Brightness <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>
        .Transparency <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>
    <span style="color: #008800; font-weight: bold;">End</span> <span style="color: #008800; font-weight: bold;">With</span>
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
So, with just a little bit of code to format your results, you can get the output laid out the way you want. Below is a sample of the output from a linear regression model showing model accuracy, beta coefficients (from the text file) and residual plots (from the picture):
</span></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDXYVJV7O6uah5gJjQGBmwzANqkBUYxHxKLCP-yoD0QAbUaqpkD01XKyj_vtneah86_WgVPyPLg9fGnpcdYhE0C8mKaeG-D7hrgVT3DKdrbQPyW8XGTodX8wsduxfBt8WbyUCwfTaNgbd/s1600/neatRegression.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgRDXYVJV7O6uah5gJjQGBmwzANqkBUYxHxKLCP-yoD0QAbUaqpkD01XKyj_vtneah86_WgVPyPLg9fGnpcdYhE0C8mKaeG-D7hrgVT3DKdrbQPyW8XGTodX8wsduxfBt8WbyUCwfTaNgbd/s640/neatRegression.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
This is just the beginning. Once you have integrated R with VBA and vice versa, there is a lot you can achieve. R can be a powerful backend for computations where Excel falls short, and I am sure we all agree that Excel is still the de facto standard for sharing and displaying summary reports. By using the interface techniques mentioned in this post, you can make the two complement each other very well. I would encourage you to try this out and let me know your thoughts in the comments below. If you like this post, then please follow this blog for more interesting posts, and tell your friends too :)
</span></div>
<br /></div>
<br />
<br />
<br />
Shashiahttp://www.blogger.com/profile/01602809065610957096noreply@blogger.com49tag:blogger.com,1999:blog-4849113628830840898.post-20600147325600509782013-09-29T02:10:00.001+05:302013-10-18T12:31:46.631+05:30Hello R World! <div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">Okay, it has been quite a while since the grand opening to this blog a month ago. Without wasting too much time, let us get to business right away:
</span></div><br/>
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To start with R, let us begin with the basic question which has puzzled many a learner when they are introduced to the R language: ‘what should be the approach to learning it?’ or, simply put, ‘where do I start?’
</span></div>
<div style="text-align: justify;">
<br />
<ul>
<li><span style="font-family: Arial, Helvetica, sans-serif;">A novice college student in a basic statistics course would want to think of R as an advanced calculator </span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">A statistician would want to think of R as an optimizer which would give the ‘best-fit’ model for all the observed data points </span></li>
<li><span style="font-family: Arial, Helvetica, sans-serif;">Switchers from STATA/SPSS/SAS would want to see it as a replacement for the software which they were so good at using (‘<i>How do I get this R to do PROC SQL/PROC REG?</i>’) </span></li>
</ul>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
While none of these perspectives on R is wrong, a programmer’s perspective would be to treat it as an object-oriented, interpreted language, and to understand the basic programming constructs and the programming environment, which is what I present to you in this post.
</span></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
So, the first thing you do is to get the coding environment set up on your system by downloading and installing R. A lot of support and documentation is available for this, so let us skip that part in this post.
</span></div>
<br />
<span style="font-family: Arial, Helvetica, sans-serif; text-align: justify;">Another thing which a programmer would be particularly interested in is the IDE for development. If you downloaded base R from the CRAN website, you already have a basic GUI for R, with a console to type out commands and execute them line by line, along with a simple editor where you can write lines of code and execute them together. This would look something like this:</span><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1Y9w8KaBzRgZMF-3T54PvqGiYdCamOx3xrp2YMkwKkEdP-zikM00bkYN8BzTjPxUQsyevOinc_GIgaUBAmT2wi6Eio0OAp7qPgP4ve_eFTisEijHLNdeE53yBfVi3S8JQQvZ1z7vvN3lw/s1600/RGUI.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="347" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1Y9w8KaBzRgZMF-3T54PvqGiYdCamOx3xrp2YMkwKkEdP-zikM00bkYN8BzTjPxUQsyevOinc_GIgaUBAmT2wi6Eio0OAp7qPgP4ve_eFTisEijHLNdeE53yBfVi3S8JQQvZ1z7vvN3lw/s640/RGUI.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Given the pace at which R has caught on in the programming/analytics world, an IDE was bound to follow, and sometime around 2011 came RStudio, an open source IDE for R. I have been using it for over a year now and have found it pretty useful: editor, graphics, console and workspace information are all integrated into a single easy-to-use interface. RStudio has gone from strength to strength and is now very popular among R users worldwide:
</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggkO3hNHsdxIpL01HiSUOPqWmwo3PnoTGkyNTGDo3lYx4F_mXAVI2vEU50wYgGTJlqcyxDSXuqtQiMSBUSzLdZbRubSsBnvrDVA39DTeIFPtrtIpFbuxL_GNWP03KN4XXjS87xY45Ca9K7/s1600/RStudio.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="340" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggkO3hNHsdxIpL01HiSUOPqWmwo3PnoTGkyNTGDo3lYx4F_mXAVI2vEU50wYgGTJlqcyxDSXuqtQiMSBUSzLdZbRubSsBnvrDVA39DTeIFPtrtIpFbuxL_GNWP03KN4XXjS87xY45Ca9K7/s640/RStudio.png" width="640" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Okay, now to some code. Let us see what the syntax is for the ubiquitous hello-world program. Because you can think of R as a programming language, as statistical software, or as both superimposed on one another, R has more than one way of accomplishing the same thing. By default, the value of every expression you type at the prompt is printed to the console. So, if you want to see the result of a simple math operation or conditional expression, just type it at the prompt and the result shows up on the console:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">></span> <span style="color: #6600ee; font-weight: bold;">2</span><span style="color: #333333;">*</span><span style="color: #6600ee; font-weight: bold;">13</span>
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">26</span>
<span style="color: #333333;">></span> <span style="color: #6600ee; font-weight: bold;">3</span><span style="color: #333333;">^</span><span style="color: #6600ee; font-weight: bold;">2</span>
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">9</span>
<span style="color: #333333;">></span> sqrt(<span style="color: #6600ee; font-weight: bold;">3940225</span>)
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">1985</span>
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Notice the number in square brackets before each result? The basic data type in R is called a ‘vector’ and can be thought of as an array; even a single number is a vector of length one. So, when the output is presented as <br />
[1] 26
<br />it just means that R has produced a vector with a single element, 26, and that the first value shown on that line is element number 1 of that vector. There are no rows or columns involved.
</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
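<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
To make the meaning of the bracketed label concrete, here is a minimal sketch (not from the original session; the names x and y are only illustrative). It checks that a single number really is a vector of length one, and shows that on a longer vector each output line is labelled with the index of its first element:
</span></div>

```r
# A single number is already a vector of length one.
x <- 2 * 13
is.vector(x)   # TRUE
length(x)      # 1
x[1]           # 26, the element that the [1] label points at

# A longer vector wraps across lines; each output line is labelled
# with the index of the first element shown on it.
y <- 101:130
print(y)
y[16]          # 116
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Where exactly the longer vector wraps depends on the width of your console, so the labels you see on the second and later lines may differ from one machine to another.
</span></div>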
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Coming back to our ‘hello world’ program, all we need to do to output text on the console is to call the function ‘<a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/print.html">print</a>’, which writes its argument to the console:
</span></div><br/>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">></span> print('hello world')
[<span style="color: #6600ee; font-weight: bold;">1</span>] "hello world"
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
There you go! The print function itself has a lot of options. For example, if you don’t like the quotes in your output, you can remove them. If you want to join two values and then print them, you can do that too, using the c() function to combine them into one vector (when you mix text and numbers, c() converts the numbers to text). Some examples below:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">></span> print(<span style="background-color: #fff0f0;">'hello world'</span>)
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="background-color: #fff0f0;">"hello world"</span>
<span style="color: #333333;">></span> print(<span style="background-color: #fff0f0;">'hello world'</span>,quote<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">FALSE</span>)
[<span style="color: #6600ee; font-weight: bold;">1</span>] hello world
<span style="color: #333333;">></span> print(pi)
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">3.141593</span>
<span style="color: #333333;">></span> print(pi,digits<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">3</span>)
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">3.14</span>
<span style="color: #333333;">></span> print(c(<span style="background-color: #fff0f0;">'The value of pi is'</span>,pi))
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="background-color: #fff0f0;">"The value of pi is"</span> <span style="background-color: #fff0f0;">"3.14159265358979"</span>
<span style="color: #333333;">></span> print(c(<span style="background-color: #fff0f0;">'The value of pi is'</span>,pi),quote<span style="color: #333333;">=</span><span style="color: #008800; font-weight: bold;">FALSE</span>)
[<span style="color: #6600ee; font-weight: bold;">1</span>] The value of pi is <span style="color: #6600ee; font-weight: bold;">3.14159265358979</span>
</pre>
</div>
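<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
The last two examples show pi with all its digits because c() turned the number into text before print ever saw it. The sketch below (my own illustration; mixed and pi_text are made-up names) confirms the conversion and shows one possible way to control the digits, by formatting the number before combining:
</span></div>

```r
# Mixing text and a number in c() gives a character vector.
mixed <- c('The value of pi is', pi)
class(mixed)    # "character"
length(mixed)   # 2

# digits= has no effect on text, so format the number first.
pi_text <- format(pi, digits = 3)
print(c('The value of pi is', pi_text), quote = FALSE)
```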
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
You have now accomplished your first programming task in R with the printing of 'hello world'. But wait, there is more… do you see that [1] before all the print outputs? What does that mean? Simply put, print treats whatever you pass to it as a vector and labels each output line with the index of the first element shown on that line. When you combined two values with c() and printed them, the result had two elements on a single line. You can already see that it can get messy if you print a long vector like this:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">></span> print(rep(pi,<span style="color: #6600ee; font-weight: bold;">20</span>),digits<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">3</span>)
[<span style="color: #6600ee; font-weight: bold;">1</span>] <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span>
[<span style="color: #6600ee; font-weight: bold;">16</span>] <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span> <span style="color: #6600ee; font-weight: bold;">3.14</span>
</pre>
</div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
How do we then stop R from adding those index labels to the output? This is where cat comes in:
</span></div><br/>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoGUabOXQ8H-zn-mim-sxiS9Bdu7ry-vVIxSZvbyDf-yNoBfaL0WXdHqk_Kzti3suoXnz52w8nkub_kD2_tmBO0yt_UbbrdjnRoLRYGNPZ_cCOCA0zLOUvdBH-KRSkdRHIrzcc2_yWaHYC/s1600/cat.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoGUabOXQ8H-zn-mim-sxiS9Bdu7ry-vVIxSZvbyDf-yNoBfaL0WXdHqk_Kzti3suoXnz52w8nkub_kD2_tmBO0yt_UbbrdjnRoLRYGNPZ_cCOCA0zLOUvdBH-KRSkdRHIrzcc2_yWaHYC/s320/cat.jpg" /></a></div>
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
No, not that cat :) <br/>
But this one: the function <a href="http://stat.ethz.ch/R-manual/R-devel/library/base/html/cat.html">cat()</a>. If you have used Excel, you already know the formula to concatenate strings. cat() does much the same thing in R: it concatenates its arguments and writes them straight to the console, with no index label and no quotes. Note that cat() does not add a newline at the end unless you ask for one with "\n". So the following would happen:
</span></div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: .1em .1em .1em .8em; border: solid gray; overflow: auto; padding: .2em .6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #333333;">></span> cat(<span style="background-color: #fff0f0;">'hello world'</span>)
hello world
<span style="color: #333333;">></span> cat(pi)
<span style="color: #6600ee; font-weight: bold;">3.141593</span>
<span style="color: #333333;">></span> cat(<span style="background-color: #fff0f0;">'The value of pi is'</span>,round(pi,digits<span style="color: #333333;">=</span><span style="color: #6600ee; font-weight: bold;">2</span>))
The value of pi is <span style="color: #6600ee; font-weight: bold;">3.14</span>
</pre>
</div>
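<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
A few more variations, as a rough sketch rather than a complete tour (msg1 and msg2 are illustrative names). It shows the explicit newline, the sep argument of cat(), and two base R functions, paste() and sprintf(), which build a string and return it instead of printing it:
</span></div>

```r
# cat() does not append a newline by itself, so add "\n" explicitly.
cat('The value of pi is', round(pi, digits = 2), '\n')

# sep controls what is placed between the arguments.
cat('a', 'b', 'c', sep = '-')
cat('\n')

# paste() and sprintf() return a string; they do not print anything
# until the result is handed to cat() or print().
msg1 <- paste('The value of pi is', round(pi, 2))
msg2 <- sprintf('The value of pi is %.2f', pi)
cat(msg1, '\n')
cat(msg2, '\n')
```

<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
Building the string first and printing it afterwards is handy when the same message also has to go into a file or a plot title.
</span></div>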
<br />
<div style="text-align: justify;">
<span style="font-family: Arial, Helvetica, sans-serif;">
I’ll leave it at that for now. The key takeaway is that since R is a programmer’s language, you can accomplish most of the things you want in the way you want, rather than stick to a convention as you would with proprietary software. If you didn't like either print or cat, there are other functions too, like paste() and sprintf(), which build the text for you to print. Sometimes, exploring all of this can get a little overwhelming and seem futile. But that’s where the power of open source comes in. R has a lot of support forums and communities where you can search for the function that suits your exact need. I refer to ‘Stack Overflow’ and ‘Cross Validated’ (the statistics site on Stack Exchange) and in most cases get whatever I need. You can explore them whenever you need help. I’ll take leave now and come back with more interesting posts soon. Till then, happy explo’R’ing ! :)
</span></div>
</div><br/>
<br/>
<br/>
Shashiahttp://www.blogger.com/profile/01602809065610957096noreply@blogger.com0tag:blogger.com,1999:blog-4849113628830840898.post-7571841401791325722013-08-18T00:40:00.001+05:302013-08-18T00:45:01.587+05:30R you ready?<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">I have had this thought in mind for a very long time now: opening a blog dedicated to a single programming language. I felt that it would be a great learning experience where I get to share whatever I read from other sources and the pieces of code that I have tried on my own, and learn from fellow coders in the blogverse. However, lack of time and my own inhibitions always stopped me from converting the thought into action. </span></div><br />
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">One of the main inhibitions that kept me from starting a blog devoted to a programming language was the lack of confidence to call myself a programmer. I have always been fascinated by technology, especially the information technology industry, and have spent my entire career working on technology solutions. Although I can devote hours to debugging code and finding out what doesn’t work, writing an efficient or fascinating piece of code does not come naturally to me. I have known, met and worked with gifted programmers who write code effortlessly, and I have been in awe of their abilities. What I came to realize from interacting with them was that even without gifted coding skills, one can, with some effort, become a ‘spotter’: someone who can spot nicely written code and appreciate the beauty and craft that go into coming up with such lines. I want to be a spotter, a collector or an integrator of sorts who gathers masterpieces of code-art into one collection that can serve as an archive for anyone who wants to delve deep into them! Although a coder might do the job of an artist by painting a nice picture, it is the collector who puts the picture on display and showcases the art to the people interested in it. This blog will be an effort to do exactly that: collect nice pieces of code and bring them together here. And yes, due credit and appreciation will definitely be given to the deserving artists!</span></div><br/>
<div style="text-align: justify;">
<b><span style="font-family: Georgia, Times New Roman, serif;">The choice of the language</span></b></div><br/>
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">So, having decided to start a programming blog, the immediate question was that of the programming language itself. It was a little more than a year ago that I set foot in the world of data analytics, data mining and statistical modeling, and I was quite fascinated by it. There were a lot of statistical packages available, but the majority of the work in corporate analytics continued to be done on… you guessed it right… Excel, the ubiquitous tool on which most consulting, IT, finance and business organizations rely even to this day. Apart from this, there was other analytical software available, like EViews, MATLAB, Stata, Crystal Ball, etc., but the choice was always going to be among the big three: <a href="http://www-01.ibm.com/software/analytics/spss/">SPSS</a>, <a href="http://www.sas.com/">SAS</a> and <a href="http://www.r-project.org/">R</a>. </span></div><br/>
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">SPSS is IBM’s proprietary tool for data analysis and has its origins in the social sciences. SAS is proprietary too; it comes from the statistical sciences and its procedures are used quite extensively to build models in marketing and life sciences. And then there is R, a GNU project: an open source implementation of the S language with object-oriented features of its own, and highly extensible. Coming from a programming background, the choice of the language to create a blog on seemed quite obvious: it had to be R! Open source, highly powerful, vectorization for complex tasks, eye-catching graphical support, extensibility through freely available packages, and lots of help on online forums are a few of the things that distinguish R and make it a natural choice for bloggers. But wait, there’s more to it. Most of the “data analysts” that I have come across in my industry come with an inherent bias against programming. In fact, a majority of the nascent analytics industry is made up of people who want to do something other than IT jobs. This blog will be an attempt to woo all these programming-averse candidates with the variety that R provides, and to demonstrate how simple it actually is to code some seemingly complex tasks using OOP concepts. No, you would not need a SAS/SPSS macro for complex tasks. </span></div><br/>
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">While most of the content on this blog will make references to proprietary tools and procedures like SAS/SPSS, the intent is to showcase the simplicity of the R language and not to show any other software in a poor light. If you are looking for a comparative study on which software is better for statistical computing, this site is not going to help you. In fact, the debate on which software/tool is the best for data analytics has been on for quite some time now with no clear winner in sight. If you want my opinion on that, just stop worrying about the tool and instead focus on the design, technique or the underlying statistical concept. Once you master that, putting it on a tool becomes a formality. I read this somewhere: ‘<i>if your only tool is a hammer, every problem in the world looks like a nail</i>’. To know more about the comparative evaluation of statistical packages, visit the pages <a href="http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=0">here</a>, <a href="http://r4stats.com/articles/popularity/">here</a> and <a href="http://www.theregister.co.uk/2011/02/07/revolution_r_sas_challenge/">here</a>.</span></div><br/>
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">In fact, in spite of having a lot of online support and extensibility, R still has a few limitations, such as the lack of easy interfaces for debugging and the inability of base R to handle data larger than the system’s RAM. As we go further in this blog, we will explore each of these limitations and how they can be worked around. And in cases where R does not have a solution, we will admit that other packages are better and move on.</span></div><br/>
<div style="text-align: justify;">
<span style="font-family: Georgia, Times New Roman, serif;">If you liked what you’ve read and want to join/contribute, please feel free to reach out to me. If you want to follow the blog and learn more about R, kindly click on the ‘follow’ button on the left side of the page. You can join through Google or follow me on Facebook <a href="https://www.facebook.com/shashidhar.shenoy">here</a>. Comments/suggestions for improvements are always welcome.</span></div><br/>
</div>Shashiahttp://www.blogger.com/profile/01602809065610957096noreply@blogger.com0