Wednesday, October 16, 2013

VBA front end for R

If you work in the analytics industry, I am sure you would have had this debate sometime or the other – pros and cons of R. While everyone agrees that R is quite powerful and has great graphics, most of us, especially those who have worked on GUI based tools like SASEG, etc agree that the text output of R can be pretty verbose. A colleague of mine ran a linear model and immediately exclaimed ‘it looks so bland!’

This set me exploring ways to beautify R – I found some interesting packages which would help in formatting the output – you can check out prettyR and html converter packages which can do wonders to the plain text output in R. However, my requirements were a little customized. We used excel in most of our day to day activities and VBA is quite powerful in parsing/formatting the results. So, why not use Excel and VBA to create a beautiful front end to run R? It could be a macro enabled tool which will read input from an excel sheet, run the regression code using RScript and display the formatted output on excel. Well, turns out that I was able to do all that and even more – this post explains the findings of my endeavor:

The following questions will be answered during the course of this article:
- How to run an RScript through MS Excel using VBA?
- How to run an RScript through command prompt? [in WINDOWS]
- How to pass arguments to an RScript through command line/external code? [in WINDOWS]
- How to read plain text files in MS Excel using VBA? [obviously WINDOWS :) ]

Just so that we know that our commands are executed correctly, let us write the following simple R code and save it in our directory ‘C:\R_code’ as ‘hello.R’

Contents of ‘C:\R_code\hello.R’
# Prints output to console
cat('Hello World')
var1<-5^3
var2<-7^3
cat('\nThe result of adding',var1,'to',var2,'is',var1+var2)

Running RScript through command prompt:

To be able to run R code through command prompt/other applications, you need to have the path of ‘R.exe’ and ‘RScript.exe’ in your system variable PATH. You can do this easily if you have admin rights to your system. Check this link to know how to do it on WINDOWS7. However, if you don’t have admin rights and want to add something to the PATH variable, don’t worry – you can easily add this to the USER variable PATH. Here are the steps on how to do this:

1. Suppose you have the ‘R.exe’ and ‘RScript.exe’ installed in the following directory: ‘C:\Program Files\R\R-2.15.3\bin\x64’. Copy this path to your clipboard.
2. Go to ‘Computer’ -- > Properties
3. On the left pane, click on ‘Advanced system settings’
4. On the ‘System properties’ dialog that opens up, navigate to the ‘Advanced’ tab and click on ‘Environment variables…’


5. On the ‘User variables’ click on ‘New…’


6. In the field ‘Variable Name:’, type PATH
7. In the field ‘Variable Value:’, paste the clipboard value, ie ‘C:\Program Files\R\R-2.15.3\bin\x64’. Add a semicolon ‘;’ after that.
8. Click on ‘Ok’ as many times to dismiss all dialog boxes.
9. Open command prompt and type ‘Rscript’ and hit ENTER. You will see the following:


Now that you have Rscript on your path, you can run R code from any directory on your system, including applications like MS-Excel through VBA. Just repeat step 9 by passing any *.R file as argument with the full path and it will execute as expected:


Running RScript through VBA:

To run this code in MS-Excel using VBA, you need to open a macro enabled worksheet (*.xlsm). To create a new macro enabled sheet, just create a new workbook and click on ‘Save As..’ and save as ‘Excel Macro-Enabled Workbook (*.xlsm)’. Once you have a macro-enabled workbook open, press the shortcut key combination ‘ALT + F11’ to open up the VBA editor. Once that is done, right click on the ‘Project Explorer’ to create a new module (which will be Module1 by default) and then type the following VBA code:

Sub RunRscript()
'runs an external R code through Shell
'The location of the RScript is 'C:\R_code'
'The script name is 'hello.R'

Dim shell As Object
Set shell = VBA.CreateObject("WScript.Shell")
Dim waitTillComplete As Boolean: waitTillComplete = True
Dim style As Integer: style = 1
Dim errorCode As Integer
Dim path As String
path = "RScript C:\R_code\hello.R"
errorCode = shell.Run(path, style, waitTillComplete)
End Sub

If you look at the VBA code carefully, it creates a Windows Shell object and invokes the R command through the shell. Also, the advantage of using Wscript.shell is that you can get VBA to wait till the execution is finished. To get more information on how to run a macro or use the VBA editor, you can refer to a lot of online tutorials that are easily available. A good place to start would be the MSDN tutorial which you can find here.

When you run this macro, you can see that a command window opens up, executes something and closes. But how do you know if the code has actually executed? A good way to redirect the console output on the R code to a file. You can use this by the sink function in R. Here is the modified R code which accomplishes the same:

# Re-directs the console output to a file 'hello.txt'
# The file is created in the directory 'C:\R_code'

sink('C:/R_code/hello.txt',append=F,type="output")
cat('Hello World')
var1<-5^3
var2<-7^3
cat('\nThe result of adding',var1,'to',var2,'is',var1+var2)
sink(NULL)

Once you’ve run the VBA macro, browse to C:\R_code and check if the ‘hello.txt’ has been created or not. If you can find the file there, then congratulations! You have run successfully used VBA to execute an R script.

Passing arguments to an RScript through command line/VBA:

Most of the work which we do requires us to pass inputs/parameters to a tool at runtime. In the code above, let’s say, we wanted ‘var1’ and ‘var2’ to be passed during runtime instead of being hardcoded the way they are right now. Let us create a simple excel tool which accepts two numbers and adds them, the front end would look like this:


Now, the only change in the VBA code would be to read inputs from cells D5 and F5 and pass it on to the RScript. The modified code would look like:

Sub RunRscript()
'runs an external R code through Shell
'The location of the RScript is 'C:\R_code'
'The script name is 'hello.R'
Dim shell As Object
Set shell = VBA.CreateObject("WScript.Shell")
Dim waitTillComplete As Boolean: waitTillComplete = True
Dim style As Integer: style = 1
Dim errorCode As Integer
Dim var1, var2 As Double
var1 = Sheet1.Range("D5").Value
var2 = Sheet1.Range("F5").Value

Dim path As String
path = "RScript C:\R_code\hello.R " & var1 & " " & var2

errorCode = shell.Run(path, style, waitTillComplete)

End Sub

Now, the VBA code is ready to pass two extra parameters to the Rscript and get it executed. But, the change on the input side means we will also have to change the R code to accept the input parameters and process them. This can be accomplished very well using the commandArgs function in R which will read the arguments and store it as a vector. The code changes as below:

# Accepts two numbers and adds them
# Re-directs the console output to a file 'hello.txt'
# The file is created in the directory 'C:\R_code'

args<-commandArgs(trailingOnly=T)
# cat(paste(args,collapse="\n"))
sink('C:/R_code/hello.txt',append=F,type="output")
cat('Hello World')
var1<-as.numeric(args[1])
var2<-as.numeric(args[2])
cat('\nThe result of adding',var1,'to',var2,'is',var1+var2)
sink(NULL)

Note the use of the ‘trailingOnly=T’ option in the commandArgs function. This would make the args vector store only those arguments which are passed by the USER. In addition to the USER arguments, RScript passes some system arguments by default. If you are interests in modifying those (like the directory of the R file, etc), then you would probably keep the trailingOnly argument to FALSE.

You now learnt how to invoke R from excel and how to pass data between R and excel. You can build on these two functionalities to develop some cool stuff which use Excel as front end and R as the backend. By the use of packages like ‘xlsx’ which can create data frames from excel sheets, you can go on to build so many applications like these:



Reading text/picture files in MS Excel using VBA:

Although this part does not contain any R codes, I am posting this for the sake of completeness. This way, you will have one complete tool to play with. Once you have the output of R in a text file/picture file, you can read it back into Excel using VBA and display the nicely formatted result in excel. This part will be particularly useful if you want to create a tool that reads data from excel, does some statistical analysis using R in the backend and then displays the summary of the analysis. Here is the VBA code you can use to parse through a text file:

this code will read from a file hello.txt and store the result Sheet2 starting from range A1 in consecutive rows
Dim sFile As String
sFile = "C:\R_code\hello.txt"

Dim rowNum As Integer
rowNum = 1
Set dest = Sheet2.Cells(rowNum, 1)

Open sFile For Input As #1
Do Until EOF(1)
    Input #1, ReadData
    If Not IsEmpty(ReadData) Then
        dest.Cells = ReadData
        rowNum = rowNum + 1
        Set dest = Sheet1.Cells(rowNum, 1)
    End If
Loop
Close #1 'close the opened file

And the below code can be used to copy pictures into VBA:

Inserts a picture located in R_code into Sheet2 at position A1 onwards
Sheet2.Range("$A$1").Select
Dim sFile As String
sFile = C:\R_code\mypicture1.jpg"

ActiveSheet.Pictures.Insert(sFile) _
        .Select
Selection.ShapeRange.Height = 324
Selection.ShapeRange.Width = 396
    With Selection.ShapeRange.Line
        .Visible = msoTrue
        .ForeColor.ObjectThemeColor = msoThemeColorText1
        .ForeColor.TintAndShade = 0
        .ForeColor.Brightness = 0
        .Transparency = 0
    End With

So, with just little bit of code to format your results, you can get nicely formatted results in the way that you want. Below is the sample of the output from a linear regression model showing model accuracy, beta coefficients (from text file) and residual plots (from picture):


This is just the beginning. Once you have integrated R with VBA and vice versa, there is no limit to what you can achieve. R can be a powerful backend for computations where excel fails and I am sure we all agree that excel is still the de-facto standard for sharing and displaying summary reports. By using the interface techniques mentioned in this post, you can make the two of these complement each other very well. I would encourage you to try this out and let me know your thoughts in the comments below. If you like this post, then please follow this blog for more interesting posts, and tell your friends too :)




Sunday, September 29, 2013

Hello R World!

Okay, it has been quite a while since the grand opening to this blog a month ago. Without wasting too much time, let us get to business right away:

To start with R, let us begin with the basic question which has puzzled many a learner when they are introduced to language R : 'what should be an approach to go about learning it?’ or simply put ‘where do I start?’

  • A novice college student in an basic statistics course would want to think of R as an advanced calculator 
  • A statistician would want to think of R as an optimizer which would give the ‘best-fit’ model for all the observed data points 
  • Switchers from STATA/SPSS/SAS would want to see it as a replacement to their current software which they were so good at using (How do I get this R to do PROC SQL/PROC REG?’) 

While none of these perspectives on R are wrong, a programmer’s perspective would be to treat as an object oriented interpreted language, and understand the basic programming constructs and the programming environment, which is what I present to you in this post.

So, the first thing you do is to get the environment for coding set up on your system by downloading and installing R – a lot of support and documentation is available for the same, and let us skip that portion in this post.

Another thing which a programmer would be particularly interested in would be in the IDE for development. If you downloaded the base-R from CRAN website, you would have already gotten a basic GUI for R – with a console to type out commands and execute them line by line, along with a simple editor where you can write lines of code and execute them together. This would look something like this:


Going by the pace that R has caught up in the programming/analytics world, it was imperative that an IDE was needed and sometime around 2011 came RStudio – an open source IDE for R. I have been using it for over a year now and found it to be pretty useful – editor, graphics, console and the workspace information… all integrated into a single easy-to-use interface. RStudio has grown from strength to strength and it is now very popular among R users worldwide:


Okay, now to some code. Let us see what the syntax is for the ubiquitous hello-world program. Because you can think of R as a programming language or statistical software or both of these super imposed on one, R has more than one way of accomplishing the same thing. And every command you execute is written to the console output by default. So, if you want to see the result of a simple math operation/conditional expression, just type them in the inputs and the results are up on the console:

> 2*13
[1] 26
> 3^2
[1] 9
> sqrt(3940225)
[1] 1985 

Notice the square parentheses before each result? That is because all results are converted internally before output. The basic data type in R is called a ‘vector’ and can be thought of as an array. So, when the output is presented as
[1] 26
it just means that R has created a vector of just one element to store the value of ‘26’ in the first column of the first row.


Coming back to our ‘hello world’ program, all we need to do to output text on the console is to use the function ‘print’ which outputs values to the console


print (‘hello world’)
[1] “hello world”

There you go! There are a lot of options in the print function itself. For example, if you don’t like the quotes to be present in your output, you can remove it. If you want to join two vectors and then print them, you can do that too using the c() operator to join vectors. Some examples below:

> print('hello world')
[1] "hello world"
> print('hello world',quote=F)
[1] hello world
> print(pi)
[1] 3.141593
> print(pi,digits=3)
[1] 3.14
> print(c('The value of pi is',pi))
[1] "The value of pi is" "3.14159265358979"  
> print(c('The value of pi is',pi),quote=F)
[1] The value of pi is 3.14159265358979

You have now accomplished your first programming task in R with the printing of 'hello world'. But wait, there is more… do you see that [1] before all the print outputs? What does that mean? Simply put, it means that R has converted whatever you passed to the function ‘print’ into a vector and written out the result to the console. And when you passed two arguments to print, the result that got printed had two columns. You can already see that it can get messy if you passed many arguments to print like this:

> print(rep(pi,20),digits=3)
 [1] 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
[16] 3.14 3.14 3.14 3.14 3.14

How do we then stop R from doing that automatic conversion to the vector? This is where cat comes in:


No, not that cat :)
But this one: the function cat(). If you have used excel, you already know of the formula to concatenate strings. Cat() does the exact same thing in R. And since it works with strings, the automatic conversion to vector will not happen. So the following would happen:

> cat('hello world')
hello world
> cat(pi)
3.141593
> cat('The value of pi is',round(pi,digits=2))
The value of pi is 3.14

I’ll leave it at that for now. The key takeaway is that since R is a programmer’s language, you can accomplish most of the things which you desire, rather than stick to some convention which you would do if you used a proprietary software. If you didn't like either of print or cat, there are other functions too like paste(), printf() , etc which can do the same thing. Sometimes, exploring all of this can get a little overwhelming and seem futile. But that’s where the power of open source comes in. R has a lot of support forums and communities where you can search for the exact function which will suit your exact need. I refer to ‘stack overflow’ and ‘stat exchange’ and in most cases get whatever I need. You can explore them whenever you need help. I’ll take leave now and come back with more interesting posts soon. Till then, happy explo’R’ing ! :)



Sunday, August 18, 2013

R you ready?

I have had this thought in mind for very long now – about opening a dedicated blog only for a programming language. I felt that it will be a great learning experience where I get to share whatever I read from other sources, pieces of code that I have tried on my own and learn from other fellow coders in the blogverse. However, lack of time and my own internal inhibitions were always stopping me from converting the thought into action.

One of the main inhibitions inside me which kept preventing me from starting an exclusive programming language blog was the lack of confidence to call myself a programmer. I have always been fascinated with technology, especially the information technology industry, and spent my entire career working on technology solutions. Although I can devote hours of effort in debugging code and finding out things which don’t work, writing an efficient or fascinating piece of code does not come naturally to me. I have known some, met some and worked with gifted programmers who write codes like a breeze. I have been in awe of their programming capabilities. What I came to realize upon interactions with them was that even if one does not have gifted coding skills, it takes some effort to become a ‘spotter’ – someone who can spot nicely written code and appreciate the beauty and the craftiness which goes in coming up with such lines. I want to be a spotter, a collector or an integrator of sorts who collects masterpieces of code-art into one nice collection that can serve as an archive for anyone who wants to delve deep into! Although a coder might do the job of an artist by painting a nice picture, it is the collector who puts up the picture on display and showcases the art to the people interested in it. This blog will be an effort to do exactly that – collect all nice pieces of code and integrate them here. And yes, due credit and appreciation will definitely be given to the deserving artists!

The choice of the language

So, having decided to start a programming blog, the immediate question was that of the programming language itself. It was a little more than a year ago that I set foot into the world of data analytics, data mining and statistical modeling and was quite fascinated by it. There were a lot of statistical packages available, but majority of the work in corporate analytics continued to be done on… you guessed it right … EXCEL – the ubiquitous tool on which most consulting, IT, finance, and business organizations rely on, even to this day. Apart from this, there was other analytical software available like eviews, matlab, stata, crystal ball, etc but the choice was always going to be among the big three – SPSS, SAS and R.

SPSS is IBM’s proprietary tool for data analysis and finds its origin in the social sciences. SAS is proprietary too – it comes from the statistical sciences pavilion and its procedures are used quite extensively to build models in marketing and life sciences. And then there is R – an offering from the GNU community, backed by the power of object oriented concepts in C++/JAVA which is highly extensible. Coming from a programming background, the choice of the language to create a blog on seemed quite obvious – it had to be R! Open source, highly powerful, vectorization for complex tasks, extremely eye-catchy graphical support, extensibility through freely available packages, and lots of help on online forums are few things which distinguish R and make it a natural choice for bloggers. But wait, there’s more to it. Most of the “data analysts” that I have come across in my industry come with an inherent bias against programming. In fact, a majority of the nascent analytics industry is formed from people who want to do something else other than IT jobs. This blog will be an attempt to woo all these programming averse candidates with the variety that R provides, and to demonstrate how simple it actually is to code some seemingly complex tasks using OOP concepts. No, you would not need a SAS/SPSS macro for complex tasks.

While most of the content on this blog would make references to proprietary tools and procedures like SAS/SPSS, the intent would be showcasing the simplicity of the language R and not to show any other software in poor light. If you are looking for a comparative study on which software is better for statistical computing, this site is not going to help you. In fact, the debate on which software/tool is the best for data analytics has been on for quite some time now with no clear winner in sight. If you want my opinion on that, just stop worrying about the tool and instead focus on the design, technique or the underlying statistical concept. Once you master that, putting it on a tool becomes a formality. I read this somewhere – ‘if your only tool is a hammer, every problem in the world looks like a nail’. To know more about the comparative evaluation of statistical packages, visit the pages here, here and here.

In fact, in spite of having a lot of online support and extensibility, R still has few limitations in terms of lack of easy interfaces for debugging and inability of the base package to support data higher than system’s RAM. As we go further in this blog, we will continue to explore each of these limitations and address the issue of how they can be worked around. And in cases where R does not have a solution, admit that other packages are better and move on.

If you liked what you’ve read and want to join/contribute, please feel free to reach out to me. If you want to follow the blog and learn more about R, kindly click on the ‘follow’ button on the left side of the page. You can join through google or follow me on facebook here. Comments/suggestions for improvements are always welcome.