Sunday, September 29, 2013

Hello R World!

Okay, it has been quite a while since the grand opening of this blog a month ago. Without wasting too much time, let us get down to business right away:

To start with R, let us begin with the basic question that has puzzled many a learner on first being introduced to the R language: ‘How should I go about learning it?’ or, simply put, ‘Where do I start?’

  • A novice college student in a basic statistics course would want to think of R as an advanced calculator
  • A statistician would want to think of R as an optimizer which would give the ‘best-fit’ model for all the observed data points 
  • Switchers from STATA/SPSS/SAS would want to see it as a replacement for the software they were so good at using (‘How do I get this R to do PROC SQL/PROC REG?’)

While none of these perspectives on R is wrong, a programmer’s perspective would be to treat it as an object-oriented, interpreted language, and to understand its basic programming constructs and programming environment, which is what I present to you in this post.

So, the first thing to do is to set up the coding environment on your system by downloading and installing R. A lot of support and documentation is available for that, so let us skip that portion in this post.

Another thing a programmer would be particularly interested in is the IDE for development. If you downloaded base R from the CRAN website, you already have a basic GUI for R, with a console to type out commands and execute them line by line, along with a simple editor where you can write lines of code and execute them together.


Given the pace at which R has caught on in the programming/analytics world, a proper IDE was badly needed, and sometime around 2011 came RStudio, an open source IDE for R. I have been using it for over a year now and have found it pretty useful: editor, graphics, console and workspace information, all integrated into a single easy-to-use interface. RStudio has gone from strength to strength and is now very popular among R users worldwide.


Okay, now to some code. Let us see what the syntax is for the ubiquitous hello-world program. Because you can think of R as a programming language or statistical software or both of these superimposed on one another, R has more than one way of accomplishing the same thing. And the value of every expression you type at the prompt is written to the console by default. So, if you want to see the result of a simple math operation or conditional expression, just type it at the prompt and the result shows up on the console:

> 2*13
[1] 26
> 3^2
[1] 9
> sqrt(3940225)
[1] 1985 

Notice the number in square brackets before each result? That is because every result in R is a vector. The basic data type in R is called a ‘vector’ and can be thought of as an array; even a single number is a vector of length one. So, when the output is presented as
[1] 26
it just means that the result is a vector with just one element, 26, and that the [1] marks the index of the first element printed on that line.
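
To make the idea of a vector a little more concrete, here is a small sketch. The variable names `single` and `results` are just illustrative choices of mine; `c()` simply combines values into one vector.

```r
# A single number is already a vector of length one.
single <- 2 * 13
is.vector(single)   # TRUE
length(single)      # 1

# c() combines several values into one longer vector.
results <- c(2 * 13, 3^2, sqrt(3940225))
length(results)     # 3

# Square brackets pick out an element by its index (counting starts at 1).
results[1]          # 26
results[3]          # 1985
```

The index you type inside the brackets is the same kind of number that R shows in brackets at the start of an output line.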


Coming back to our ‘hello world’ program, all we need to do to output text is to use the function ‘print’, which writes values to the console:


> print('hello world')
[1] "hello world"

There you go! The print function itself has a number of options. For example, if you don’t like the quotes in your output, you can remove them. If you want to join two vectors and then print them, you can do that too, using the c() function to combine them. Some examples below:

> print('hello world')
[1] "hello world"
> print('hello world',quote=F)
[1] hello world
> print(pi)
[1] 3.141593
> print(pi,digits=3)
[1] 3.14
> print(c('The value of pi is',pi))
[1] "The value of pi is" "3.14159265358979"  
> print(c('The value of pi is',pi),quote=F)
[1] The value of pi is 3.14159265358979
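
One detail worth spelling out from the last two examples: all elements of a vector must share a type, so when c() joins a string and a number, the number is turned into a string. A minimal sketch, with `msg` and `short` as illustrative names of my own:

```r
# Joining text and a number gives a character vector with two elements.
msg <- c('The value of pi is', pi)
class(msg)     # "character"
length(msg)    # 2

# Rounding before joining keeps the printed text short.
short <- c('The value of pi is', round(pi, 2))
short[2]       # "3.14"
print(short, quote = FALSE)
```

That is why the digits=3 option would not shorten pi in such a joined vector: by the time print sees it, pi is already text.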

You have now accomplished your first programming task in R with the printing of 'hello world'. But wait, there is more… do you see that [1] before all the print outputs? What does that mean? Simply put, print writes out a vector, and the number in brackets is the index of the first element shown on that line. When we joined a string and pi using c(), print received a single vector with two elements (with pi converted to a string so that both elements have the same type), and both were printed on one line. You can already see that it can get messy if you pass a long vector to print like this:

> print(rep(pi,20),digits=3)
 [1] 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
[16] 3.14 3.14 3.14 3.14 3.14
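
The bracketed prefixes are easier to read when the elements differ from one another. Here is a small sketch using seq(), which is not used elsewhere in this post, to build an illustrative vector `x`; where the lines wrap depends on how wide your console is.

```r
# Thirty numbers: 10, 20, ..., 300.
x <- seq(10, 300, by = 10)
length(x)   # 30

# Each printed line starts with the index of its first element.
print(x)

# Indexing with such a number retrieves the element at that position.
x[1]        # 10
x[16]       # 160
```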

How do we then get plain output without the index prefix and the quotes? This is where cat comes in:


No, not that cat :)
But this one: the function cat(). If you have used Excel, you already know the formula to concatenate strings. cat() does much the same thing in R: it converts its arguments to strings, joins them with a space in between, and writes the result straight to the console, with no index prefix and no quotes. (Note the lowercase name; R is case sensitive.) So the following would happen:

> cat('hello world')
hello world
> cat(pi)
3.141593
> cat('The value of pi is',round(pi,digits=2))
The value of pi is 3.14
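
A couple of things about cat() that the examples above do not show, sketched here with the standard `sep` argument: cat() does not add a line break on its own, so in a script you usually end the text with "\n", and you can change the separator it puts between arguments.

```r
# cat() does not add a newline by itself, so end the text with "\n".
cat('hello world\n')

# Arguments are separated by a single space unless sep says otherwise.
cat('The value of pi is', round(pi, digits = 2), '\n')
cat('a', 'b', 'c', sep = '-')
cat('\n')

# A vector is written element by element, with no [1] prefix and no quotes.
cat(rep(round(pi, 2), 5), '\n')
```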

I’ll leave it at that for now. The key takeaway is that since R is a programmer’s language, you can accomplish most things in whichever way you like, rather than stick to one convention as you would with proprietary software. If you don't like either print or cat, there are other functions too, like paste(), sprintf(), etc., which build and format strings that you can then write out. Sometimes, exploring all of this can get a little overwhelming and seem futile. But that’s where the power of open source comes in. R has a lot of support forums and communities where you can search for the exact function to suit your exact need. I refer to Stack Overflow and the statistics Stack Exchange site (Cross Validated) and in most cases get whatever I need. You can explore them whenever you need help. I’ll take leave now and come back with more interesting posts soon. Till then, happy explo’R’ing ! :)