Sunday, August 18, 2013

R you ready?

I have had this thought in mind for a long time now: opening a blog dedicated to a single programming language. I felt that it would be a great learning experience, where I get to share whatever I read from other sources and pieces of code that I have tried on my own, and learn from fellow coders in the blogverse. However, lack of time and my own inhibitions always stopped me from converting the thought into action.

One of the main inhibitions that kept me from starting an exclusive programming language blog was the lack of confidence to call myself a programmer. I have always been fascinated with technology, especially the information technology industry, and have spent my entire career working on technology solutions. Although I can devote hours of effort to debugging code and finding out why things don’t work, writing an efficient or fascinating piece of code does not come naturally to me. I have known, met and worked with gifted programmers who write code with ease, and I have been in awe of their programming capabilities. What I came to realize from my interactions with them was that even if one does not have gifted coding skills, it only takes some effort to become a ‘spotter’ – someone who can spot nicely written code and appreciate the beauty and the craft that goes into coming up with such lines.

I want to be a spotter, a collector or an integrator of sorts who gathers masterpieces of code-art into one nice collection that can serve as an archive for anyone who wants to delve deep into it! Although a coder might do the job of an artist by painting a nice picture, it is the collector who puts the picture on display and showcases the art to the people interested in it. This blog will be an effort to do exactly that – collect nice pieces of code and bring them together here. And yes, due credit and appreciation will definitely be given to the deserving artists!

The choice of the language

So, having decided to start a programming blog, the immediate question was that of the programming language itself. It was a little more than a year ago that I set foot into the world of data analytics, data mining and statistical modeling and was quite fascinated by it. There were a lot of statistical packages available, but the majority of the work in corporate analytics continued to be done on… you guessed it right… Excel – the ubiquitous tool on which most consulting, IT, finance, and business organizations rely, even to this day. Apart from this, there was other analytical software available, like EViews, MATLAB, Stata, Crystal Ball, etc., but the choice was always going to be among the big three – SPSS, SAS and R.

SPSS is IBM’s proprietary tool for data analysis and finds its origin in the social sciences. SAS is proprietary too – it has its roots in the statistical sciences and its procedures are used quite extensively to build models in marketing and life sciences. And then there is R – part of the GNU project, a highly extensible language with its own object-oriented features, whose core is written largely in C, Fortran and R itself. Coming from a programming background, the choice of the language to create a blog on seemed quite obvious – it had to be R! Open source, highly powerful, vectorization for complex tasks, extremely eye-catching graphical support, extensibility through freely available packages, and lots of help on online forums are a few of the things which distinguish R and make it a natural choice for bloggers. But wait, there’s more to it. Most of the “data analysts” that I have come across in my industry come with an inherent bias against programming. In fact, a majority of the nascent analytics industry is made up of people who want to do something other than IT jobs. This blog will be an attempt to woo all these programming-averse candidates with the variety that R provides, and to demonstrate how simple it actually is to code some seemingly complex tasks using OOP concepts. No, you would not need a SAS/SPSS macro for complex tasks.
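To give a flavour of what vectorization means in practice, here is a minimal sketch. The sales figures and variable names are made up purely for illustration; the point is only that base R can apply a calculation to a whole vector in one expression where other tools might need a loop or a macro.

```r
# Hypothetical data: six months of sales for one product (made-up numbers).
sales <- c(120, 135, 150, 110, 160, 175)

# Loop version: month-over-month growth, one element at a time.
growth_loop <- numeric(length(sales) - 1)
for (i in 2:length(sales)) {
  growth_loop[i - 1] <- (sales[i] - sales[i - 1]) / sales[i - 1]
}

# Vectorized version: the same calculation in a single expression.
# diff() gives consecutive differences; head(sales, -1) drops the last value.
growth_vec <- diff(sales) / head(sales, -1)

# Both approaches agree (up to floating-point tolerance).
stopifnot(isTRUE(all.equal(growth_loop, growth_vec)))

# Vectorized comparison: which months grew by more than 10% over the previous one?
strong_months <- which(growth_vec > 0.10) + 1
print(strong_months)
```

The loop and the one-liner compute the same thing; the vectorized form is simply shorter and closer to how one would describe the calculation in words.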

While most of the content on this blog will make references to proprietary tools and procedures like SAS/SPSS, the intent is to showcase the simplicity of the R language and not to show any other software in a poor light. If you are looking for a comparative study on which software is better for statistical computing, this site is not going to help you. In fact, the debate on which software/tool is the best for data analytics has been going on for quite some time now with no clear winner in sight. If you want my opinion on that, just stop worrying about the tool and instead focus on the design, technique or the underlying statistical concept. Once you master that, putting it on a tool becomes a formality. I read this somewhere – ‘if your only tool is a hammer, every problem in the world looks like a nail’. To know more about the comparative evaluation of statistical packages, visit the pages here, here and here.

In fact, in spite of having a lot of online support and extensibility, R still has a few limitations, such as the lack of easy interfaces for debugging and the inability of base R to handle datasets larger than the system’s RAM. As we go further in this blog, we will explore each of these limitations and look at how they can be worked around. And in cases where R does not have a solution, we will admit that other packages are better and move on.

If you liked what you’ve read and want to join/contribute, please feel free to reach out to me. If you want to follow the blog and learn more about R, kindly click on the ‘follow’ button on the left side of the page. You can join through Google or follow me on Facebook here. Comments/suggestions for improvements are always welcome.