A friend of mine, also an R enthusiast, came to me with a task he was working on as part of a larger activity. The task seemed quite simple: assigning a bin value to each row of a dataset based on the information in a lookup table that describes the bins.
The data table (a large table of 20,000+ rows) contains a variable called ‘indep1’, whose values range from -30 to 280. The bins are described in the lookup table, with bin numbers in increasing order of ‘min_value’/’max_value’. The required output would be something like this:
An extra column in the data table indicating which bin indep1 belongs to. One detail to take care of: if the value of the variable falls exactly on a border (say row number 4 in this case), it should go into the higher bin (bin 9 instead of bin 8).
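For concreteness, the two tables might look something like this. This is a hypothetical reconstruction: the break points, the weights, and the `id`/`bin_info` columns are assumptions, chosen only to match the description above and the column positions used in the code below.

```r
# Hypothetical reconstruction of the two tables (purely illustrative).
# Column order matters for the index-based code later: bin_num is column 2
# of the lookup table and indep1 is column 2 of the data table.
lookup_table <- data.frame(
  min_value  = c(-30, 0, 30, 60, 90, 120, 150, 180, 210, 230),
  bin_num    = 1:10,
  max_value  = c(0, 30, 60, 90, 120, 150, 180, 210, 230, 280),
  bin_info   = paste0("bin_", 1:10),   # assumed label column
  bin_weight = seq(0.1, 1, by = 0.1)   # assumed weights
)

set.seed(42)
data_table <- data.frame(
  id     = 1:25000,
  indep1 = round(runif(25000, min = -30, max = 280), 2)
)
```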
It seemed like a simple but interesting puzzle to solve in R. This post covers the following topics:
- Excel-style VLOOKUP in R
- Range lookup in R, similar to VLOOKUP in Excel
- Comparison of all the lookup functions
Lookup in R
Due to paucity of time, the first solution my friend tried was the non-algorithmic, brute-force approach: iterating through all the bins (1 to 10) for all 20,000 rows of data and assigning the bin numbers to the data table. Something like this:
```r
### Brute force method
full_iterate_way <- function(data, lookup){
  data$bin_num <- 0
  for(j in 1:nrow(lookup)){
    minval <- lookup[j, "min_value"]
    maxval <- lookup[j, "max_value"]
    label  <- lookup[j, "bin_num"]
    for(k in 1:nrow(data)){
      if(data[k, "indep1"] >= minval & data[k, "indep1"] < maxval){
        data[k, "bin_num"] <- label
      }
    }
  }
  data
}

data_full <- full_iterate_way(data = data_table, lookup = lookup_table)
```
The function iterates through all 20,000 rows of the data table 10 times to assign the bin values. It does what is needed. However, any programmer looking at the code would immediately have apprehensions about its performance owing to the two nested for-loops. And thus began the programmer’s quest for alternative, faster code that would do the same.
If you had the data in Excel, you would immediately know that this can be achieved using the RANGE LOOKUP mode of the powerful VLOOKUP function, since the lookup table is anyway in increasing order of bins. What’s more, if you needed a column other than bin_info (say bin_weight) on the data_table, it would be a matter of just changing the third argument of VLOOKUP to get the desired column. So, the first improvement over the brute force would be to replicate VLOOKUP (range lookup rather than exact lookup) in R. Something like this:
```r
### Range lookup, similar to VLOOKUP's range lookup in Excel
rngLookup <- function(value, dataFrame, column){
  # Take the requested column from the first bin whose min_value exceeds
  # 'value', and step back one bin
  retVal <- dataFrame[value < dataFrame[, "min_value"], column][1] - 1
  # If no min_value exceeds 'value', it belongs to the last bin
  if(is.na(retVal)){ retVal <- nrow(dataFrame) }
  retVal
}

lookup_way <- function(data, lookup){
  data$bin_num <- NA  # pre-allocate the result column
  for(i in 1:nrow(data)){
    data$bin_num[i] <- rngLookup(data[i, 2], dataFrame = lookup, column = 2)
  }
  data
}

data_lookedup <- lookup_way(data_table, lookup = lookup_table)
```
Since we now have a function to do the lookup, we can call it for every row of the data frame, eliminating one ‘for’ loop. ‘data_lookedup’ now contains the same information as ‘data_full’. Just for the record, replacing the ‘<’ sign in the lookup function with ‘==’ essentially gives you Excel’s exact-match VLOOKUP in R (the ‘-1’ offset and the last-bin fallback would then have to go).
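A minimal sketch of that exact-match variant (hypothetical, not part of the original exercise; it matches on bin_num only for illustration):

```r
# Exact-match lookup, analogous to VLOOKUP(..., FALSE) in Excel: returns the
# requested column from the first row whose bin_num equals 'value', or NA
# when there is no exact match
exactLookup <- function(value, dataFrame, column){
  dataFrame[value == dataFrame[, "bin_num"], column][1]
}

exactLookup(5, dataFrame = lookup_table, column = 5)  # bin_weight of bin 5
```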
Although performance improved slightly with the lookup function, this is still not an optimal way of going about things in R, mainly because we are still stuck in the looping paradigm instead of using the powerful vectorization and subsetting capabilities that R offers. So we explore further and arrive at the next way of doing the same thing: the ubiquitous and powerful SQL.
```r
library(sqldf)

sql_way <- function(data, lookup){
  data <- sqldf("select A.*, B.bin_num
                 from data A
                 left join lookup B
                 on (A.indep1 >= B.min_value and A.indep1 < B.max_value)")
  data
}
```
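Called like the earlier snippets (the result name here is just an assumption for consistency):

```r
data_sql <- sql_way(data = data_table, lookup = lookup_table)
```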
The sqldf library allows SQL queries directly on R data frames, and this is one of the most elegant and efficient solutions I have come across for doing a range lookup in R. The performance improvement is substantial, as can be seen in the summary in the comparison section below. However, the one thing that made me explore further for an even better alternative was that I was not convinced that a language like R, acclaimed as one of the best for statistical analysis, had no native function for this simple task. And then I stumbled upon R’s native findInterval():
How could I miss something as trivial and intuitive as this in the first place? It was practically the perfect answer to the question we asked, and it is as native as R itself! So here is the simple code which does what we had been trying to achieve all along, the one I would prefer to use when the bin numbers start from ‘1’ and go up to ‘10’ as in the example above:
```r
### Native R: findInterval returns the bin index directly
find_interval <- function(data, lookup){
  # For each value, findInterval returns the index of the last min_value
  # that does not exceed it; that index is exactly the bin number here
  data$bin_num <- findInterval(x = data$indep1, vec = lookup$min_value)
  data
}

data_interval <- find_interval(data = data_table, lookup_table)
```
Comparison of the lookup functions
We now have four functions which do the exact same thing, and just by looking at them, we can expect the latter two to be more elegant than the earlier ones. However, let us also use the system.time function to see how each of them performs when run on a data frame of 25,000 rows:
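A sketch of how the comparison can be run (using the hypothetical data_table from earlier; the numbers quoted below are from the original run):

```r
# Time each of the four approaches on the same 25,000-row data frame
system.time(data_full     <- full_iterate_way(data = data_table, lookup = lookup_table))
system.time(data_lookedup <- lookup_way(data_table, lookup = lookup_table))
system.time(data_sql      <- sql_way(data = data_table, lookup = lookup_table))
system.time(data_interval <- find_interval(data = data_table, lookup_table))
```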
As expected, the brute-force method is the slowest owing to the double for-loops, and the lookup only brings the time down on a linear scale. The SQL and the native findInterval win hands down, cutting the time taken by orders of magnitude: the dual-for-loop brute-force approach took 37 seconds, the lookup reduced that only to 25 seconds, the SQL got it down to 0.2 seconds, and findInterval did it in no time at all! The little overhead in the SQL compared to findInterval is likely due to the Cartesian-style range join it has to perform.
Concluding remarks
1. Although all the functions above achieve the same result, there could be slight differences. Some of them might produce unexpected results when the variable value is at the extreme end of the lookup table. Say, if the variable value is 280 (the highest value), the brute-force approach gives a bin_num of ‘0’ (its initialization value) and the SQL method gives ‘NA’ because the join condition does not match. findInterval has no such problem, because it only compares against the bins’ lower bounds; the 10th bin is simply everything at or above 230.
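Using the hypothetical lookup_table sketched earlier, this edge case is easy to verify:

```r
# 280 sits beyond every max_value, but findInterval still assigns it to bin 10
findInterval(280, lookup_table$min_value)  # returns 10
```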
2. findInterval is not a complete lookup, because it returns only a number from 0 (or 1) up to the number of bins. Suppose, in the above example, we also wanted the ‘bin_weight’ variable along with ‘bin_num’ for all rows of indep1; findInterval alone could not achieve that, whereas there would be no such problem with the SQL method. Suppose the desired output also included the bin_weight column:
We could tweak the find_interval code to achieve this as well:
```r
library(plyr)

### findInterval plus a join to pull in the other lookup columns
find_interval <- function(data, lookup){
  data$bin_num <- findInterval(x = data$indep1, vec = lookup$min_value)
  # Join on bin_num; with the column layout used here, columns 1, 2, 3 and 7
  # of the joined result are id, indep1, bin_num and bin_weight
  data <- join(x = data, y = lookup, by = "bin_num")[, c(1, 2, 3, 7)]
  data
}

data_interval <- find_interval(data = data_table, lookup_table)
```
The addition of the join in ‘find_interval’ makes it almost equivalent to the SQL version in terms of performance and functionality, and now either of them can be used in place of the earlier brute-force approach.
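One more alternative worth sketching before concluding: a data.table rolling join can do the same range lookup. This is a hedged sketch, assuming the hypothetical tables from earlier, not something benchmarked in this post:

```r
library(data.table)

dt  <- as.data.table(data_table)
lkp <- as.data.table(lookup_table)

# Rolling join: for each indep1, take bin_num from the lookup row with the
# largest min_value not exceeding it; an exact match on a border value picks
# the higher bin, as required
dt[, bin_num := lkp[dt, on = .(min_value = indep1), roll = TRUE, x.bin_num]]
```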
Though this seemed like a very simple exercise, I found a lot of ways of doing one particular thing in R while exploring it. The best part about R is that we still cannot conclude whether this is the best way of getting the desired outcome. Purists may go ahead and suggest a merge using ‘data.table’ (as sketched above), which seems to be way faster than a regular merge, the lookup function from the qdap library, some combination of match(), and so on. However, if the code you have does what you expect it to do without being too heavy on resources, you can continue using the thing that works rather than going for the kill on optimization. If you have used any better ways to achieve the same result, you are most welcome to share them. And if you found this post useful, please let me know that as well. I’ll be back with more of these simple yet thought-provoking exercises. Have fun!