Tuesday, February 18, 2014

Binary Logistic Regression: Fast Concordance

This is a follow-up to an earlier article on concordance in binary logistic regression. You can find the original article here. In that post, I compared a few different ways of computing concordance, discordance and ties while running a binary logistic regression model in R, and the conclusion was that OptimisedConc was an accurate yet fast way to get to concordance. In this post we cover the following topics:
- Function for Fast and accurate Concordance in logit models using R
- Comparison of the fastConc function against other methods

My analyst friend wrote to me and complained that even the optimised function was not so optimised when it came to large datasets! The data frame he used had more than a million observations, and the function kept failing due to memory issues. It immediately occurred to me that the culprit was the huge matrices created inside the function. It allocates 3 matrices (initialised with zeroes), each of size (number of ones) * (number of zeros). So, if you had half a million ones and half a million zeroes in the dataset, you would need three matrices of (0.5M * 0.5M) cells each, even before the actual calculations in the ‘for’ loop began.
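
To put that in perspective, here is a rough back-of-the-envelope calculation (assuming the matrices are stored as doubles, at 8 bytes per cell):

# Approximate memory needed by OptimisedConc's three (ones x zeros) matrices
ones_count  <- 5e5                            # half a million ones
zeros_count <- 5e5                            # half a million zeros
bytes <- 3 * ones_count * zeros_count * 8     # three matrices of doubles
bytes / 1024^3                                # ~5,600 GB (about 5.5 TB) of RAM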

As we sat and discussed it, we knew that if we were to use this function on real data, the matrix allocations and the dual for-loops had to be optimised away. And being the geek that he is, my friend suggested an approach that cuts the number of ‘for’ loops from two to one. The function below, which I have called fastConc, keeps a single ‘for’ loop and uses native vectorised subsetting inside it to count the concordant and discordant pairs. It is one of the fastest functions that gives you exact concordance values, and on the performance side it holds its own against the github code, which only gives approximate concordance values:

###########################################################
# Function fastConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# It uses optimisation through subsetting
###########################################################
fastConc<-function(model){
  # Get all actual observations and their fitted values into a frame
  fitted<-data.frame(cbind(model$y,model$fitted.values))
  colnames(fitted)<-c('respvar','score')
  # Subset only ones
  ones<-fitted[fitted[,1]==1,]
  # Subset only zeros
  zeros<-fitted[fitted[,1]==0,]
  
  # Initialise all the values
  # as.numeric avoids integer overflow when the pair count exceeds R's integer limit
  pairs_tested<-as.numeric(nrow(ones))*nrow(zeros)
  conc<-0
  disc<-0
    
  # Get the values in a for-loop
  for(i in 1:nrow(ones))
  {
    conc<-conc + sum(ones[i,"score"]>zeros[,"score"])
    disc<-disc + sum(ones[i,"score"]<zeros[,"score"])
  }
  # Calculate concordance, discordance and ties
  concordance<-conc/pairs_tested
  discordance<-disc/pairs_tested
  ties_perc<-(1-concordance-discordance)
  return(list("Concordance"=concordance,
              "Discordance"=discordance,
              "Tied"=ties_perc,
              "Pairs"=pairs_tested))
}

The output of the function is identical to that of the OptimisedConc function; it returns the Concordance, Discordance and Ties as ratios rather than percentages, which is easy enough to change.
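
For illustration, here is a minimal sketch of calling fastConc() on a glm() fit; the simulated data and variable names below are only a stand-in for a real dataset:

# Simulate a low-response dataset and fit a binomial glm (illustrative only)
set.seed(1)
x <- rnorm(20000)
y <- rbinom(20000, 1, plogis(-2.5 + 0.8 * x))
logit_mod <- glm(y ~ x, family = binomial)

res <- fastConc(logit_mod)
res$Concordance        # ratio between 0 and 1
res$Concordance * 100  # as a percentage, if preferred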

Performance of the function
Intuitively, fastConc() should do better on memory than OptimisedConc simply because it accumulates the concordant and discordant counts in scalar variables rather than in big matrices. So, how do all these functions match up on time? To check, I used a dataset with 20,000 observations, of which 2,000 were ones and 18,000 were zeros (a very low response model, you might say). That gives a total of 18,000 * 2,000 = 36,000,000 pairs to be tested. And these are the results of the functions:

> system.time(bruteforce(logit_mod))
   user  system elapsed 
4291.10    6.12 4479.85 

> system.time(OptimisedConc(logit_mod))
   user  system elapsed 
 221.98    0.45  223.69 
 
> system.time(fastConc(logit_mod))
   user  system elapsed 
   0.69    0.00    0.69 

As can be seen, bruteforce() took more than an hour to give me the concordance results! I had almost given up when system.time() finally returned a value. OptimisedConc does a lot better on time at under 4 minutes, but it is still terrible in terms of memory utilisation. fastConc() gives me the same result within a second, thanks to the native vectorised operations it uses, and it consumes negligible memory.

So, the verdict is clear. There will be lots of situations like this, where several approaches seem to work and produce the same output. However, it is usually best to choose the one that relies on native functions rather than memory-heavy or hand-rolled alternatives. If you ran logistic regression in other tools like SAS, you would not even have to worry about this, because concordance is already implemented there using ready-made native routines, which tend to be well optimised. As for concordance in R, fastConc() is now my go-to function every time I run glm(), because of its sheer efficiency. If you have run into a situation where you had to use non-native functions to accomplish a task, let me know in the comments. I’ll be back with more posts soon. Till then, take care!

4 comments:

  1. I am getting NA as output of that function.. I gave the output of my glm() function as input to fastConc()

  2. @ efg : I am not sure why that is the case, have you called the function as:
    fastConc(glm())

    or something similar?

    If yes, can you share the output of

    summary(glm())

    over here?

  3. Hi Shashia, I'm also getting NA's as output. So, I was wondering if you have solved this?

  4. this is anonymous from 28/8; my NA's were due to integer overflow in the pairs_tested line; nrow(ones)*nrow(zeros) is greater than my machine's max integer value.
