Wednesday, January 15, 2014

Binary Logistic Regression in R: Concordance and Discordance

Logistic regression might not be the trendiest technique in the analytics industry anymore, but it is still bread and butter for most analytics folks, especially in the marketing decision sciences. Propensity models, survival analysis, churn measurement and the like are largely driven by this traditional yet powerful statistical technique.

A lot of material is available online to get started with building logistic regression models and getting the model fit criteria satisfied. If you are totally new to building logistic regression models, an excellent place to start is the UCLA help articles on building these binary logit models. Even before the model building stage, certain pre-processing and variable selection procedures must be followed in order to get good results; these will be the subject of a separate post. In this post we will cover some of the important model fit measures, like concordance and discordance, along with other association measures, like Somers' D, gamma and Kendall's Tau-a, which compare the predicted responses to the actual responses.

The following questions will be answered during the course of this article:
  • Measuring concordance and discordance for logistic regression in R 
  • Somers' D, gamma and Kendall's Tau-a statistics in R

Concordance and Discordance in R
The most widely used way to run a logit model in R is the glm() function with family='binomial'. So, if you want to run a logistic regression model on the hypothetical admissions dataset (available on the UCLA website here), all you need to do is load the data set into R and run the binary logit using the following code:

# Clear workspace objects
rm(list=ls())

# Load the modelling dataset into workspace
model_data<-read.csv('binary.csv',header=T,sep=',',fill=T)

# Run a binary logistic regression model
logit_mod<-glm(formula=admit~gre+gpa+rank,
               family='binomial',data=model_data)
# Display the summary
summary(logit_mod)

This is how the model summary looks:

Since all the coefficients are significant and the residual deviance is lower than the null deviance, we can conclude that we have a fair model. But looking at the model result this way, it is difficult to say how well the model performs. In OLS regression, R-squared and its more refined cousin, adjusted R-squared, are the 'one-stop' metrics that immediately tell us whether the model is a good fit. And since the value lies between 0 and 1, it can easily be converted to a percentage and passed off as 'model accuracy' for beginners and the not-so-math-oriented business audience. Unfortunately, adjusted R-squared is largely irrelevant for logistic regression, because we model the log of the odds and the measure becomes very difficult to explain.

This is where concordance steps in to help. Concordance tells us, in percentage terms, how well the fitted values agree with the actual responses. It is defined as the ratio of the number of 1-0 pairs in which the '1' received a higher model score than the '0', to the total number of 1-0 pairs possible. For example, if the data contain 100 ones and 300 zeros, there are 100 × 300 = 30,000 pairs to compare. A higher value of concordance (60-70%) means a better fitted model. However, a very large value (85-95%) could also suggest that the model is over-fitted and needs to be re-examined before being used to explain the entire population.
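
For the admissions data loaded above, the number of 1-0 pairs that concordance is computed over can be counted directly; a quick sketch using the model_data frame from earlier:

# Count the 1-0 pairs over which concordance will be evaluated
n_ones<-sum(model_data$admit==1)
n_zeros<-sum(model_data$admit==0)
n_ones*n_zeros   # total number of 1-0 pairs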

A straightforward, non-optimal, brute-force approach to computing concordance is to write the following function after building the model:

###########################################################
# Function Bruteforce : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# It uses the brute force method of two for-loops
###########################################################
bruteforce<-function(model){
  # Get all actual observations and their fitted values into a frame
  fitted<-data.frame(cbind(model$y,model$fitted.values))
  colnames(fitted)<-c('respvar','score')
  # Subset only ones
  ones<-fitted[fitted[,1]==1,]
  # Subset only zeros
  zeros<-fitted[fitted[,1]==0,]
  
  # Initialise all the values
  pairs_tested<-0
  conc<-0
  disc<-0
  ties<-0
  
  # Get the values in a for-loop
  for(i in 1:nrow(ones))
    {
    for(j in 1:nrow(zeros))
      {
      pairs_tested<-pairs_tested+1
      if(ones[i,2]>zeros[j,2]) {conc<-conc+1}
      else if(ones[i,2]==zeros[j,2]){ties<-ties+1}
      else {disc<-disc+1}
      }
  }
  # Calculate concordance, discordance and ties
  concordance<-conc/pairs_tested
  discordance<-disc/pairs_tested
  ties_perc<-ties/pairs_tested
  return(list("Concordance"=concordance,
              "Discordance"=discordance,
              "Tied"=ties_perc,
              "Pairs"=pairs_tested))
  }

All this code does is iterate through every possible 1-0 pair and check whether the model score of the '1' is greater than the model score of the '0'. Based on this comparison, it classifies each pair as concordant, discordant or tied, and the final counts are returned as fractions (easily read as percentages) of the total number of pairs tested. When this code is run on our model (the call is sketched below), we see the following output on the console:
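
For reference, the call itself is just the fitted glm object passed to the function; a minimal sketch, assuming the logit_mod object built earlier:

# Concordance, discordance and ties for the fitted logit model
bruteforce(logit_mod)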

As can be seen, the model reports a concordance of 69.2%, which tells us that the model fits fairly well.

Although the above code gets the job done, it is a real burden on system resources because of the two nested for-loops and the complete lack of optimization. As the modelling data set grows in size, this function can take a very long time to run and may even crash the R process altogether.

Alternatively, the following function, provided by fellow blogger Vaibhav here, can be used. It pre-allocates zero matrices and lets R's vectorised sum() do the counting, so it gives the same result in less computation time. The code (originally posted at the above link) is:

###########################################################
# Function OptimisedConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# Although it still uses two-for loops, it optimises the code
# by creating initial zero matrices
###########################################################
OptimisedConc=function(model)
{
  Data = cbind(model$y, model$fitted.values) 
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc=matrix(0, dim(zeros)[1], dim(ones)[1])
  disc=matrix(0, dim(zeros)[1], dim(ones)[1])
  ties=matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1])
  {
    for (i in 1:dim(ones)[1])
    {
      if (ones[i,2]>zeros[j,2])
      {conc[j,i]=1}
      else if (ones[i,2]<zeros[j,2])
      {disc[j,i]=1}
      else if (ones[i,2]==zeros[j,2])
      {ties[j,i]=1}
    }
  }
  Pairs=dim(zeros)[1]*dim(ones)[1]
  PercentConcordance=(sum(conc)/Pairs)*100
  PercentDiscordance=(sum(disc)/Pairs)*100
  PercentTied=(sum(ties)/Pairs)*100
  return(list("Percent Concordance"=PercentConcordance,"Percent Discordance"=PercentDiscordance,"Percent Tied"=PercentTied,"Pairs"=Pairs))
}

This code does the same thing as above, but it fills matrices that were initialised with zeroes instead of incrementing counters inside the loops. The output and the measures for concordance, etc. are exactly the same as in the brute-force approach, while the toll on system resources is much lower because the code plays to R's strengths. Now, just for the sake of comparison, let us see how much time each function takes to execute. We use the system.time() function to evaluate this:
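
A minimal sketch of that comparison, assuming the logit_mod object built earlier, would be:

# Time both implementations on the same fitted model object
system.time(bruteforce(logit_mod))
system.time(OptimisedConc(logit_mod))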


The second function does the same thing as the first in only about 10% of the time! That is what pre-allocation and R's vectorised operations can do.

Of course, other functions can be written that approximate the value of concordance instead of calculating it exactly over all possible 1-0 pairs. One of the most frequently returned results when you search for concordance is the following link on GitHub. That code performs even better than the optimised function above, but the catch is that it is not exact: it approximates the number of 1-0 pairs on the assumption that the data has roughly as many ones as zeroes. If you calculate the concordance of the above model using that function, this is what you get:


The code gives a higher value for concordance (70.8%) than the actual value (69.2%). The estimate can also become wildly inaccurate, for instance if the data were sorted so that all the top-scoring ones sit at the top of the data set, in which case the reported concordance would reach an unusually high value. Its one advantage is speed: it can be used to get a rough idea of the range in which the actual concordance lies, and it does not even take a second to do that. My vote would still be for the OptimisedConc function.
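
Just to illustrate the general idea of trading exactness for speed (this is not the code from the GitHub link, merely a hypothetical sketch with made-up names), concordance can also be estimated by scoring a random sample of 1-0 pairs instead of all of them:

# Illustrative sketch only: estimate concordance from randomly sampled 1-0 pairs.
# approxConc and n_samples are hypothetical names, not taken from the linked code.
approxConc<-function(model, n_samples=100000){
  fitted<-data.frame(resp=model$y, score=model$fitted.values)
  ones<-fitted$score[fitted$resp==1]
  zeros<-fitted$score[fitted$resp==0]
  # Draw random 1-0 pairs with replacement instead of enumerating every pair
  s1<-sample(ones, n_samples, replace=TRUE)
  s0<-sample(zeros, n_samples, replace=TRUE)
  # Proportion of sampled pairs where the one outscores the zero (ties ignored)
  mean(s1>s0)
}
approxConc(logit_mod)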

Somers' D, Gamma, Kendall's Tau-a statistics in R
Once the total number of pairs, concordant pairs, discordant pairs and tied pairs are obtained, calculating the statistics above is straightforward. Gamma (better known as the Goodman and Kruskal gamma) is a measure of association in a doubly ordered contingency table. Refer here for more info. It can be calculated as:

Gamma = (P - Q) / (P + Q)

where P is the number of concordant pairs, Q is the number of discordant pairs and T (used below) is the number of tied pairs. Gamma measures the excess of concordant pairs over discordant pairs as a proportion of all untied pairs, and so reflects how well the model separates the ones from the zeros.

Somers' D is similar to gamma, except that it keeps the tied pairs in the denominator, so if there are tied pairs Somers' D will usually be lower than gamma. It can be calculated as:

Somers' D = (P - Q) / (P + Q + T)

Both gamma and Somers' D can range from -1 to +1, although for any sensible model they sit between 0 and 1, and the higher the value, the better the model's ability to distinguish the ones from the zeros.

Kendall's Tau-a is one more measure of association for the model. It can be computed using the following formula:

Tau-a = 2 * (P - Q) / (N * (N - 1))

where N is the total number of observations in the model. It also lies between -1 and +1; however, for any given model, Kendall's Tau-a will be much smaller than gamma or Somers' D, because Tau-a uses all possible pairs of observations in its denominator while the other two use only the 1-0 pairs.
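
As a quick sanity check, if the Hmisc package is installed, its somers2() function computes the concordance index C and Somers' Dxy directly from the fitted probabilities and the observed responses, and should agree closely with the figures derived above (again assuming the logit_mod object from earlier):

# Optional cross-check with the Hmisc package
# install.packages('Hmisc')   # if not already installed
library(Hmisc)
somers2(logit_mod$fitted.values, logit_mod$y)   # returns C, Dxy, n and Missing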

Once we know these definitions, we can modify the above function OptimisedConc to return these values as well, by replacing its last few lines (from the percentage calculations through the return statement) with the following:


  PercentConcordance=(sum(conc)/Pairs)*100
  PercentDiscordance=(sum(disc)/Pairs)*100
  PercentTied=(sum(ties)/Pairs)*100
  N<-length(model$y)
  # Gamma ignores tied pairs; Somers' D keeps them in its denominator
  gamma<-(sum(conc)-sum(disc))/(Pairs-sum(ties))
  Somers_D<-(sum(conc)-sum(disc))/Pairs
  k_tau_a<-2*(sum(conc)-sum(disc))/(N*(N-1))
  return(list("Percent Concordance"=PercentConcordance,
              "Percent Discordance"=PercentDiscordance,
              "Percent Tied"=PercentTied,
              "Pairs"=Pairs,
              "Gamma"=gamma,
              "Somers D"=Somers_D,
              "Kendall's Tau A"=k_tau_a))
And the call to the function would return:


This post covered one of the practical considerations to be taken into account while running predictive models in R. In upcoming posts, I plan to cover some ways the above outputs can be beautified using HTML, along with other practical considerations while modeling in R. If you liked this post or found it useful, you can give me a thumbs up using comments/likes. I'll be back with more on these areas of predictive modeling soon. Till then, happy modeling :)

Update: 18 Feb 2014
A follow-up to this article has been published today. Although OptimisedConc works well to save time, it is very poor in terms of memory utilization, and hence a better function named 'fastConc' has been written, which makes use of R's native functionality.
You can find the new article and the function at this link.

7 comments:

  1. Hello,

    I am working on a R video project. It is supposed to have R video tutorials. Could I please use your codes in the videos with proper citation? Please let me know. I shall be grateful.

    Thanks and regards,

    Sayantee

    1. Hi Sayantee,

      Thanks for dropping by.
      Yes, please go ahead and use it with proper citations.
      Do let me know how the video tutorials turn out in the end. All the best!

      Regards,
      Shashi

  2. Usually I never comment on blogs but your article is so convincing that I never stop myself to say something about it. You’re doing a great job Man,Keep it up. Bin lookup

  3. a Perfect Explanation.. Loved it..

    Following your blog now.. :)

  4. super Explanation.

  5. Sir can you pls tell what is 'model' in this function?

    1. Hello, the 'model' is the argument you pass to the function. In this case, you would pass the 'logit_mod' object!
