Blog [R]

Replicating PROC FASTCLUS in R using kmeans

2015-05-26T13:16:00.002+05:30

A lot of businesses have bought into the idea of making decisions driven by data. And R is one of the foremost statistical tools helping these executives take those ‘data-driven’ decisions. A flip side of being data-driven in your approach is that you get accustomed to looking at a certain type of data in a particular format, and freak out at even a slight deviation from this standard. Hence, even if far better methods are available to solve a problem, data scientists must usually subscribe to what is widely accepted in the industry.

It is in this context that we explore the classical analytics problem of classification. There are new ‘state of the art’ machine learning/neural network algorithms in this space, like trees, random forests, Bayesian networks, fuzzy logic, etc., and R has an implementation for each of these. However, even before we venture into these classification techniques (which need labeled data), we would want to find labels in the data using segmentation techniques. To this end, the one which still finds wide use in the industry is the ‘k-means’ clustering solution, and its proprietary implementation in SAS: PROC FASTCLUS. In this post, we discuss how to replicate the FASTCLUS procedure in R. This post covers the following topics:

- Kmeans clustering in R
- Replicating SAS PROC FASTCLUS in R
- Statistics for variables, Total STD, Within STD, R-Square and RSQ/(1-RSQ) ratios in R kmeans
- Visualization of k-means cluster results

K Means clustering in R

R implements the k-means solution in the function kmeans. At the most basic level, k-means is a minimization problem: it tries to partition ‘n’ observations into ‘k’ clusters such that the within-cluster sum of squares is as small as possible. It might not be the most efficient way to cluster data when you know nothing about the data, but if you have an idea that there are fixed patterns or a finite number of segments in the data, you can use k-means to validate this intuition. Usually, a k-means solution is run after a more generalized hierarchical clustering technique. You can read more about the details of the algorithm, its drawbacks and overall efficiency here.

If you want to run k-means clustering on the famous Fisher’s Iris data on R, you just have to use the command:

kmeans(x = iris[-5],centers = 3,iter.max = 100)

and right away, you’ll have the following output on the console:

Note: You might get a different solution, because k-means is an optimizing algorithm which is highly dependent on the initial seeds. Run the command two or three times and you should arrive at the same solution. You could also call ‘set.seed(XX)’ with a suitable value beforehand to always get the solution with 50, 62 and 38 observations in the three clusters.

The ‘cluster means’ give the mean of each variable within each cluster. The clustering vector tells us the cluster to which every observation belongs. The ratio (between_SS/total_SS = 88.4%) tells us that 88.4% of the total sum of squares lies between the clusters and only 100-88.4 = 11.6% lies within them. This tells us that the clusters are fairly tightly packed, a desired objective of k-means clustering. The higher the between_SS/total_SS ratio (also known as the overall R-SQUARE), the better our cluster solution.
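
For a repeatable run, a minimal sketch is given below. The seed value 42 and the nstart = 25 argument are my own arbitrary choices and not part of the command above. It fixes the seed, asks kmeans to try several random starts, and recomputes the between_SS/total_SS ratio from the returned object:

```r
# Fix the random seed so the run is repeatable (42 is an arbitrary choice)
set.seed(42)

# nstart = 25 tries 25 random sets of starting centers and keeps the best one
fit <- kmeans(x = iris[-5], centers = 3, iter.max = 100, nstart = 25)

# Cluster sizes, sorted because the cluster numbering itself is arbitrary
cluster_sizes <- sort(fit$size)

# Overall R-square: share of the total sum of squares that lies between clusters
overall_rsq <- fit$betweenss / fit$totss

print(cluster_sizes)
print(round(overall_rsq, 3))
```

Using several starts is simply a convenient way to avoid re-running the command by hand until the 50/62/38 solution shows up.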

Replicating SAS PROC FASTCLUS in R

Due to its dominance in business circles, the ODS output of SAS is something that most people are accustomed to looking at. If you ran PROC FASTCLUS in SAS on the same famous IRIS data, this is how you would do it:

proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
run;

And this is what the output would look like:

The first thing you notice is that there is a lot of output as compared to R. And it might be a bit overwhelming! But if you look at it closely, the crux of the output still remains that the final solution of 3 clusters has 38, 50 and 62 observations and the overall R_SQUARE value is 88.4%, both of which were already reported in R. However, the statistics for variables, the pseudo-F, CCC, cluster means/std deviations, etc are some of the additional outputs which SAS presents in a nice format and something which the businesses have been used to looking at.

Although it is not possible to replicate the SAS result 100% because the initial seeds chosen by SAS and R would vary considerably, there are some statistics like the RSQ/(1-RSQ) ratio per variable and the pseudo-F, which would definitely enhance the R output, and help to take a call on what variables are performing better in terms of separating the observations. I have tried to search for ready-made packages that could help but there seems to be none as of now. However, there is some help in the materials presented here and here which are the basis for the code below:

In order to replicate the output of SAS FASTCLUS as closely as possible, we first export SAS’s IRIS data (which has the variables SepalLength, SepalWidth, PetalLength and PetalWidth in mm, whereas the IRIS data in the R datasets package has them in cm), then import it into R and run k-means:

sas_iris<-read.csv(file = 'sas_iris.csv',header = T,sep = ',',fill = T)
sas_clus<-kmeans(x = sas_iris[-1],centers = 3,iter.max = 100)
sas_clus

would produce the following output as we’ve seen before:

We see that although the absolute values for the within_SS have changed (due to change in variable scale), the overall R-SQ value still remains at 88.4% with the observations grouped into 3 clusters of 62, 38 and 50 as before.
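
The reason the R-SQ survives the change of units can be sketched as follows. Since the SAS export is not reproduced here, the sketch assumes that multiplying R's built-in iris measurements by 10 (cm to mm) is a fair stand-in for it; the seed and nstart values are arbitrary:

```r
# Stand-in for the SAS export (an assumption): R's iris converted from cm to mm
iris_cm <- iris[-5]
iris_mm <- iris_cm * 10

set.seed(42)
fit_cm <- kmeans(iris_cm, centers = 3, iter.max = 100, nstart = 25)
set.seed(42)
fit_mm <- kmeans(iris_mm, centers = 3, iter.max = 100, nstart = 25)

# Every squared distance grows by 10^2, so the sums of squares do as well
ss_scale <- fit_mm$tot.withinss / fit_cm$tot.withinss

# The ratio of two sums of squares is unaffected by the change of units
rsq_cm <- fit_cm$betweenss / fit_cm$totss
rsq_mm <- fit_mm$betweenss / fit_mm$totss
print(c(ss_scale = ss_scale, rsq_cm = rsq_cm, rsq_mm = rsq_mm))
```

This only holds because all four variables were rescaled by the same factor; rescaling a single variable would change the solution.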

Statistics for variables and RSQ/(1-RSQ) ratio

We now know that the variables SepalLength, SepalWidth, PetalLength and PetalWidth can together create a great deal of separation between the species. But is there a way to know statistically which of these variables provides the highest degree of separation and which the least? This is where the table with statistics for variables comes in, something that the R output seems to miss out. According to the post here, a good way to find out would be to run a simple linear regression of the variable against the classified cluster and use the adjusted R-Square as a proxy for the strength of the variable:

Something like this:

sas_iris$classified<-sas_clus$cluster
summary(lm(formula = SepalLength~classified,data = sas_iris))
summary(lm(formula = SepalWidth~classified,data = sas_iris))
summary(lm(formula = PetalLength~classified,data = sas_iris))
summary(lm(formula = PetalWidth~classified,data = sas_iris))

Although it gives the R-Square, the variation is just too high to make any inference w.r.t strength of the variable.
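
One possible reason for the erratic numbers is that ‘classified’ is stored as a number, so lm() fits a straight line through cluster labels whose order is arbitrary. A minimal sketch of the difference, assuming R's built-in iris as a stand-in for the SAS export (so the column is Petal.Length rather than PetalLength), treats the cluster as a factor instead:

```r
# Stand-in data (an assumption): R's built-in iris instead of the SAS export
set.seed(42)
fit <- kmeans(iris[-5], centers = 3, iter.max = 100, nstart = 25)
dat <- iris[-5]
dat$classified <- fit$cluster

# Cluster label as a number: lm() fits a slope through arbitrary labels 1, 2, 3
rsq_numeric <- summary(lm(Petal.Length ~ classified, data = dat))$r.squared

# Cluster label as a factor: a one-way ANOVA with one mean per cluster
rsq_factor <- summary(lm(Petal.Length ~ factor(classified), data = dat))$r.squared

# The factor version equals 1 - (within-cluster SS / total SS) for that variable
wss <- sum(tapply(dat$Petal.Length, dat$classified,
                  function(x) sum((x - mean(x))^2)))
tss <- sum((dat$Petal.Length - mean(dat$Petal.Length))^2)
rsq_from_ss <- 1 - wss / tss

print(c(numeric = rsq_numeric, factor = rsq_factor, from_ss = rsq_from_ss))
```

With the factor coding, the R-Square no longer depends on which cluster happens to be numbered 1, 2 or 3.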

A close replica of the SAS output table can be generated by computing the Total Standard Deviation (Total STD) and the within-cluster Standard Deviation (Within STD), and then using these in the formula

Variable RSquare = 1- (within STD/Total STD)^2

and in the RSQ/(1-RSQ) ratio as well. Although the exact SAS output up to 4 decimal places cannot be reproduced, because I could not find the exact formulae used by SAS anywhere on the internet, the ones used here come close to the actual numbers and help us decide the strength of the individual variables.

sas_iris$classified<-sas_clus$cluster
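variable.stats<-function(v,classified){
  # Total standard deviation of the variable
  tot<-sd(v)
  # Pooled within-cluster standard deviation: sqrt(within SS / (n - k))
  wss<-sum(tapply(v,classified,FUN = function (x) {sum((x-mean(x))^2)}))
  wth<-sqrt(wss/(length(v)-length(unique(classified))))
  RSq<-1-(wth/tot)^2
  Ratio<-RSq/(1-RSq)
  c(tot,wth,RSq,Ratio)
}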


vapply(X = sas_iris[,2:5],FUN = variable.stats, FUN.VALUE = c(Tot.STD=0,Within.STD=0,RSQ=0,RSQRatio=0),
       classified=sas_iris$classified)

And this will give the results in a tabular format, similar to SAS.

By looking at this, we can now make out that the variable ‘PetalLength’ produces the highest degree of separation, while ‘SepalWidth’ has the least. So, if we were to iteratively drop variables from the clustering, we would do away with the ‘SepalWidth’ variable first, and so on.

The cluster means are already stored in the k-means result object. But if we want to generate the means and standard deviations of each cluster, we can do it programmatically:

# Cluster means and standard deviations 
# SD 
sapply(X = sas_iris[,2:5],FUN = tapply,sas_iris$classified,sd)
# Mean
sapply(X = sas_iris[,2:5],FUN = tapply,sas_iris$classified,mean)
# Mean is same as cluster centers
sas_clus$centers

to produce the result like this:

The Pseudo-F statistic can also be generated programmatically by using the formula:

pseudo_F <- (sas_clus$betweenss/(length(sas_clus$size)-1))/(sas_clus$tot.withinss/(sum(sas_clus$size)-length(sas_clus$size)))

and the output would be:

> pseudo_F
[1] 561.6278
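
The same formula can be wrapped in a small helper. This is only a sketch: the function name pseudo_f is made up, and R's built-in iris stands in for the SAS export. Because the statistic is a ratio of sums of squares, the cm versus mm difference should not matter as long as the same 50/62/38 partition is found:

```r
# Hypothetical helper: pseudo-F (Calinski-Harabasz) from a kmeans result
pseudo_f <- function(fit) {
  k <- length(fit$size)   # number of clusters
  n <- sum(fit$size)      # number of observations
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

# Stand-in data (an assumption): R's built-in iris, measured in cm
set.seed(42)
fit <- kmeans(iris[-5], centers = 3, iter.max = 100, nstart = 25)
pf <- pseudo_f(fit)
print(pf)
```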

As for the cubic clustering criterion and the approximate overall R-Squared, the way SAS computes the displayed results seems to be a closely guarded secret: it is not readily available on the internet, nor exactly reproducible in R. However, to get the CCC (cubic clustering criterion), we could use the package NbClust:

library(NbClust)
NbClust(data = sas_iris[,2:5],min.nc = 3,max.nc = 3,method = 'kmeans',index = "ccc")

$All.index
  nc.Ward index.CCC 
  3.00000  37.67012

This gives 37.67, which is completely different from what the SAS output reports. There are a lot of arguments about which one is correct; refer to the link here for more details. For now, let us just make peace with whatever is available in NbClust and move forward.

Visualization of k-means cluster results

An area where R has a definite edge over SAS is the visualization of the results. Although the ODS has improved the plotting options on SAS, R is way ahead when it comes to creating colorful plots. So, once we have a cluster solution, we can use the powerful visualization features in R to create pretty plots. The reference used for this section of the post can be found here

Using the code as mentioned in the above article, we can create pretty plots for the IRIS data results like this:

As we can see, these plots have used the principal components decomposition to generate a 2-d plot for the 3 clusters.
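
A minimal base-R sketch of the same idea is given below. It is not the code from the referenced article; it assumes R's built-in iris as a stand-in for the SAS export and simply projects the data onto the first two principal components before coloring the points by cluster:

```r
# Stand-in data (an assumption): R's built-in iris instead of the SAS export
set.seed(42)
fit <- kmeans(iris[-5], centers = 3, iter.max = 100, nstart = 25)

# Principal components of the four measurements
pc <- prcomp(iris[-5], center = TRUE, scale. = FALSE)
scores <- pc$x[, 1:2]

# Share of the total variance captured by the two plotted components
var_explained <- sum(pc$sdev[1:2]^2) / sum(pc$sdev^2)

# Scatter plot of the first two components, colored by cluster
plot(scores, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters on the first two PCs")
```

The var_explained value is worth checking before trusting such a plot, since a 2-d projection that captures little of the variance can hide real separation.
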
A pair-wise plot can be created to confirm the strength of each variable like this:

pairs(x = sas_iris[,2:5], col=c(1:3)[sas_iris$classified])

The output confirms that PetalLength has very high separating power: the species ‘Setosa’ (colored green) has PetalLengths between 10 and 20 mm, while Versicolor and Virginica (colored black and red respectively) have lengths from 30 to 70 mm.

Notes:
1. Although the results turn out to be similar on both software packages in this example, there might be cases where it will be impossible to match the results, as the internal implementations are very different. In some cases, even the cluster sizes will be quite different, even when you run multiple iterations.
2. A major factor that influences the results is the scaling of variables. It is always recommended to have variables on the same scale in order to arrive at optimal results.
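
As a sketch of the second note, base R's scale() can bring the variables onto a common scale before clustering. The example assumes R's built-in iris, and the seed and nstart values are arbitrary:

```r
# Standardize each variable to mean 0 and standard deviation 1
iris_scaled <- scale(iris[-5])

set.seed(42)
fit_scaled <- kmeans(iris_scaled, centers = 3, iter.max = 100, nstart = 25)

# Cluster sizes can differ from the unscaled solution
print(fit_scaled$size)
```

For iris the four variables are already in the same unit, so this matters more for data that mixes, say, amounts and counts.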

Binary Logistic regression: Fast Concordance

2014-02-18T22:52:00.000+05:30

This is a follow-up to an earlier article on concordance in binary logistic regression. You can find the original article here. In that post, I compared two or three different ways of computing concordance, discordance and ties while running a binary logistic regression model in R. The conclusion was that OptimisedConc was an accurate, yet fast way to get to concordance in R. In this post we cover the following topics:

- Function for Fast and accurate Concordance in logit models using R
- Comparison of the fastConc function against other methods

My analyst friend wrote to me and complained that even the optimized function was not so optimized when it came to large datasets! It seems the data frame that he used had more than a million observations and the function kept failing due to memory issues. It immediately occurred to me that the culprits were the huge matrices created in the function. It creates 3 matrices (initialized with zeroes), each of size (number of ones) * (number of zeros). So, if you had half a million ones and half a million zeroes in the dataset, you would need three matrices of (0.5M * 0.5M) each, even before the actual calculations in the ‘for’ loop began.
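
To put a number on it, here is the back-of-the-envelope arithmetic as a short sketch. The only assumption is that the matrices hold doubles, which is what matrix(0, ...) creates in R:

```r
# Numbers from the example above: half a million ones and half a million zeros
n_ones  <- 5e5
n_zeros <- 5e5

# matrix(0, ...) stores doubles, which take 8 bytes per cell
bytes_per_matrix <- n_ones * n_zeros * 8
total_bytes <- 3 * bytes_per_matrix   # conc, disc and ties matrices

# Convert to terabytes (10^12 bytes)
tb_per_matrix <- bytes_per_matrix / 1e12
tb_total <- total_bytes / 1e12
print(c(per_matrix_TB = tb_per_matrix, total_TB = tb_total))
```

That is about 2 TB per matrix and 6 TB in total, far beyond the memory of an ordinary machine.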

As we sat and discussed it, we knew that if we were to use this function on real data, the matrix allocations and the dual for-loops had to somehow be optimized. And being the geek that he is, my friend suggested an approach to reduce the number of ‘for’ loops from two to one. The function below, which I have called fastConc, uses a single ‘for’ loop with native subsetting and vectorized comparison inside it to count the concordant and discordant pairs. It is one of the fastest functions which can give you exact concordance values, and on the performance side it is comparable to the GitHub code, which just gives approximate concordance values:

###########################################################
# Function fastConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# It uses optimisation through subsetting
###########################################################
fastConc<-function(model){
  # Get all actual observations and their fitted values into a frame
  fitted<-data.frame(cbind(model$y,model$fitted.values))
  colnames(fitted)<-c('respvar','score')
  # Subset only ones
  ones<-fitted[fitted[,1]==1,]
  # Subset only zeros
  zeros<-fitted[fitted[,1]==0,]
  
  # Initialise all the values
  pairs_tested<-nrow(ones)*nrow(zeros)
  conc<-0
  disc<-0
    
  # Get the values in a for-loop
  for(i in 1:nrow(ones))
  {
    conc<-conc + sum(ones[i,"score"]>zeros[,"score"])
    disc<-disc + sum(ones[i,"score"]<zeros[,"score"])
  }
  # Calculate concordance, discordance and ties
  concordance<-conc/pairs_tested
  discordance<-disc/pairs_tested
  ties_perc<-(1-concordance-discordance)
  return(list("Concordance"=concordance,
              "Discordance"=discordance,
              "Tied"=ties_perc,
              "Pairs"=pairs_tested))
}

The output of the function matches that of the OptimisedConc function, except that it returns Concordance, Discordance, Ties, etc. as ratios rather than percentages, which can be easily changed.
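
For very large data, even the single loop can be avoided. The sketch below is not part of fastConc; it is an alternative I am suggesting, with a made-up name (count_conc), that sorts the scores of the zeros once and uses findInterval() to count, for every one, how many zeros score below or above it. It takes two plain score vectors rather than a glm object:

```r
# Hypothetical helper: exact concordance counts without an explicit loop.
# scores_ones / scores_zeros are fitted probabilities of the actual 1s and 0s.
count_conc <- function(scores_ones, scores_zeros) {
  z <- sort(scores_zeros)
  n0 <- length(z)
  # Zeros scoring strictly below each one (left.open = TRUE counts z < x)
  below <- findInterval(scores_ones, z, left.open = TRUE)
  # Zeros scoring at or below each one (the default counts z <= x)
  at_or_below <- findInterval(scores_ones, z)
  # as.numeric() avoids integer overflow when there are billions of pairs
  pairs <- as.numeric(length(scores_ones)) * n0
  conc <- sum(as.numeric(below))
  disc <- sum(as.numeric(n0 - at_or_below))
  list(Concordance = conc / pairs,
       Discordance = disc / pairs,
       Tied = (pairs - conc - disc) / pairs,
       Pairs = pairs)
}

# Small simulated check against a direct pairwise comparison
set.seed(1)
ones  <- round(runif(40), 1)   # rounding forces some tied scores
zeros <- round(runif(60), 1)
fast  <- count_conc(ones, zeros)
direct_conc <- mean(outer(ones, zeros, ">"))
direct_disc <- mean(outer(ones, zeros, "<"))
print(fast)
```

With a fitted model, the two inputs would be model$fitted.values split by model$y, just as fastConc does internally.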

Performance of the function

Intuitively, the function fastConc() seems to do better on memory compared to OptimisedConc, simply because it stores the concordance and discordance values in count variables rather than in big matrices. So, how do all these functions match up on time? To check, I used a dataset with 20,000 observations which had 2000 ones and 18000 zeros (a very low response model, you might say). There would be a total of (18000 * 2000) 36,000,000 pairs which need to be tested. And these are the results of the functions:

> system.time(bruteforce(logit_mod))
   user  system elapsed 
4291.10    6.12 4479.85 

> system.time(OptimisedConc(logit_mod))
   user  system elapsed 
 221.98    0.45  223.69 
 
> system.time(fastConc(logit_mod))
   user  system elapsed 
   0.69    0.00    0.69

As can be seen, bruteforce() took more than an hour to give me the concordance results! I had almost given up when the system.time() function finally returned. OptimisedConc does a lot better in terms of time (about 4 minutes), but it is pathetic in terms of memory utilization! fastConc() gives me the same result within a second, thanks to the native functions being used, and it consumes negligible memory.

So, the verdict is clear. There will be lots of situations like this, where multiple things seem to work and produce the same output. However, it is always best to choose the one which uses native functions for its implementation rather than other data-heavy or user-defined functions. If you ran logistic regression in other tools like SAS, you would not even worry about the functions, because they have already implemented it using ready-made native functions, and hence they tend to be really optimised! As for concordance in R, fastConc() now becomes my go-to function every time I run a glm() code because of its sheer efficiency. If you have had any situation where you’ve used non-native functions to accomplish a task, let me know in the comments. I’ll be back with more posts soon. Till then, take care!

Excel style VLookup and RangeLookup in R

2014-01-30T13:53:00.000+05:30

A friend of mine, also an R enthusiast, came to me with this task that he was doing as part of a larger activity. The task seemed quite simple – assigning a bin value to each row of a dataset based on the information in a lookup table which contained information on the bins:

Data table (a large table of about 20,000+ rows) contains a variable called ‘indep1’, the values of which range from -30 to 280. The information on the bins is contained in the lookup table. And the bin numbers are such that they are in the increasing order of the ‘min_value’/’max_value’. The required output would be something like this:

An extra column in the data table indicating which bin the indep1 belongs to. Just to take care of the details, in case the value of the variable is at the border (say row number 4 in this case), it should go into the higher bin (bin 9 instead of bin 8).

It seemed like a simple but interesting puzzle to solve in R. This post covers the following topics:
- Excel style Vlookup in R
- Range lookup in R similar to Vlookup in Excel
- Comparison among all the lookup functions

Lookup on R

Due to paucity of time, the first solution which the friend had tried was the non-algorithmic brute-force approach of iterating through all the bins (1 to 10) for all the 20,000 rows of data and assigning the bin numbers to the data table. Something like this:

### Brute force method
full_iterate_way<-function(data,lookup){
  data$bin_num<-0
  for(j in 1:nrow(lookup)){
    minval <- lookup[j, "min_value"]
    maxval <- lookup[j, "max_value"]
    label <- lookup[j, "bin_num"]
    
    for(k in 1:nrow(data)){
      if(data[k, "indep1"] >= minval & data[k, "indep1"] < maxval){
        data[k, "bin_num"] <- label
        }
    }
  }
  data
}

data_full<-full_iterate_way(data=data_table,lookup=lookup_table)

The function iterates through all the 20,000 rows of the data table 10 times to assign the bin value to each row. It does what is needed. However, if you were a programmer looking at the code, you would immediately have apprehensions about its performance because of the two nested ‘for’ loops. And thus began the programmer’s quest to find faster alternatives that would do the same.

If you had the data in excel, you would immediately know that this thing can be achieved using the RANGE LOOKUP property of the powerful VLOOKUP function, as the lookup table is anyway in the increasing order of bins. What’s more, if you need a column other than bin_info (say bin_weight) to be on the data_table, it would be a matter of just changing the argument 3 in Vlookup to get the desired column. So, the first improvement to the brute force would be to replicate the Vlookup (range lookup instead of the exact lookup) on R. Something like this:

rngLookup<-function(value, dataFrame,column){
  retVal<-dataFrame[value<dataFrame[,"min_value"],column][1]-1
  if(is.na(retVal)){retVal<-nrow(dataFrame)}
  retVal
}

lookup_way<-function(data,lookup){
  for(i in 1:nrow(data)){
    data$bin_num[i]<-rngLookup(data[i,2],dataFrame=lookup,column=2)
  }
  data
}

data_lookedup<-lookup_way(data_table,lookup=lookup_table)

Since we have a function to do the lookup, we can call it for every row of the data frame, eliminating one ‘for’ loop. ‘data_lookedup’ would now contain the same information as ‘data_full’. Just for the record, replacing the ‘<’ sign in the lookup function with the ‘==’ sign can give you the exact VLookup function of Excel in R.
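
For the exact-match case, base R's match() offers a loop-free route. The sketch below uses a made-up lookup table and keys purely for illustration; none of these names come from the data discussed in this post:

```r
# Hypothetical lookup table and keys, for illustration only
lookup_exact <- data.frame(code = c("A", "B", "C"),
                           weight = c(0.5, 1.0, 1.5),
                           stringsAsFactors = FALSE)
keys <- c("B", "C", "A", "Z")

# match() returns the row of the first exact match, or NA when there is none,
# much like VLOOKUP(..., FALSE) returning #N/A
rows <- match(keys, lookup_exact$code)
looked_up <- lookup_exact$weight[rows]
print(looked_up)
```

Changing the column picked on the last line plays the role of the third argument of VLOOKUP.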

Although the performance slightly improved after using the lookup function, it is still not an optimal way of going about things in R. This is mainly because we have still stuck to the programming paradigm of looping instead of the powerful vectorization and subsetting capabilities that R offers. So, we explore further and arrive at the next code to do the same thing - the ubiquitous and powerful SQL:

library(sqldf)
sql_way<-function(data,lookup){
  data<-sqldf("select A.*,B.bin_num from
              data A left join lookup B 
              ON (A.indep1 >= B.min_value and A.indep1 < B.max_value)")
  data
}

The library sqldf allows SQL code to run directly on R data frames, and this is one of the most elegant and efficient solutions I have come across for doing the range lookup in R. The improvements in performance are very substantial, as can be seen in the summary in the performance section below. However, the one thing which made me explore further for an even better alternative was that I was not convinced that a language like R, which is acclaimed to be one of the best for statistical analysis, did not have a native function to achieve this simple task. And then I stumbled upon this:

findInterval

How could I miss something as trivial and intuitive as this in the first place? It was sort of the perfect answer to the question we asked and it was as native as R itself! And so, here is the simple code which will do what we had been trying to achieve all along, the one which I would prefer to use when the bin numbers start from ‘1’ and go up to ‘10’ as in the example above:

find_interval<-function(data,lookup){
  data$label<-findInterval(x=data$indep1,vec=lookup$min_value)
  data
}
data_interval<-find_interval(data=data_table,lookup_table)
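
To see how findInterval() treats border values, here is a small sketch on a made-up lookup table. The bin boundaries below are hypothetical and are not the ones from the actual lookup table:

```r
# Hypothetical lookup table: 10 bins, each 30 wide, starting at -30
lookup_demo <- data.frame(bin_num = 1:10,
                          min_value = seq(-30, 240, by = 30))

# Test values: lowest value, just below a border, on a border, highest value
x <- c(-30, 29.9, 30, 280)

# findInterval() counts how many min_value entries are <= x,
# so a value sitting on a border falls into the higher bin
bins <- findInterval(x = x, vec = lookup_demo$min_value)
print(bins)
```

A value below the first min_value would come back as 0, which is worth checking for if the data can fall outside the table.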

Comparison of the lookup functions

We now have 4 functions which do the exact same thing, and just by looking at them, we can assume the latter 2 to be more elegant than the earlier ones. However, let us also use the system.time function to see how each one of them performs when run on a data frame of 25000 rows:

As expected, the brute force method is the slowest owing to the double for-loops, and the lookup has only decreased the time by a modest fraction. The SQL and the native findInterval win hands down, bringing the time down by orders of magnitude. The 'dual for-loop' brute force approach took 37 seconds, while the lookup only reduced it to 25 seconds. The SQL got it down to 0.2 seconds and findInterval did it in no time at all! The small overhead of the SQL compared to findInterval could be because of the Cartesian product table join it needs to perform.

Concluding remarks

1. Although all the functions above achieve the same result, there could be slight differences. Some of the functions might produce unexpected results when the variable value is at the extreme end of the lookup table. Say, if the variable value is 280 (the highest value), the brute force approach gives a bin_num of ‘0’ due to initialization and the SQL method gives a value of ‘NA’ because the join conditions do not match. However, findInterval has no such problem, because it does the comparison only till the 9th bin and the 10th bin is anything greater than 230.

2. The findInterval is not a complete lookup because it can return only numbers starting from 0/1 to the number of bins. Suppose in the above example, we also wanted to have the ‘bin_weight’ variable along with the ‘bin_num’ variable for all rows of indep1, then findInterval would not be able to achieve that, but there would be no such problem if we used the SQL method. Suppose we wanted to have the desired output (adding even the bin_weight column in output):

We could tweak the find_interval code to achieve this as well:

library(plyr)
find_interval<-function(data,lookup){
  data$bin_num<-findInterval(x=data$indep1,vec=lookup$min_value)
  data<-join(x=data,y=lookup,by="bin_num")[,c(1,2,3,7)]
  data
}
data_interval<-find_interval(data=data_table,lookup_table)

The addition of the join statement in 'find_interval' makes it almost equivalent to the SQL method in terms of performance and functionality, and now either of them can be used in place of the earlier brute force approach.
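
When the bin numbers run from 1 to n in the same order as the rows of the lookup table, the result of findInterval() can also be used directly as a row index, without any join. A minimal sketch on made-up data (the table, values and weights are all hypothetical):

```r
# Hypothetical lookup table with a weight per bin (rows in increasing bin order)
lookup_demo <- data.frame(bin_num = 1:4,
                          min_value = c(-30, 50, 130, 210),
                          bin_weight = c(0.1, 0.2, 0.3, 0.4))
data_demo <- data.frame(indep1 = c(-30, 49, 50, 280))

# Row position in the lookup table for every value of indep1
idx <- findInterval(x = data_demo$indep1, vec = lookup_demo$min_value)

# Use the position to pull any column of the lookup table
data_demo$bin_num <- lookup_demo$bin_num[idx]
data_demo$bin_weight <- lookup_demo$bin_weight[idx]
print(data_demo)
```

This sketch does not handle values below the first min_value, which would give position 0 and break the assignment.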

Though this seemed like a very simple exercise, I found a lot of ways to do one particular thing in R while exploring it. The best part about R is that we still cannot conclude that this is the best way of getting the desired outcome. Purists may go ahead and suggest a merge using ‘data.table’, which seems to be way faster than a regular merge, the lookup function from the library qdap, or some combination of match(), etc. However, if you find that the code you have does what you expect it to do without being too heavy on resources, you can continue using the thing which works rather than going for the kill on optimization. If you have used any better ways to achieve the same result, you are most welcome to share them. And if you found this post useful, please let me know that as well. I’ll be back writing more on these simple yet thought-provoking exercises. Have fun!

Binary logistic Regression on R : Concordance and Discordance

2014-01-15T15:11:00.000+05:30

Logistic regression might not be the most trending technique in the analytics industry anymore. But it is still bread and butter for most analytics folks, especially in the marketing decision sciences. Most propensity models, survival analyses, churn measurements, etc. are driven by this traditional yet powerful statistical technique.

A lot of material is available online to get started with building logistic regression models and getting the model fit criterion satisfied. If you are totally new to building logistic regression models, an excellent point to start off would be the UCLA help articles on building these binary logit models. Even before getting to the model building stage, some of the pre-processing and variable selection procedures must be followed in order to get good results, which would be the subject of a separate post. In this post we will cover some of the important model fit measures like Concordance, discordance, and other association measures like Somers D, gamma and Kendall’s Tau A which compare the predicted responses to actual responses.

The following questions will be answered during the course of this article:

- Measures for logistic regression: Concordance and discordance in R
- Somers' D, Gamma, Kendall’s Tau-a statistics in R

Concordance and Discordance in R

The most widely used code to run a logit model in R would be the glm() function with the ‘binomial’ family. So, if you wanted to run a logistic regression model on the hypothetical dataset (available on the UCLA website here), all you need to do is load the data set in R and run the binary logit using the following code:

# Clear workspace objects
rm(list=ls())

# Load the modelling dataset into workspace
model_data<-read.csv('binary.csv',header=T,sep=',',fill=T)

# Run a binary logistic regression model
logit_mod<-glm(formula=admit~gre+gpa+rank,
               family='binomial',data=model_data)
# Display the summary
summary(logit_mod)

And this is what the model summary would look like:

Since all the coefficients are significant and the residual deviance has reduced as compared to the null deviance, we can conclude that we have a fair model. But, looking at the model result this way, it would be really difficult to say how well this model performs. In OLS regression, the R-squared and its more refined measure, the adjusted R-square, would be the ‘one-stop’ metric which would immediately tell us if the model was a good fit or not. And since this was a value between 0 and 1, we could easily change it to a percentage value and pass it off as ‘model accuracy’ for beginners and the not-so-much-math-oriented businesses. Unfortunately, looking at the adjusted R-square would be totally irrelevant in the case of logistic regression, because we model the log odds and it becomes very difficult in terms of explainability.

This is where concordance steps in to help. Concordance tells us the association between actual values and the values fitted by the model in percentage terms. Concordance is defined as the ratio of number of pairs where the 1 had a higher model score than the model score of zero to the total number of 1-0 pairs possible. A higher value for concordance (60-70%) means a better fitted model. However, a very large value for concordance (85-95%) could also suggest that the model is over-fitted and needs to be re-aligned to explain the entire population.
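
A tiny worked example may make the definition concrete. The scores below are made up for illustration; with two 1s and three 0s there are six 1-0 pairs to compare:

```r
# Made-up model scores, for illustration only
score_ones  <- c(0.8, 0.4)        # observations whose actual response is 1
score_zeros <- c(0.3, 0.4, 0.6)   # observations whose actual response is 0

# Compare every 1 with every 0: a 2 x 3 grid of pairs
concordant <- outer(score_ones, score_zeros, ">")
discordant <- outer(score_ones, score_zeros, "<")

pairs_total <- length(score_ones) * length(score_zeros)
concordance <- sum(concordant) / pairs_total
discordance <- sum(discordant) / pairs_total
tied <- 1 - concordance - discordance
print(c(concordance = concordance, discordance = discordance, tied = tied))
```

Here four of the six pairs are concordant, one is discordant (0.4 against 0.6) and one is tied (0.4 against 0.4). The outer() grid is only practical for small data like this, since it holds one cell per pair.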

A straight-forward, non-optimal, brute-force approach to getting to concordance would be to write the following code after building the model:

###########################################################
# Function Bruteforce : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# It uses the brute force method of two for-loops
###########################################################
bruteforce<-function(model){
  # Get all actual observations and their fitted values into a frame
  fitted<-data.frame(cbind(model$y,model$fitted.values))
  colnames(fitted)<-c('respvar','score')
  # Subset only ones
  ones<-fitted[fitted[,1]==1,]
  # Subset only zeros
  zeros<-fitted[fitted[,1]==0,]
  
  # Initialise all the values
  pairs_tested<-0
  conc<-0
  disc<-0
  ties<-0
  
  # Get the values in a for-loop
  for(i in 1:nrow(ones))
    {
    for(j in 1:nrow(zeros))
      {
      pairs_tested<-pairs_tested+1
      if(ones[i,2]>zeros[j,2]) {conc<-conc+1}
      else if(ones[i,2]==zeros[j,2]){ties<-ties+1}
      else {disc<-disc+1}
      }
  }
  # Calculate concordance, discordance and ties
  concordance<-conc/pairs_tested
  discordance<-disc/pairs_tested
  ties_perc<-ties/pairs_tested
  return(list("Concordance"=concordance,
              "Discordance"=discordance,
              "Tied"=ties_perc,
              "Pairs"=pairs_tested))
  }

All this code does is to iterate through each and every 1-0 pair to see if the model score of ‘1’ was greater than the model score of ‘0’. And based on this comparison, it classifies the pair as a concordant pair, discordant pair or a tied pair. The final values for concordance, discordance and ties are expressed as a percentage of the total number of the pairs tested. When this code is run, we see the following output on the console:

As can be seen, the model reports a concordance percentage of 69.2% which tells us that the model is fairly accurate.

Although the above code gets the job done, it can be a real burden on system resources because of the two ‘for-loops’ and no optimization done at all. So, as the modelling data set increases in size, using this function can sometimes lead to a heavy toll on system resources, long waiting time and sometimes, crashing the R-process altogether.

Alternatively, the following function, provided by a fellow blogger Vaibhav here, can be used. It gives the same result in less computation time. The code (originally posted at the above link) is:

###########################################################
# Function OptimisedConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# Although it still uses two-for loops, it optimises the code
# by creating initial zero matrices
###########################################################
OptimisedConc=function(model)
{
  Data = cbind(model$y, model$fitted.values) 
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc=matrix(0, dim(zeros)[1], dim(ones)[1])
  disc=matrix(0, dim(zeros)[1], dim(ones)[1])
  ties=matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1])
  {
    for (i in 1:dim(ones)[1])
    {
      if (ones[i,2]>zeros[j,2])
      {conc[j,i]=1}
      else if (ones[i,2]<zeros[j,2])
      {disc[j,i]=1}
      else if (ones[i,2]==zeros[j,2])
      {ties[j,i]=1}
    }
  }
  Pairs=dim(zeros)[1]*dim(ones)[1]
  PercentConcordance=(sum(conc)/Pairs)*100
  PercentDiscordance=(sum(disc)/Pairs)*100
  PercentTied=(sum(ties)/Pairs)*100
  return(list("Percent Concordance"=PercentConcordance,"Percent Discordance"=PercentDiscordance,"Percent Tied"=PercentTied,"Pairs"=Pairs))
}

This code does the same thing as above, but it records each comparison in matrices that are already initialized with zeroes. The output and the measures for concordance, etc. are exactly the same as in the brute-force approach, while the running time is much lower. Now, just for the sake of comparison, let us see what the savings are by looking at the time taken to execute the two functions. We use the system.time() function to evaluate the time:

The second function does the same thing as the first using only 10% of the time! That is what preallocating the result matrices can do in R, even though the two for-loops are still there.
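For comparison, a genuinely vectorized version can replace both loops with a single call to outer(). The sketch below is only an illustration and not the function from the linked post: the name VectorisedConc and its arguments (a 0/1 outcome vector y and a vector of model scores p) are hypothetical. It still builds a full ones-by-zeros matrix, so it trades memory for speed.

```r
# Hypothetical helper (not from the original post): concordance via outer().
# y: vector of observed 0/1 outcomes; p: vector of model scores.
VectorisedConc <- function(y, p) {
  ones  <- p[y == 1]
  zeros <- p[y == 0]
  # d[i, j] = score of the i-th 'one' minus score of the j-th 'zero'
  d <- outer(ones, zeros, "-")
  pairs <- length(ones) * length(zeros)
  list("Percent Concordance" = 100 * sum(d > 0) / pairs,
       "Percent Discordance" = 100 * sum(d < 0) / pairs,
       "Percent Tied"        = 100 * sum(d == 0) / pairs,
       "Pairs"               = pairs)
}

# Toy data: four observations, two ones and two zeros
res <- VectorisedConc(y = c(1, 1, 0, 0), p = c(0.9, 0.4, 0.4, 0.1))
print(res)
```

For a fitted glm binomial model, the call would presumably be VectorisedConc(model$y, model$fitted.values).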

Of course, other functions can be written that approximate the value of concordance instead of calculating it exactly from all the possible 1-0 pairs. One of the most frequently returned results when you search for concordance is the following link on GitHub. This code performs even better than the optimized function above, but the catch is that it is not accurate: it approximates the number of 1-0 pairs on the assumption that the data usually has as many ones as zeroes. If you calculate the concordance of the above model using this function, this is what you get:

The code has given a higher value for concordance (70.8%) than the actual value (69.2%). However, this might get totally inaccurate if we had sorted the data to have all the top-scoring ones at the top of our data set, in which case concordance would reach an unusually high value. The one advantage of this code is that it is very quick and can be used to get an approximate idea of the range in which the actual concordance lies. And it does not even take a second to do that! My vote would still be for the OptimisedConc function.

Somers D, Gamma, Kendall’s Tau-a statistics in R

Once the total number of pairs, concordant pairs, tied pairs and discordant pairs are obtained, the calculation of the above statistics is pretty easy and straightforward. Gamma (better known as Goodman and Kruskal's Gamma) is a measure of association in a doubly ordered contingency table. Refer here for more info. It can be calculated as:

Gamma = (P - Q) / (P + Q)

where P is the number of concordant pairs and Q is the number of discordant pairs. The tied pairs (T) are left out of the calculation. Gamma measures how much more often the model orders a pair correctly than incorrectly, among the pairs it does not tie.

Somers’ D is very similar to Gamma, but it also counts the tied pairs in the denominator. So, if there are tied pairs in the model, Somers’ D is smaller in magnitude than Gamma. It can be calculated as

Somers’ D = (P - Q) / (P + Q + T)

Both Gamma and Somers’ D range from -1 to 1. For a model that does better than random they lie between zero and one, and a higher value indicates better distinguishing ability for the model.

Kendall’s Tau-a is one more measure of association in the model. It can be computed using the following formula:

Tau-a = (P - Q) / (N(N - 1)/2)

where N is the total number of observations in the model. It again ranges from -1 to 1. However, for any given model, Kendall’s Tau-a would be much smaller than Gamma or Somers’ D because Tau-a takes all possible pairs of observations as the denominator while the others take only the 1-0 pairs.
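As a quick worked example of the three statistics, the sketch below plugs in made-up counts. The numbers (70 concordant, 20 discordant and 10 tied pairs from 20 observations) are purely illustrative and are not taken from the model in this post; the formulas are the standard definitions, in which Gamma ignores ties and Somers' D keeps them in the denominator.

```r
# Illustrative counts only (not from the model discussed in this post)
conc  <- 70   # concordant 1-0 pairs (P)
disc  <- 20   # discordant 1-0 pairs (Q)
tied  <- 10   # tied 1-0 pairs (T)
n_obs <- 20   # total observations (N), e.g. 10 ones and 10 zeros

gamma_stat <- (conc - disc) / (conc + disc)          # ties excluded
somers_d   <- (conc - disc) / (conc + disc + tied)   # ties included
tau_a      <- (conc - disc) / (n_obs * (n_obs - 1) / 2)

print(round(c(Gamma = gamma_stat, SomersD = somers_d, TauA = tau_a), 3))
```

With these counts Gamma comes out at about 0.556, Somers' D at 0.5 and Tau-a at about 0.263, which shows the ordering described above: Tau-a is the smallest because its denominator counts all 190 pairs of observations.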

Once we know these definitions, we can modify the above function OptimisedConc to return these values as well by replacing its last few lines, from the percentage calculations through the return statement, with the following:

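  PercentConcordance=(sum(conc)/Pairs)*100
  PercentDiscordance=(sum(disc)/Pairs)*100
  PercentTied=(sum(ties)/Pairs)*100
  N<-length(model$y)
  # Gamma leaves the tied pairs out of the denominator
  gamma<-(sum(conc)-sum(disc))/(Pairs-sum(ties))
  # Somers' D counts every 1-0 pair, ties included
  Somers_D<-(sum(conc)-sum(disc))/Pairs
  # Kendall's Tau-a divides by all N*(N-1)/2 pairs of observations
  k_tau_a<-2*(sum(conc)-sum(disc))/(N*(N-1))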
  return(list("Percent Concordance"=PercentConcordance,
              "Percent Discordance"=PercentDiscordance,
              "Percent Tied"=PercentTied,
              "Pairs"=Pairs,
              "Gamma"=gamma,
              "Somers D"=Somers_D,
              "Kendall's Tau A"=k_tau_a))

And the call to the function would return:

This post covered one of the practical considerations to be taken into account while running predictive models in R. In the upcoming posts, I plan to cover some of the ways the above outputs can be beautified using HTML and some of the other practical considerations while modeling in R. If you liked this post or found it useful, you can give me a thumbs up using comments/likes. I’ll be back with more on these areas of predictive modeling soon. Till then, happy modeling :)

Update: 18 Feb 2014

A follow-up to this article has been published today. Although OptimisedConc works well to save time, it is very poor in terms of memory utilization. Hence, a better function named 'fastConc' has been written which makes use of native functionality.
You can find the new article and the function on this link.

VBA front end for R

2013-10-16T20:57:00.000+05:30

If you work in the analytics industry, I am sure you would have had this debate sometime or the other – pros and cons of R. While everyone agrees that R is quite powerful and has great graphics, most of us, especially those who have worked on GUI based tools like SASEG, etc agree that the text output of R can be pretty verbose. A colleague of mine ran a linear model and immediately exclaimed ‘it looks so bland!’

This set me exploring ways to beautify R – I found some interesting packages which would help in formatting the output – you can check out prettyR and html converter packages which can do wonders to the plain text output in R. However, my requirements were a little customized. We used Excel in most of our day-to-day activities and VBA is quite powerful in parsing/formatting the results. So, why not use Excel and VBA to create a beautiful front end to run R? It could be a macro-enabled tool which reads input from an Excel sheet, runs the regression code using RScript and displays the formatted output in Excel. Well, it turns out that I was able to do all that and even more – this post explains the findings of my endeavor:

The following questions will be answered during the course of this article:
- How to run an RScript through MS Excel using VBA?
- How to run an RScript through command prompt? [in WINDOWS]
- How to pass arguments to an RScript through command line/external code? [in WINDOWS]
- How to read plain text files in MS Excel using VBA? [obviously WINDOWS :) ]

Just so that we know that our commands are executed correctly, let us write the following simple R code and save it in our directory ‘C:\R_code’ as ‘hello.R’

Contents of ‘C:\R_code\hello.R’
# Prints output to console
cat('Hello World')
var1<-5^3
var2<-7^3
cat('\nThe result of adding',var1,'to',var2,'is',var1+var2)

Running RScript through command prompt:

To be able to run R code through the command prompt or other applications, you need to have the directory containing ‘R.exe’ and ‘RScript.exe’ in your system variable PATH. You can do this easily if you have admin rights to your system. Check this link to know how to do it on Windows 7. However, if you don’t have admin rights and want to add something to the PATH variable, don’t worry – you can easily add it to the USER variable PATH. Here are the steps to do this:

1. Suppose you have the ‘R.exe’ and ‘RScript.exe’ installed in the following directory: ‘C:\Program Files\R\R-2.15.3\bin\x64’. Copy this path to your clipboard.
2. Go to ‘Computer’ --> ‘Properties’
3. On the left pane, click on ‘Advanced system settings’
4. On the ‘System properties’ dialog that opens up, navigate to the ‘Advanced’ tab and click on ‘Environment variables…’

5. On the ‘User variables’ click on ‘New…’

6. In the field ‘Variable Name:’, type PATH
7. In the field ‘Variable Value:’, paste the clipboard value, i.e. ‘C:\Program Files\R\R-2.15.3\bin\x64’. Add a semicolon ‘;’ after that.
8. Click on ‘Ok’ as many times as needed to dismiss all dialog boxes.
9. Open command prompt and type ‘Rscript’ and hit ENTER. You will see the following:

Now that you have Rscript on your path, you can run R code from any directory on your system, including applications like MS-Excel through VBA. Just repeat step 9 by passing any *.R file as argument with the full path and it will execute as expected:
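If you are not sure which directory to copy in step 1, an R session can tell you. The sketch below uses the base functions R.home() and Sys.which(); the variable names are just for illustration, and the directory printed on your machine will differ from the example path used above.

```r
# Directory that holds the R and Rscript executables of the running installation
r_bin_dir <- R.home("bin")
print(normalizePath(r_bin_dir))

# Sys.which() returns an empty string when a program cannot be found on the PATH
rscript_on_path <- unname(nzchar(Sys.which("Rscript")))
print(rscript_on_path)
```

If the second value is FALSE, the PATH change has not taken effect yet for the process that started R; opening a fresh command prompt after step 8 is usually needed.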

Running RScript through VBA:

To run this code in MS-Excel using VBA, you need to open a macro-enabled workbook (*.xlsm). To create one, just create a new workbook, click on ‘Save As..’ and save it as ‘Excel Macro-Enabled Workbook (*.xlsm)’. Once you have a macro-enabled workbook open, press the shortcut key combination ‘ALT + F11’ to open up the VBA editor. Once that is done, right click in the ‘Project Explorer’ to insert a new module (which will be Module1 by default) and then type the following VBA code:

Sub RunRscript()
'runs an external R code through Shell
'The location of the RScript is 'C:\R_code'
'The script name is 'hello.R'

Dim shell As Object
Set shell = VBA.CreateObject("WScript.Shell")
Dim waitTillComplete As Boolean: waitTillComplete = True
Dim style As Integer: style = 1
Dim errorCode As Integer
Dim path As String
path = "RScript C:\R_code\hello.R"
errorCode = shell.Run(path, style, waitTillComplete)
End Sub

If you look at the VBA code carefully, it creates a Windows Script Host shell object (WScript.Shell) and invokes the R command through it. The advantage of using WScript.Shell is that you can get VBA to wait till the execution is finished: that is what the third argument of Run (waitTillComplete, set to True) does. To get more information on how to run a macro or use the VBA editor, you can refer to the many online tutorials that are easily available. A good place to start would be the MSDN tutorial which you can find here.

When you run this macro, you can see that a command window opens up, executes something and closes. But how do you know if the code has actually executed? A good way is to redirect the console output of the R code to a file. You can do this with the sink function in R. Here is the modified R code which accomplishes this:

# Re-directs the console output to a file 'hello.txt'
# The file is created in the directory 'C:\R_code'

sink('C:/R_code/hello.txt',append=F,type="output")
cat('Hello World')
var1<-5^3
var2<-7^3
cat('\nThe result of adding',var1,'to',var2,'is',var1+var2)
sink(NULL)

Once you’ve run the VBA macro, browse to C:\R_code and check whether ‘hello.txt’ has been created. If you can find the file there, then congratulations! You have successfully used VBA to execute an R script.
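One caution with sink(): if a script stops with an error before the redirection is switched off, the output stays redirected for as long as that R session lives. A minimal sketch of a more defensive pattern is shown below; the function name write_hello and the temporary file are illustrative stand-ins and not part of the tool described here.

```r
# Sketch: redirect console output to a file and always restore the console.
# 'write_hello' and the temporary file are illustrative names only.
write_hello <- function(file) {
  sink(file, append = FALSE, type = "output")
  on.exit(sink())   # undo the redirection even if an error occurs below
  var1 <- 5^3
  var2 <- 7^3
  cat('Hello World')
  cat('\nThe result of adding', var1, 'to', var2, 'is', var1 + var2)
  cat('\n')         # finish the last line so the file ends cleanly
}

log_file <- tempfile(fileext = ".txt")
write_hello(log_file)
log_lines <- readLines(log_file)
print(log_lines)
```

In the actual script, the temporary file would be replaced by the fixed path 'C:/R_code/hello.txt' that the VBA code later reads.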

Passing arguments to an RScript through command line/VBA:

Most of the work which we do requires us to pass inputs/parameters to a tool at runtime. In the code above, let’s say we wanted ‘var1’ and ‘var2’ to be passed at runtime instead of being hardcoded the way they are right now. Let us create a simple Excel tool which accepts two numbers and adds them. The front end would look like this:

Now, the only change in the VBA code would be to read inputs from cells D5 and F5 and pass it on to the RScript. The modified code would look like:

Sub RunRscript()
'runs an external R code through Shell
'The location of the RScript is 'C:\R_code'
'The script name is 'hello.R'
Dim shell As Object
Set shell = VBA.CreateObject("WScript.Shell")
Dim waitTillComplete As Boolean: waitTillComplete = True
Dim style As Integer: style = 1
Dim errorCode As Integer
'Each variable needs its own type; "Dim var1, var2 As Double" would leave var1 a Variant
Dim var1 As Double, var2 As Double
var1 = Sheet1.Range("D5").Value
var2 = Sheet1.Range("F5").Value

Dim path As String
path = "RScript C:\R_code\hello.R " & var1 & " " & var2

errorCode = shell.Run(path, style, waitTillComplete)

End Sub

Now, the VBA code is ready to pass two extra parameters to the Rscript and get it executed. But the change on the input side means we will also have to change the R code to accept the input parameters and process them. This can be accomplished using the commandArgs function in R, which returns the arguments as a character vector (hence the as.numeric conversions). The code changes as below:

# Accepts two numbers and adds them
# Re-directs the console output to a file 'hello.txt'
# The file is created in the directory 'C:\R_code'

args<-commandArgs(trailingOnly=T)
# cat(paste(args,collapse="\n"))
sink('C:/R_code/hello.txt',append=F,type="output")
cat('Hello World')
var1<-as.numeric(args[1])
var2<-as.numeric(args[2])
cat('\nThe result of adding',var1,'to',var2,'is',var1+var2)
sink(NULL)

Note the use of the ‘trailingOnly=T’ option in the commandArgs function. This makes the args vector store only those arguments which are passed by the USER. In addition to the USER arguments, RScript passes some system arguments by default. If you are interested in reading those (for example, the ‘--file=’ entry which holds the path of the R file), then you would set the trailingOnly argument to FALSE.
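Because the values arrive from Excel as text, it can help to validate them before doing any arithmetic. The sketch below shows one possible way to do that; the helper name parse_two_numbers is hypothetical, and a fixed vector stands in for commandArgs() so that the snippet can be tried in an interactive session.

```r
# Hypothetical helper: turn the character arguments into two numbers,
# stopping with a clear message when something is missing or non-numeric.
parse_two_numbers <- function(args) {
  if (length(args) < 2) {
    stop("Expected two arguments, got ", length(args))
  }
  nums <- suppressWarnings(as.numeric(args[1:2]))
  if (any(is.na(nums))) {
    stop("Both arguments must be numeric: ", paste(args[1:2], collapse = ", "))
  }
  nums
}

# In the real script this would be: args <- commandArgs(trailingOnly = TRUE)
args <- c("125", "343")   # stand-in for the values passed from Excel
nums <- parse_two_numbers(args)
cat('The result of adding', nums[1], 'to', nums[2], 'is', sum(nums), '\n')
```

An error raised by stop() makes Rscript exit with a non-zero status, which the VBA side should be able to pick up in errorCode since it waits for the command to finish.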

You have now learnt how to invoke R from Excel and how to pass data between R and Excel. You can build on these two functionalities to develop some cool stuff which uses Excel as the front end and R as the backend. With packages like ‘xlsx’, which can create data frames from Excel sheets, you can go on to build many applications like these:

Regression analysis tool can read input data from Excel and build OLS on R

K-Means Cluster tool

Reading text/picture files in MS Excel using VBA:

Although this part does not contain any R code, I am posting it for the sake of completeness. This way, you will have one complete tool to play with. Once you have the output of R in a text file/picture file, you can read it back into Excel using VBA and display the nicely formatted result in Excel. This part will be particularly useful if you want to create a tool that reads data from Excel, does some statistical analysis using R in the backend and then displays the summary of the analysis. Here is the VBA code you can use to parse through a text file:

'This code reads the file 'hello.txt' line by line and stores the result in Sheet2, starting from range A1, in consecutive rows
Dim sFile As String
sFile = "C:\R_code\hello.txt"

Dim ReadData As String
Dim dest As Range
Dim rowNum As Integer
rowNum = 1
Set dest = Sheet2.Cells(rowNum, 1)

Open sFile For Input As #1
Do Until EOF(1)
    'Line Input reads a whole line; plain Input would split it at commas
    Line Input #1, ReadData
    If Len(ReadData) > 0 Then
        dest.Value = ReadData
        rowNum = rowNum + 1
        Set dest = Sheet2.Cells(rowNum, 1)
    End If
Loop
Close #1 'close the opened file
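On the R side, a text file suitable for this kind of line-by-line reading could be produced along the following lines. This is only a sketch: the built-in cars data set, the model and the temporary output file are illustrative choices, and in the actual tool the file would live in 'C:/R_code'.

```r
# Sketch: save a model summary as plain text, one console line per file line.
# The data set, the model and the file location are illustrative choices.
out_file <- file.path(tempdir(), "model_summary.txt")

fit <- lm(dist ~ speed, data = cars)           # simple OLS on a built-in data set
summary_lines <- capture.output(summary(fit))  # console output as a character vector
writeLines(summary_lines, out_file)

print(length(summary_lines))
```

capture.output() returns the printed summary as a character vector, so individual lines (for example the coefficient table) can also be filtered in R before they are written out.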

And the below code can be used to insert pictures into an Excel sheet using VBA:

'Inserts a picture located in R_code into Sheet2 at position A1 onwards
Sheet2.Activate 'Select only works on the active sheet
Sheet2.Range("$A$1").Select
Dim sFile As String
sFile = "C:\R_code\mypicture1.jpg"

ActiveSheet.Pictures.Insert(sFile) _
        .Select
Selection.ShapeRange.Height = 324
Selection.ShapeRange.Width = 396
    With Selection.ShapeRange.Line
        .Visible = msoTrue
        .ForeColor.ObjectThemeColor = msoThemeColorText1
        .ForeColor.TintAndShade = 0
        .ForeColor.Brightness = 0
        .Transparency = 0
    End With

So, with just a little bit of code to format your results, you can get nicely formatted results in the way that you want. Below is a sample of the output from a linear regression model showing model accuracy, beta coefficients (from the text file) and residual plots (from the picture):

This is just the beginning. Once you have integrated R with VBA and vice versa, there is no limit to what you can achieve. R can be a powerful backend for computations where excel fails and I am sure we all agree that excel is still the de-facto standard for sharing and displaying summary reports. By using the interface techniques mentioned in this post, you can make the two of these complement each other very well. I would encourage you to try this out and let me know your thoughts in the comments below. If you like this post, then please follow this blog for more interesting posts, and tell your friends too :)

Hello R World!

2013-09-29T02:10:00.001+05:30

Okay, it has been quite a while since the grand opening to this blog a month ago. Without wasting too much time, let us get to business right away:

To start with R, let us begin with the basic question which has puzzled many a learner when they are introduced to the R language: ‘what should be the approach to learning it?’ or, simply put, ‘where do I start?’

- A novice college student in a basic statistics course would want to think of R as an advanced calculator
- A statistician would want to think of R as an optimizer which would give the ‘best-fit’ model for all the observed data points
- Switchers from STATA/SPSS/SAS would want to see it as a replacement for the software which they were so good at using (‘How do I get this R to do PROC SQL/PROC REG?’)

While none of these perspectives on R is wrong, a programmer’s perspective would be to treat it as an object oriented, interpreted language and to understand the basic programming constructs and the programming environment, which is what I present to you in this post.

So, the first thing you do is to get the environment for coding set up on your system by downloading and installing R – a lot of support and documentation is available for the same, and let us skip that portion in this post.

Another thing which a programmer would be particularly interested in would be in the IDE for development. If you downloaded the base-R from CRAN website, you would have already gotten a basic GUI for R – with a console to type out commands and execute them line by line, along with a simple editor where you can write lines of code and execute them together. This would look something like this:

Going by the pace that R has caught up in the programming/analytics world, it was imperative that an IDE was needed and sometime around 2011 came RStudio – an open source IDE for R. I have been using it for over a year now and found it to be pretty useful – editor, graphics, console and the workspace information… all integrated into a single easy-to-use interface. RStudio has grown from strength to strength and it is now very popular among R users worldwide:

Okay, now to some code. Let us see what the syntax is for the ubiquitous hello-world program. Because you can think of R as a programming language or statistical software or both of these super imposed on one, R has more than one way of accomplishing the same thing. And every command you execute is written to the console output by default. So, if you want to see the result of a simple math operation/conditional expression, just type them in the inputs and the results are up on the console:

> 2*13
[1] 26
> 3^2
[1] 9
> sqrt(3940225)
[1] 1985

Notice the [1] in square brackets before each result? The basic data type in R is called a ‘vector’ and can be thought of as an array, so even a single number is a vector of length one. The number in the brackets is simply the index of the first element shown on that line. So, when the output is presented as
[1] 26
it just means that the result is a vector whose first (and only) element is 26.

Coming back to our ‘hello world’ program, all we need to do to output text on the console is to use the function ‘print’ which outputs values to the console

print('hello world')
[1] "hello world"

There you go! There are a lot of options in the print function itself. For example, if you don’t like the quotes to be present in your output, you can remove them. If you want to join two vectors and then print them, you can do that too, using the c() function to combine vectors. Some examples below:

> print('hello world')
[1] "hello world"
> print('hello world',quote=F)
[1] hello world
> print(pi)
[1] 3.141593
> print(pi,digits=3)
[1] 3.14
> print(c('The value of pi is',pi))
[1] "The value of pi is" "3.14159265358979"  
> print(c('The value of pi is',pi),quote=F)
[1] The value of pi is 3.14159265358979

You have now accomplished your first programming task in R with the printing of 'hello world'. But wait, there is more… do you see that [1] before all the print outputs? What does that mean? Simply put, print shows its argument as a vector and prefixes each line of output with the index of the first element on that line. And when you joined two values with c() and passed them to print, the result that got printed had two elements (the number was converted to a string so that both elements have the same type). You can already see that it can get messy if you print a long vector like this:

> print(rep(pi,20),digits=3)
 [1] 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14 3.14
[16] 3.14 3.14 3.14 3.14 3.14
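To see what is going on behind these prefixes, the small sketch below inspects the vectors directly. The variable names x and y are just for illustration.

```r
# c() puts text and a number into one vector, so the number becomes text
x <- c('The value of pi is', pi)
print(class(x))    # type of the combined vector
print(length(x))   # number of elements

# The bracketed prefix is the index of the first element on each printed line
y <- rep(pi, 20)
print(y[16])       # the element that starts the second line in the output above
```

The combined vector is of class "character" and has two elements, which is why pi shows up in quotes with all its digits in the earlier print examples.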

How do we then print without the index prefix and the quotes? This is where cat comes in:

No, not that cat :)
But this one: the function cat(). If you have used Excel, you already know of the formula to concatenate strings. cat() does much the same thing in R: it joins its arguments, separated by spaces, and writes the result to the console as plain text, without the [1] prefix or the quotes. So the following would happen:

> cat('hello world')
hello world
> cat(pi)
3.141593
> cat('The value of pi is',round(pi,digits=2))
The value of pi is 3.14
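When the text needs to be built first and printed later, base R also offers paste() and sprintf(). The following is a minimal sketch; the variable names msg1 and msg2 are arbitrary.

```r
# paste() joins its arguments into one string, separated by a space by default
msg1 <- paste('The value of pi is', round(pi, digits = 2))

# sprintf() fills a C-style format string; %.2f means two decimal places
msg2 <- sprintf('The value of pi is %.2f', pi)

cat(msg1, '\n')
cat(msg2, '\n')
```

Both calls produce the same string here; sprintf() tends to be handier when the number formatting needs to be controlled precisely.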

I’ll leave it at that for now. The key takeaway is that since R is a programmer’s language, you can accomplish most of the things which you desire, rather than stick to some convention as you would if you used proprietary software. If you didn't like either print or cat, there are other functions too, like paste(), sprintf(), etc., which can help with the same task. Sometimes, exploring all of this can get a little overwhelming and seem futile. But that’s where the power of open source comes in. R has a lot of support forums and communities where you can search for the exact function which will suit your exact need. I refer to Stack Overflow and Cross Validated (the statistics site on Stack Exchange) and in most cases get whatever I need. You can explore them whenever you need help. I’ll take leave now and come back with more interesting posts soon. Till then, happy explo’R’ing ! :)

R you ready?

2013-08-18T00:40:00.001+05:30

I have had this thought in mind for very long now – about opening a dedicated blog only for a programming language. I felt that it will be a great learning experience where I get to share whatever I read from other sources, pieces of code that I have tried on my own and learn from other fellow coders in the blogverse. However, lack of time and my own internal inhibitions were always stopping me from converting the thought into action.

One of the main inhibitions inside me which kept preventing me from starting an exclusive programming language blog was the lack of confidence to call myself a programmer. I have always been fascinated with technology, especially the information technology industry, and spent my entire career working on technology solutions. Although I can devote hours of effort in debugging code and finding out things which don’t work, writing an efficient or fascinating piece of code does not come naturally to me. I have known some, met some and worked with gifted programmers who write codes like a breeze. I have been in awe of their programming capabilities. What I came to realize upon interactions with them was that even if one does not have gifted coding skills, it takes some effort to become a ‘spotter’ – someone who can spot nicely written code and appreciate the beauty and the craftiness which goes in coming up with such lines. I want to be a spotter, a collector or an integrator of sorts who collects masterpieces of code-art into one nice collection that can serve as an archive for anyone who wants to delve deep into! Although a coder might do the job of an artist by painting a nice picture, it is the collector who puts up the picture on display and showcases the art to the people interested in it. This blog will be an effort to do exactly that – collect all nice pieces of code and integrate them here. And yes, due credit and appreciation will definitely be given to the deserving artists!

The choice of the language

So, having decided to start a programming blog, the immediate question was that of the programming language itself. It was a little more than a year ago that I set foot into the world of data analytics, data mining and statistical modeling and was quite fascinated by it. There were a lot of statistical packages available, but majority of the work in corporate analytics continued to be done on… you guessed it right … EXCEL – the ubiquitous tool on which most consulting, IT, finance, and business organizations rely on, even to this day. Apart from this, there was other analytical software available like eviews, matlab, stata, crystal ball, etc but the choice was always going to be among the big three – SPSS, SAS and R.

SPSS is IBM’s proprietary tool for data analysis and finds its origin in the social sciences. SAS is proprietary too – it comes from the statistical sciences pavilion and its procedures are used quite extensively to build models in marketing and life sciences. And then there is R – a GNU project, written mostly in C, Fortran and R itself, which supports object oriented programming and is highly extensible. Coming from a programming background, the choice of the language to create a blog on seemed quite obvious – it had to be R! Open source, highly powerful, vectorization for complex tasks, extremely eye-catching graphical support, extensibility through freely available packages, and lots of help on online forums are a few of the things which distinguish R and make it a natural choice for bloggers. But wait, there’s more to it. Most of the “data analysts” that I have come across in my industry come with an inherent bias against programming. In fact, a majority of the nascent analytics industry is formed from people who want to do something other than IT jobs. This blog will be an attempt to woo all these programming-averse candidates with the variety that R provides, and to demonstrate how simple it actually is to code some seemingly complex tasks using OOP concepts. No, you would not need a SAS/SPSS macro for complex tasks.

While most of the content on this blog would make references to proprietary tools and procedures like SAS/SPSS, the intent would be showcasing the simplicity of the language R and not to show any other software in poor light. If you are looking for a comparative study on which software is better for statistical computing, this site is not going to help you. In fact, the debate on which software/tool is the best for data analytics has been on for quite some time now with no clear winner in sight. If you want my opinion on that, just stop worrying about the tool and instead focus on the design, technique or the underlying statistical concept. Once you master that, putting it on a tool becomes a formality. I read this somewhere – ‘if your only tool is a hammer, every problem in the world looks like a nail’. To know more about the comparative evaluation of statistical packages, visit the pages here, here and here.

In fact, in spite of having a lot of online support and extensibility, R still has a few limitations, such as the lack of easy interfaces for debugging and the inability of base R to handle data larger than the system’s RAM. As we go further in this blog, we will continue to explore each of these limitations and address how they can be worked around. And in cases where R does not have a solution, we will admit that other packages are better and move on.

If you liked what you’ve read and want to join/contribute, please feel free to reach out to me. If you want to follow the blog and learn more about R, kindly click on the ‘follow’ button on the left side of the page. You can join through google or follow me on facebook here. Comments/suggestions for improvements are always welcome.