Variable Analysis

When analysing the variables and their relationship with the outcome variable, we can choose bivariate or multivariate analysis. Both have their own advantages and disadvantages.

Bivariate analysis considers the effect of a single variable on the outcome variable, ignoring the effect of the others. Depending on the correlation, a variable can either be kept for further analysis or rejected. But rejecting a variable purely on the basis of its correlation is not a wise decision.

Multivariate analysis, on the other hand, checks the relationship between the outcome variable and all the independent variables together. This type of analysis gives a clearer picture of how the variables jointly affect the outcome (dependent) variable. The scenario can be described as multidimensional, because more than two or three variables are involved.

In modeling, it is very important to select the right variables so that the model can predict with high accuracy. Running both analyses step by step will reveal the more appropriate choices.

But the choice should not always be made on the basis of numerical values or percentages. Sometimes it is wiser to include a variable that may have an effect in the future or under different conditions, even though it has a low score.

Methods of choosing the variables

There are many statistical tests available for choosing the relevant variables. A few important ones are as follows:

1. Chi-squared

This method measures the association between each predictor variable and the log of the odds of a bad outcome. It lets us measure the predictive power of a variable, that is, how important the variable can be for building a predictive model of the outcome variable.
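The post does not say which chi-squared variant is meant, so the following is only a minimal sketch that assumes the plain Pearson chi-squared statistic on a contingency table of predictor categories against good/bad outcomes. The function name `chi_squared` and the counts are invented for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: each row is one category of a predictor,
# the columns are (good outcomes, bad outcomes).
table = [[90, 10], [60, 40]]
print(chi_squared(table))  # 24.0
```

A larger statistic means the outcome mix differs more between the categories, which is one way to read "predictive power" here.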

2. Spearman correlation

This method gives the correlation between the rankings of the predictor and outcome variables rather than their actual values. The relationship does not have to be linear; it only has to be monotonic, either increasing or decreasing.
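As a rough illustration, Spearman correlation can be computed as the ordinary Pearson correlation of the ranks. The sketch below uses only the standard library; the helper names and the sample data are made up for this example.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman(x, y))  # very close to 1: the ranks agree perfectly
print(pearson(x, y))   # below 1: the raw values are not on a straight line
```

The gap between the two numbers is the point of the method: ranking removes the curvature and keeps only the ordering.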

3. Information Value

The Information Value is the most interesting of the three because it measures the amount of information that a variable can contribute when designing a model. It is based on information theory and is computed from how differently the variable's values are distributed across the outcome classes. The score is never negative and, in practice, usually falls within the range 0-3.
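The post gives no formula, so this sketch assumes the definition commonly used in credit scoring: for each bin of the variable, take the share of all good outcomes and the share of all bad outcomes that fall in it, and sum (good share - bad share) * ln(good share / bad share). The function name and the counts are hypothetical.

```python
import math


def information_value(bins):
    """bins: list of (good_count, bad_count), one pair per bin or category.

    Assumes every count is greater than zero; empty cells would need smoothing.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        # Weight of evidence for this bin.
        woe = math.log(good_share / bad_share)
        iv += (good_share - bad_share) * woe
    return iv


# Hypothetical variable with two bins that separate good and bad outcomes well.
print(information_value([(80, 20), (20, 80)]))
# A variable whose bins all have the same good/bad mix carries no information.
print(information_value([(50, 50), (50, 50)]))  # 0.0
```

Each term in the sum is non-negative, which is why the score never drops below zero.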

Example:

The above is just an example of how the table might look. From the table it is clear that variable 1 has the highest Information Value, but a lower chi-squared score than variable 2 and a negative correlation.

Variable 3, on the other hand, has the lowest Information Value, chi-squared score and Spearman correlation, so it can be removed from the analysis, unless you consider it to be of some value from the business perspective.

After the above process is carried out, each variable can be given a new score based on these three scores. The variables can then be re-ranked by this new score, which offers a new perspective for choosing among them.

The whole process can be iterative, because the result depends on the sampling algorithm used. Different samples can give different results, so the process can be lengthy and laborious.
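The post does not specify how the three scores are combined. One simple possibility, shown here purely as an assumption, is to rank the variables on each score separately and average the ranks. The numbers below are invented and only mimic the qualitative pattern described in the example; the absolute value of the Spearman correlation is used so that a strong negative relationship counts as much as a strong positive one.

```python
# Hypothetical scores, not taken from any real data set.
scores = {
    "variable_1": {"iv": 0.45, "chi2": 30.0, "spearman": -0.40},
    "variable_2": {"iv": 0.30, "chi2": 42.0, "spearman": 0.35},
    "variable_3": {"iv": 0.02, "chi2": 1.5, "spearman": 0.05},
}


def rank_descending(values):
    """Map each name to its rank, where 1 is the largest value (ties not handled)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def combined_ranking(scores):
    """Return (name, average rank) pairs, best variable first."""
    iv_rank = rank_descending({n: s["iv"] for n, s in scores.items()})
    chi_rank = rank_descending({n: s["chi2"] for n, s in scores.items()})
    rho_rank = rank_descending({n: abs(s["spearman"]) for n, s in scores.items()})
    combined = {n: (iv_rank[n] + chi_rank[n] + rho_rank[n]) / 3 for n in scores}
    return sorted(combined.items(), key=lambda item: item[1])


for name, avg_rank in combined_ranking(scores):
    print(name, round(avg_rank, 2))
```

Averaging ranks avoids having to put the three scores on a common scale; a weighted average would be an equally reasonable choice if one score matters more to the business.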

Reducing Redundancy in Variables

Why is it necessary to reduce redundancy among the variables? There are many reasons; some of them are as follows. Redundant variables:
  • Overcrowd the model with many variables that serve no purpose.
  • Reduce the significance of the estimated coefficients of the parameters.
  • Increase the risk of overfitting the model.
  • Can destabilise the estimates.
  • Increase the computation time, which can be crucial when the data runs into millions.

How can we identify the redundancy?

To identify redundancy, the correlation must be computed among the predictor variables themselves, not with the outcome variable. The variables are then ranked according to their correlation and grouped into clusters of highly correlated variables. From each cluster, one variable may be picked by reference to the three scores discussed above.
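A simplified sketch of this idea follows. It is not a full variable-clustering procedure; it is a greedy filter that visits the variables from best to worst score and keeps one only if it is not strongly correlated with any variable already kept. The column names, the data, the scores and the 0.8 threshold are all assumptions made for illustration.

```python
def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def drop_redundant(columns, scores, threshold=0.8):
    """Keep the best-scoring variable from each group of highly correlated ones."""
    kept = []
    for name in sorted(columns, key=lambda n: scores[n], reverse=True):
        # Compare only with variables already kept, never with the outcome.
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictors: "salary" is almost a copy of "income".
columns = {
    "income": [20, 35, 50, 65, 80],
    "salary": [21, 34, 52, 64, 79],
    "age": [45, 23, 60, 31, 38],
}
# Hypothetical per-variable scores, e.g. the combined score discussed earlier.
scores = {"income": 0.40, "salary": 0.35, "age": 0.10}

print(drop_redundant(columns, scores))  # ['income', 'age']
```

Because the variables are visited in score order, the survivor of each correlated group is always its highest-scoring member.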

There are many commercial software packages available for this process. The one that comes to my mind is SAS. But other statistical software, such as MATLAB, can also carry out the operation with some custom programming.

