Variable Analysis


During the analysis of the variables and their relationship with the outcome variable, we can choose bivariate or multivariate analysis. Both have their own advantages and disadvantages.

Bivariate analysis considers the effect of a single variable on the outcome variable, ignoring the effect of the others. Depending on the correlation factor, a variable can either be chosen for further analysis or rejected. However, rejecting a variable solely on the basis of its correlation factor is not a wise decision.

On the other hand, multivariate analysis checks the relationship between the outcome variable and all the independent variables together. This type of analysis gives a clearer picture of how the variables jointly affect the outcome (dependent) variable. The scenario can be described as multidimensional, because more than two or three variables are involved.

In the process of modeling, it is very important to select the right variables so that the model can predict with high accuracy. Running both analyses step by step will reveal the more appropriate choices.

However, the choices should not always be made on the basis of numerical values or percentages alone. Sometimes it is wiser to include a variable that may have an effect in the future or under different conditions, even though it has a low score.

Methods of choosing the variables

There are many statistical tests available for choosing the relevant variables. A few important ones are as follows:

1. Chi-squared

The above method tests the association between each predictive variable and the log of the odds of a bad outcome. This measures the predictive power of the variable, that is, how important the variable can be for building a predictive model of the outcome variable.
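As a rough illustration, the sketch below computes the plain Pearson chi-squared statistic for a contingency table of a binned predictor against a good/bad outcome. This is the simplest form of the test and may differ from the log-odds-based statistic that a scoring tool reports. The function name `chi_squared` and the counts are made up for illustration.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table.

    `table` is a list of rows; each row holds the observed counts of
    one predictor category across the outcome classes.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if predictor and outcome were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are two predictor categories,
# columns are (good outcomes, bad outcomes).
observed_counts = [[90, 10], [60, 40]]
statistic = chi_squared(observed_counts)
print(statistic)
```

A larger statistic means the bad rate differs more across the categories than independence would predict; a value near zero suggests the variable carries little information about the outcome.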

2. Spearman correlation

The above method gives the correlation between the rankings of the predictive variable and the outcome variable, not between their actual values. In this analysis the relationship does not have to be linear; it only has to be monotonic, either increasing or decreasing.
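A minimal sketch of the idea, using only the Python standard library: the values are replaced by their ranks (ties get the average rank) and an ordinary Pearson correlation is computed on those ranks. The helper names and the sample data are assumptions for illustration.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotonic but clearly non-linear relationship.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

On this data the rank correlation is perfect while the ordinary correlation is not, which is exactly the case where ranking a variable by Spearman correlation is more forgiving than a linear measure.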

3. Information Value

The Information Value is the most interesting of the three because it measures the amount of information that a variable can contribute when designing a model. It is measured on the basis of the deviation of the values within the variable, and it is based on Information Theory. In practice the score usually falls between 0 and 3.
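The post does not give a formula, so the sketch below assumes the definition commonly used in credit scoring: the variable is split into bins, and the Information Value is the sum over bins of (share of goods minus share of bads) times the natural log of their ratio (the weight of evidence). The function name and the counts are hypothetical, and every bin is assumed to contain at least one good and one bad case.

```python
import math


def information_value(bins):
    """Information Value from per-bin (good_count, bad_count) pairs.

    Assumes the common weight-of-evidence definition and that no bin
    has a zero count, since the logarithm would be undefined.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    iv = 0.0
    for good, bad in bins:
        share_good = good / total_good
        share_bad = bad / total_bad
        weight_of_evidence = math.log(share_good / share_bad)
        iv += (share_good - share_bad) * weight_of_evidence
    return iv


# Hypothetical binned variable: (goods, bads) in each of two bins.
separating = information_value([(80, 20), (20, 80)])
uninformative = information_value([(50, 50), (50, 50)])
print(separating, uninformative)
```

A variable whose bins have the same mix of goods and bads scores zero, and the score grows as the bins separate the two outcomes more sharply.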

Example:


The above is just an example of how the table might look. From the above table it is clear that variable 1 has the highest Information Value, but a lower Chi-squared than variable 2 and a negative correlation.

Variable 3, on the other hand, has the lowest Information Value as well as the lowest Chi-squared and Spearman correlation, so it can be removed from the analysis, UNLESS you consider it to be of some value from the business perspective.


After the above process is carried out, each variable can be assigned a new score based on these three scores. The variables can then be re-ranked using this new score, which gives a new perspective for choosing a variable.
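The post does not say how the three scores are combined, so the sketch below assumes one simple option: rank the variables separately on each score (using the absolute value of the Spearman correlation, since a strong negative correlation is still useful) and order them by their mean rank. The function names, variable names and numbers are all hypothetical.

```python
def rank_descending(scores):
    """Map each variable name to its rank, 1 being the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def combined_ranking(iv, chi2, spearman_rho):
    """Order variables by mean rank across the three scores.

    Ties are not handled specially; this is only one of many possible
    ways to merge the scores.
    """
    iv_rank = rank_descending(iv)
    chi2_rank = rank_descending(chi2)
    # Sign does not matter for predictive strength, so rank on |rho|.
    rho_rank = rank_descending({k: abs(v) for k, v in spearman_rho.items()})
    mean_rank = {
        name: (iv_rank[name] + chi2_rank[name] + rho_rank[name]) / 3
        for name in iv
    }
    return sorted(mean_rank, key=mean_rank.get)


# Made-up scores for three hypothetical variables.
iv = {"var_a": 0.45, "var_b": 0.30, "var_c": 0.02}
chi2 = {"var_a": 12.0, "var_b": 25.0, "var_c": 1.5}
spearman_rho = {"var_a": -0.40, "var_b": 0.35, "var_c": 0.05}

ranking = combined_ranking(iv, chi2, spearman_rho)
print(ranking)
```

Averaging ranks rather than raw scores avoids having to put the three measures on a common scale, at the cost of ignoring how far apart the scores actually are.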

The whole process can be iterative, because the result depends on the sampling algorithm that has been used. Different samples can give different results, so the process can be lengthy and laborious.

Reducing Redundancy in Variables

Why is it necessary to reduce redundancy in the variables? There are many reasons; some of them are as follows:
  • It overcrowds the model with many variables that serve no purpose.
  • It reduces the significance of the estimated coefficients of the parameters.
  • It risks overfitting the model.
  • It can destabilise the estimates.
  • It increases the computation time, which can be crucial when the data runs into millions of records.

How can we identify the redundancy?

In order to identify the redundancy, correlation must be computed among the predictive variables themselves, not with the outcome variable. The variables are then ranked according to their correlation factor and grouped into clusters of highly correlated variables. From each cluster, one variable may be picked by reference to the three scores discussed above.
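One way to sketch this, under stated assumptions: instead of a full clustering algorithm, the code below uses a simple greedy pass. Variables are visited from the best score to the worst, and a variable is kept only if its correlation with every variable already kept stays below a threshold. The threshold of 0.8, the function names, the column names and the data are all assumptions for illustration.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def drop_redundant(columns, scores, threshold=0.8):
    """Keep the best-scoring variable from each correlated group.

    `columns` maps variable name to its values; `scores` maps variable
    name to a predictive score (higher is better).
    """
    kept = []
    for name in sorted(columns, key=scores.get, reverse=True):
        correlated_with_kept = any(
            abs(pearson(columns[name], columns[other])) >= threshold
            for other in kept
        )
        if not correlated_with_kept:
            kept.append(name)
    return kept


# Hypothetical data: "income_copy" is nearly a duplicate of "income".
columns = {
    "income": [10, 20, 30, 40, 50],
    "income_copy": [11, 19, 31, 41, 49],
    "age": [30, 25, 45, 20, 35],
}
scores = {"income": 0.5, "income_copy": 0.4, "age": 0.2}

selected = drop_redundant(columns, scores)
print(selected)
```

The greedy pass is cheaper than clustering and always retains the strongest member of a correlated group, but the result depends on the chosen threshold, so it is worth trying a few values.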

There are many commercial software packages available for the above process. The one that comes to my mind is SAS, but other statistical software such as MATLAB can also carry out the operation with a custom programming module written for it.
