## A Little Primer on Multicollinearity

May 1, 2009

[First published September 25, 2005] Many of the studies on the democratic peace I deal with here, including mine, use multiple regression (MR), and involve the problem of multicollinearity. I want to deal with this problem to aid readers and students in understanding what it is, why it is a problem, and what to do about it. I meant this to be more conceptual than technical, as were my treatments of the correlation coefficient (here) and factor analysis (here), but that would be much too long for a blog. So, I will have to go technical. For a brief, largely nontechnical introduction to regression analysis, go here.

First, I have to distinguish descriptive MR from inferential MR. Few applied statistics books do this, for it is almost universally assumed that one is analyzing a sample of data collected from, and assumed to properly represent, a population (sometimes called the universe). An analysis of the IQ and income of a randomly selected group of 100 people might be assumed to reflect the variation in these variables for all people, or an opinion poll of 1,000 selected adults might represent the opinions of all Americans. But in political analysis we often deal with the total universe, such as all Senators, all nations, or all Supreme Court Justices. Then our analysis is not inferential but descriptive, and tests of significance are not applicable, except in a special sense. (It has been argued that these are actually samples from history, but then they are not random in any sense.) I will deal with this in another blog, and assume here that we are dealing with all cases, that is, the total universe.

UNDERSTANDING REGRESSION COEFFICIENTS

For simplicity, I will deal only with two independent variables, although theoretically there could be a dozen or more. In general terms, consider the function y = a + bx + cz + e, where: y is the dependent variable, a the intercept, and b and c the regression coefficients (weights, constants) for the independent variables x and z, and e is the error of estimate of y (or the residuals). One would hope in fitting this function to a set of data (called regression analysis), that e is minimized such that the independent variables provide a good fit to (explanation or prediction of) y. This fit is assessed by the squared multiple correlation (SMC), which gives the amount of variance in y that is accounted for by (linearly related to) the independent variables.

The standard, though not the only, way of doing this is least squares, which minimizes the sum of the squared values of e and which virtually every MR statistical application uses.

Now, let us say that the variables are all standardized (each variable’s mean is subtracted out, and the result is divided by the variable’s standard deviation). The resulting standardized variables have a mean = 0, and standard deviation = 1. The virtue of standardization is that it makes variables measured in different units comparable, such as exports in dollars and deaths per 100,000 people. The sum of the products of the standard scores on two variables divided by the number of cases = their (product moment) correlation coefficient.
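To make the last point concrete, here is a minimal sketch in Python with NumPy. The two data series are invented for illustration and are not from any study discussed here. It standardizes each variable using the standard deviation computed over all N cases, and then checks that the mean product of the standard scores matches the correlation NumPy reports.

```python
import numpy as np

def standardize(v):
    """Subtract the mean and divide by the standard deviation (dividing by N)."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()  # np.std uses ddof=0 by default

# Hypothetical illustrative data: exports in dollars, deaths per 100,000.
exports = np.array([120.0, 340.0, 560.0, 210.0, 890.0, 430.0])
deaths = np.array([3.1, 2.4, 1.9, 2.8, 0.7, 2.0])

zx = standardize(exports)
zy = standardize(deaths)

# Sum of the products of the standard scores, divided by the number of cases.
r_from_z = np.sum(zx * zy) / len(zx)

# NumPy's own product moment correlation, for comparison.
r_numpy = np.corrcoef(exports, deaths)[0, 1]

print(r_from_z, r_numpy)
```

The two numbers should agree to rounding error, whatever units the original variables were measured in.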

Now, define the squared correlation coefficient as SCC. If all the data are so standardized, then SMC = SCC(y,x) + SCC(y,z, holding x constant) × (1 − SCC(y,x)). That is to say, the proportion of variance in y accounted for by the independent variables = (that explained by x alone) + (the proportion of the remainder explained by z) × (the variance in y unexplained by x).
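This decomposition can be checked numerically. The sketch below uses simulated data (the coefficients 0.6, 1.0, and 0.5 are arbitrary assumptions), computes the SMC directly from a least squares fit, and compares it with the sum built from SCC(y,x) and the squared partial correlation of y and z holding x constant. The partial correlation is computed from the standard three-variable formula.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)            # z is correlated with x
y = 1.0 * x + 0.5 * z + rng.normal(size=n)  # y depends on both

def r(a, b):
    """Product moment correlation between two variables."""
    return np.corrcoef(a, b)[0, 1]

# SMC from the least squares fit of y on x and z (with an intercept).
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
smc = 1.0 - resid.var() / y.var()

# Partial correlation of y and z, holding x constant.
r_yx, r_yz, r_xz = r(y, x), r(y, z), r(x, z)
partial_yz_x = (r_yz - r_yx * r_xz) / np.sqrt((1 - r_yx**2) * (1 - r_xz**2))

# SMC = SCC(y,x) + SCC(y,z, holding x constant) * (1 - SCC(y,x))
smc_decomposed = r_yx**2 + partial_yz_x**2 * (1 - r_yx**2)

print(smc, smc_decomposed)
```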

SCC(y,z, holding x constant) is the squared partial correlation coefficient between y and z, holding x constant: the share of the variance in y left unexplained by x that z accounts for. If there is no multicollinearity in the standardized independent variables, then the additional variance each one accounts for is simply its squared correlation with y, and the regression coefficients (sometimes called beta weights or beta coefficients) are simply the correlations of x and z with y. If there is multicollinearity, however, then the regression coefficient for z depends in part on the partial correlation coefficient it has with y controlling for x.
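A small simulated check may help here. In this sketch (all data and coefficients are invented for illustration) the first pair of independent variables is constructed to be exactly uncorrelated, and the standardized regression coefficients then reproduce the simple correlations with y. In the second pair the independent variables are correlated, and the coefficient for x follows the usual two-predictor formula instead.

```python
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std()

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

def beta_weights(y, *predictors):
    """Least squares coefficients with every variable standardized."""
    Z = np.column_stack([standardize(p) for p in predictors])
    coef, *_ = np.linalg.lstsq(Z, standardize(y), rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 300
x = standardize(rng.normal(size=n))
w = rng.normal(size=n)
w = w - w.mean()

# Case 1: z made exactly uncorrelated with x by removing its projection on x.
z_orth = w - (w @ x) / (x @ x) * x
y_orth = 0.7 * x + 0.4 * z_orth + rng.normal(size=n)
betas_orth = beta_weights(y_orth, x, z_orth)
corrs_orth = np.array([r(y_orth, x), r(y_orth, z_orth)])

# Case 2: z correlated with x, so the beta weights no longer equal the correlations.
z_corr = 0.8 * x + w
y_corr = 0.7 * x + 0.4 * z_corr + rng.normal(size=n)
betas_corr = beta_weights(y_corr, x, z_corr)
r_yx, r_yz, r_xz = r(y_corr, x), r(y_corr, z_corr), r(x, z_corr)
expected_beta_x = (r_yx - r_yz * r_xz) / (1 - r_xz**2)

print(betas_orth, corrs_orth)
print(betas_corr[0], r_yx, expected_beta_x)
```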

Perhaps a simpler way of seeing this is to consider a bivariate regression of y just on x with error e. Then when the second variable is included, it can only account for e, that is, the residuals: the variance left in y after that accounted for by x is removed.

WHAT IS MULTICOLLINEARITY?

This occurs when the independent variables are intercorrelated, that is, they are linearly interrelated. The term intercorrelated is crucial, since it is not the simple correlation between any two variables that is critical, but the linear dependence of each independent variable on all the others. If there is no such linear dependence, there is no multicollinearity.

WHY IS THIS IMPORTANT?

Usually, and especially in the social sciences, independent variables will have some linear relationship. Depending on the degree of multicollinearity, the regression coefficients can be very misleading to interpret. In the worst case, the regression coefficients will be affected by substantive and random errors in the data such that they are noncomparable one to another, and descriptively uninterpretable.

Were this regression analysis done on a sample and inferential statistics applied (e.g., tests of significance), then the standard error of each regression coefficient would be much enlarged. The t-statistic would thereby be sharply reduced, perhaps to the point of showing no significance for any of the regression coefficients. One could have a scratch-one’s-head regression fit where the multiple R is large, while no regression coefficient is significant.
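A rough simulation can illustrate the enlarged standard error. This is only a sketch under assumed values (100 cases, 200 repeated samples, true coefficients of 1.0, and a population correlation of either 0 or .99 between x and z), not an analysis of real data. It measures how much the estimated coefficient on x bounces around from sample to sample.

```python
import numpy as np

def coef_spread(rho, n=100, reps=200, seed=0):
    """Std. dev., across simulated samples, of the estimated coefficient on x.

    rho is the (assumed) population correlation between x and z.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        x = rng.normal(size=n)
        z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 * x + 1.0 * z + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x, z])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(b[1])  # coefficient on x
    return float(np.std(estimates))

spread_uncorrelated = coef_spread(rho=0.0)
spread_collinear = coef_spread(rho=0.99)

print(spread_uncorrelated, spread_collinear)
```

With the independent variables correlated at .99, the spread of the coefficient should come out several times larger than with uncorrelated variables, even though the true coefficients are the same in both cases.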

Even the signs of the regression coefficients may be changed. Let b(y,x) stand for the regression coefficient between the dependent variable y and independent variable x, and r(y,x) stand for their correlation. If b(y,x) and r(y,x) have the same sign, then the bias in the regression coefficient is upward; otherwise, the bias is downward.

If one is seeking to compare the causal effects of variables, as in Gartzke trying to compare the effects of democracy versus economic freedom on violence, high multicollinearity makes such a comparison difficult, in part because of the involvement of partial correlations for successive variables. Not only does it inflate or deflate the regression coefficients, but it may even change their signs.

HOW TO GAUGE MULTICOLLINEARITY

Gauging it is a problem because multicollinearity is a continuous function of the intercorrelations among the independent variables. There is no easy way to say that there is or is not too much multicollinearity except at the extremes. But there are measures we can use.

Consider the matrix of independent variables X (variables by column, cases by row). In getting a least squares solution, the symmetrical matrix X’X (where X’ is the transpose of X) is calculated. If the determinant (D) of X’X = 0, the independent variables have perfect multicollinearity. As D approaches 0, the resulting regression involving X increasingly involves random and substantive error. If the variables in X are standardized, which is the usual case in statistical applications, then X’X divided by the number of cases = R, the correlation matrix. Then perfect nonmulticollinearity, or linear independence, obtains when D = 1.
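Since few programs report D, it can be computed directly. The following is a minimal sketch with made-up data: one set of variables is constructed to be exactly uncorrelated (through a QR decomposition of centered random numbers, which is just a convenient device, not part of any regression routine), and the other contains two nearly identical variables.

```python
import numpy as np

def standardize_columns(X):
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def correlation_matrix(X):
    """R = X'X / N for standardized X (variables by column, cases by row)."""
    Z = standardize_columns(X)
    return Z.T @ Z / len(Z)

def correlation_determinant(X):
    return float(np.linalg.det(correlation_matrix(X)))

rng = np.random.default_rng(2)
n = 500

# Exactly uncorrelated variables: orthonormal columns from centered random data.
raw = rng.normal(size=(n, 3))
orthogonal, _ = np.linalg.qr(raw - raw.mean(axis=0))

# Nearly collinear variables: the second is the first plus a little noise.
x = rng.normal(size=n)
near_collinear = np.column_stack(
    [x, x + 0.05 * rng.normal(size=n), rng.normal(size=n)]
)

D_orthogonal = correlation_determinant(orthogonal)
D_near_collinear = correlation_determinant(near_collinear)

print(D_orthogonal, D_near_collinear)
```

The first determinant should be 1 to rounding error and the second should be close to zero.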

Regrettably, few MR programs calculate D. But if D = 0, then R is singular and has no inverse, and the statistical application will usually warn the user with something like, “Error: calculations cannot be done because of singularity.” D, however, may be close to zero. Then the inverse can be computed, although the result is mush.

Another approach is to use a factor analysis program to get the eigenvalues of R, which are often listed along with the other results. If any are zero or near zero, then this is the same as D being zero or near zero, since D is the product of the eigenvalues. If there is one huge eigenvalue (the average eigenvalue = 1) and the rest are very small by comparison, the variables in X are multicollinear. Note that if the eigenvalues are not given, the percent of variance accounted for by a factor (component) is a function of the eigenvalue. The more one factor accounts for the variance in the data compared to other factors, the greater the multicollinearity. (Warning: if you do a lot of pair-wise removal of missing data, so that the correlations are calculated for variables with different numbers of cases, R can be non-Gramian, thus inflating some eigenvalues of R and even making some negative, frightening mathematicians, and causing distorted regression coefficients.)
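The eigenvalue check can be sketched as follows, again on simulated data in which three variables are near copies of one another (the noise level of 0.1 is an arbitrary assumption). The sketch lists the eigenvalues of R, confirms that they average 1 and multiply out to D, and shows the share of variance taken by each component.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

R = np.corrcoef(X, rowvar=False)

# eigvalsh is meant for symmetric matrices and returns ascending eigenvalues.
eigenvalues = np.linalg.eigvalsh(R)[::-1]  # largest first
D = float(np.linalg.det(R))

# Percent of variance accounted for by each component.
share = eigenvalues / eigenvalues.sum()

print("eigenvalues:", eigenvalues)
print("mean eigenvalue:", eigenvalues.mean())
print("product of eigenvalues:", np.prod(eigenvalues), "D:", D)
print("share of variance:", share)
```

One eigenvalue close to 3 with the other two near zero is the pattern described above.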

Another method is to regress each of the independent variables on all the others (not including y). If the SMC for each regression is small, such as an SMC less than .25 (which means the other independent variables account for less than 25 percent of the variance in the one regressed against them), then multicollinearity is not a problem. If any SMC = 1, or approaches 1, then there is dangerous multicollinearity, and that independent variable should be removed from the regression.
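Here is a minimal sketch of this method on simulated data (the three variables are invented: two that nearly duplicate each other and one unrelated to both). The function name `smc_on_others` is my own for illustration. As a cross-check, the same SMCs can be read off the diagonal of the inverse of R, since each one equals 1 minus the reciprocal of the corresponding diagonal element.

```python
import numpy as np

def smc_on_others(X):
    """SMC of each column of X regressed on all the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        out[j] = 1.0 - resid.var() / target.var()
    return out

rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)  # nearly duplicates x1
x3 = rng.normal(size=n)             # unrelated to the others
X = np.column_stack([x1, x2, x3])

smc_by_regression = smc_on_others(X)

# Cross-check from the inverse of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
smc_from_inverse = 1.0 - 1.0 / np.diag(np.linalg.inv(R))

print(smc_by_regression)
print(smc_from_inverse)
```

In this example the first two variables should show SMCs well above .9, flagging them, while the third stays far below .25.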

A wrong approach is to look at the correlations among independent variables as a measure of multicollinearity. The problem with this is that the correlation between any two variables may be a function of their correlation with the other variables. Thus, the correlation between x and z may be near zero due to their correlations with another variable w. Remove w, and that of x and z may jump.
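A constructed example, different from the x, z, and w case just described but making the same point, shows how far the simple correlations can mislead. Here four simulated variables are unrelated to each other and a fifth is their sum. No pairwise correlation is alarming, yet the fifth variable is perfectly predictable from the other four and D is zero to rounding error.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
parts = rng.normal(size=(n, 4))   # four mutually unrelated variables
total = parts.sum(axis=1)         # a fifth that is exactly their sum
X = np.column_stack([parts, total])

R = np.corrcoef(X, rowvar=False)

# Largest simple correlation between any two different variables.
off_diagonal = R[~np.eye(5, dtype=bool)]
largest_pairwise = float(np.abs(off_diagonal).max())

# Determinant of the correlation matrix.
D = float(np.linalg.det(R))

print("largest pairwise correlation:", largest_pairwise)
print("D:", D)
```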

WHAT TO DO?

If one suspects or finds multicollinearity, what can be done? Transform the independent variables such that they are all uncorrelated, that is so D = 1 for their correlation matrix. Because of partial correlations, simply selecting out variables that have high correlations cannot do this. Rather, one has to take account of all partials and correlations simultaneously. That is, orthogonalize the independent variables.

This can be done by a factor analysis (component analysis) of the independent variables X, which almost all major statistical programs have as an option. This will produce linearly independent factors and factor scores. The factor scores (unrotated or orthogonally rotated) best reflect the variation among the independent variables and have absolutely no multicollinearity. Their correlation matrix will have D = 1. They thus can be used in place of the original independent variables.
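The orthogonalization can be sketched as below. This is an illustration under assumptions, not the procedure of any particular statistical package: it uses unrotated principal components obtained from the eigenvectors of R, keeps all of them, and works on simulated data. In practice one would drop components whose eigenvalues are at or near zero before dividing by their square roots.

```python
import numpy as np

def standardize_columns(X):
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def component_scores(X):
    """Standardized, unrotated principal component scores of the columns of X."""
    Z = standardize_columns(X)
    R = Z.T @ Z / len(Z)
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    # Assumes no eigenvalue is zero; scale each component to unit variance.
    return Z @ eigenvectors / np.sqrt(eigenvalues)

def smc(y, predictors):
    """Squared multiple correlation of y regressed on the predictors."""
    design = np.column_stack([np.ones(len(y)), predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(5)
n = 400
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
x3 = 0.5 * x1 + 0.5 * x2 + 0.7 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 - 0.5 * x2 + 0.3 * x3 + rng.normal(size=n)

scores = component_scores(X)
R_scores = np.corrcoef(scores, rowvar=False)

D_original = float(np.linalg.det(np.corrcoef(X, rowvar=False)))
D_scores = float(np.linalg.det(R_scores))

print("D for original variables:", D_original)
print("D for component scores:", D_scores)
print("SMC with originals:", smc(y, X), "with scores:", smc(y, scores))
```

When every component is kept, the scores span the same space as the original variables, so the overall fit is unchanged; only the coefficients become separately interpretable.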

If one is wary of using factor scores, which may seem to lose touch with the original, nicely interpretable, independent variables, then one can substitute the highest loading variable on each factor instead. There may at most be only slight multicollinearity introduced into the MR as a result.

I’ve tried to keep this short and thus have left out the full regression model and many examples that could be included. However, the whole approach can be seen in the Appendix to my Saving Lives (here). There my research question was: “What best accounts for human security, taking into account democracy, economic freedom, demographics, culture, and so on?” This is an MR question, but for independent and dependent variables that have high multicollinearity (in some cases the SMC of some independent variables regressed on all the others was 1.0, .99, or .98). Therefore, I orthogonalized the data through a series of factor analyses, and then I carried out the regression with the factor scores. Result:

For all nations 1997 to 1998, the human security of their people, their human and economic development, the violence in their lives and the political instability of their institutions, is theoretically and empirically [mainly] dependent on their freedom–their civil rights and political liberties, rule of law, and the accountability of their government. One can well predict a people’s human security by knowing how free they are.

Moreover, just considering the violence, instability, and total deaths a people can suffer, the more freedom they have the less of this they will endure.

### “The Ignorant Freedomist”

Eunomia Blog of Daniel Larison:

What can one say in the face of such foolishness? I have occasionally encountered Mr. Rummel’s ramblings about [“democratic peace”](http://larison.org/archives/000076.php) and [“freedomism”](http://larison.org/archives/000055.php) before, and I have wasted little time on taking them seriously, but the troubling thing is that Mr. Rummel’s bizarre theory readily wins acceptance in conventional thinking. . . . But even a brief, cursory glance at history would tell us this political theory is simply false and has virtually no supporting evidence. . . .

Mr. Rummel’s claim that there have never been wars between democracies make him either an historical ignoramus of the first order or a dishonest hack. I sincerely hope it is the former, as this is at least remediable.

Mr. Rummel’s simplistic theory of “democratic peace” reveals something about democrats and democratists that is not often commented on. There is in this theory the naive faith that there is a type of regime that guarantees an end to war, which is to seek a mechanistic and institutional cure to something that originates in the sinful will of man, man’s boundless acquisitiveness and the finite resources of the world. It is what Voegelin might have called a gnostic faith. . . .

. . . magical thinking . . . .the nonsense. . . . Most other democratists, keenly aware that the “democratic peace” idea is either an embarrassment while there is a democratic war of aggression going on or that it is simply false . . . .

The Democratic Peace and Territorial Conflict in the Twentieth Century By Paul K. Huth and Todd L Allee:

Their statistical results provide strong support for the importance of democratic accountability and norms in shaping decisions to negotiate and settle disputes as well as to threaten force and escalate to war.

The North American Democratic Peace By Stéphane Rousse:

Since the nineteenth century war seems to have been banished as a way to solve conflict between Canada and the United States. Why did this happen and why have the two states developed a relationship of cooperation that is much more “egalitarian” than one would expect, given their very different levels of power.

According to Samuel Huntington, “the democratic peace thesis is one of the most significant propositions to come out of social science in recent decades.” If true, it has crucially important implications for both theory and policy.” My purpose today is to take a glimpse at the theory and, because the democratisation literature is generally weak on Asia, ask whether democratic peace theory has any application to East Asia.

It’s the Democracy, Stupid! By Per Ahlmark.

“The first part of the book consists of a long essay on the miracles of democracy. No democracies have gone to war with one another. Democracies should work together against totalitarian states.

Democratic Peace Bibliography