# An inverse relationship between two variables will yield a


But, of course, any result can still occur by chance. One could on a first try blindly select the one of two white balls out of One could win a lottery with odds of a million to one, also. But if on different observations for different years and nations, we continue to get such a correlation, then our confidence in discarding chance as a possibility increases--our conviction grows that there is some underlying relationship or cause, as we would suspect something other than chance if a person won the Irish sweepstakes three years in a row.

However, assume we had hypothesized that a non-zero correlation exists between trade and defense budget, that we selected nations and observations in a way that did not favor our hypothesis, and that we then computed the correlation of. The probability of getting such a correlation, or higher, by chance is less than one in two hundred if in fact the correlation should be zero.

This suggests that our hypothesis is correct. Correlations among data collected to test previously stated hypotheses always have more power than correlations that are simply assessed in an exploratory way. Babe Ruth's famous home run, slammed over the centerfield he had just pointed to, gave him a stature unattainable by any unpredicted home run.

The data on trade and defense budget on 14 nations could have been collected such that from the correlations inferences about all nations could be made. To do this requires selecting the sample in a random or stratified manner so as to best reflect the population of nations.

For example, such a sample might be collected of students attending the University of Hawaii to determine the correlation between drug use and grades; of Hawaiian residents to assess the correlation between ethnicity and liberalism in Hawaii; of 1, national television viewers to ascertain the correlation between programming and violence.

Now, assume the fourteen nations we have used to assess the correlation between trade and defense budget is a good sample, i. Then what inference about all nations can be made from a correlation of. A useful way of answering this is in terms of an alternative hypothesis.


That is, what is the chance of a plus. The answer to this is given by the probability levels in Table 9. This can be interpreted as follows: The probability is less than. With such a low probability of error, we might confidently reject this hypothesis, and accept that there is a positive correlation between trade and defense budgets for all nations.

In other words, we can infer that our sample results reflect the nature of the population. They are statistically significant.

What if the alternative hypothesis were that a zero correlation exists between the two variables? Then, our concern would be with the probability of getting a plus or minus sample correlation of. This is a "two-tailed" probability in the sense that we are after the chance of a plus or minus correlation. Reference to Table 9. The probability of wrongfully rejecting the hypothesis of a population zero correlation is thus less than one out of a hundred. Therefore, most would feel confident in inferring that a non-zero correlation exists between trade and defense expenditures. One is the likelihood of getting by chance the particular correlation or greater between two sets of magnitudes; the second is the probability of getting a sample correlation by chance from a population.
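The first kind of significance, the chance of getting a correlation this large between two sets of magnitudes, can be illustrated without a probability table. The sketch below swaps in a permutation test, which is not the table-based method the text describes: it shuffles one variable many times and counts how often the shuffled correlation is at least as large as the observed one. The 14 data values, the function names, and the number of shuffles are all assumptions made for illustration. The one-tailed count assumes the observed correlation is positive.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Product moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def permutation_p_values(xs, ys, n_shuffles=10_000, seed=0):
    """Return (observed r, one-tailed share, two-tailed share).

    One-tailed: share of shuffles with r >= observed (a "plus" correlation or higher).
    Two-tailed: share of shuffles with |r| >= |observed| (plus or minus).
    """
    rng = random.Random(seed)
    observed = pearson_r(xs, ys)
    shuffled = list(ys)
    one_tailed = 0
    two_tailed = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        r = pearson_r(xs, shuffled)
        if r >= observed:
            one_tailed += 1
        if abs(r) >= abs(observed):
            two_tailed += 1
    return observed, one_tailed / n_shuffles, two_tailed / n_shuffles


# Hypothetical magnitudes for 14 cases (not real trade or defense data).
x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
y_values = [2.1, 1.8, 3.5, 3.0, 4.9, 4.2, 6.8, 5.5, 7.1, 8.4, 7.7, 9.9, 9.0, 11.2]

r, p_one, p_two = permutation_p_values(x_values, y_values)
print(f"r = {r:.3f}, one-tailed = {p_one:.4f}, two-tailed = {p_two:.4f}")
```

For a positive observed correlation the two-tailed share can never be smaller than the one-tailed share, which mirrors the distinction between the one-tailed and two-tailed probabilities discussed above.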

In either case, the significance of a result increases--the probability of the result being by chance decreases--as the number of cases increases. This can be seen from Table 9. Simply consider the column in the table for the probability of.

### linear regression and correlation coefficients

For an N of 5 a correlation must be as high as. Therefore, very small correlations can be significant at very low probabilities of their being chance results, even though the variance in common is nil. Clearly, one can have significant results statistically, when there is very little variation in common. A high significance does not mean a strong relationship. Even though for 1, cases, Which should one consider, then? Significance or variance in common?
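The contrast between significance and variance in common can be made concrete with a short calculation. The sketch below assumes the standard t transformation for testing a zero product moment correlation, t = r·sqrt((N − 2)/(1 − r²)), which the text does not state explicitly, and uses a hypothetical correlation of 0.1. The function names are illustrative. The variance in common (r²) stays fixed while the test statistic grows with the number of cases.

```python
from math import sqrt


def variance_in_common(r):
    """Proportion of variance shared by the two variables (r squared)."""
    return r * r


def t_statistic(r, n):
    """t statistic for testing a zero correlation; assumes |r| < 1 and n > 2."""
    return r * sqrt((n - 2) / (1 - r * r))


r = 0.1  # hypothetical small correlation
for n in (5, 50, 1000):
    print(f"N = {n:4d}  variance in common = {variance_in_common(r):.2f}  "
          f"t = {t_statistic(r, n):.2f}")
```

A larger t corresponds to a smaller probability of a chance result, so the same weak correlation moves toward significance as N increases even though the shared variance does not change.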

This depends on what one is testing or concerned about. If one wants results from which to make forecasts or predictions, correlations of even. If one's results are to be a base for policy decisions, only a high percent of variance in common may be acceptable. But if one is interested in uncovering relationships, no matter how small, then significance is of concern.

Say we had computed the correlation between trade and defense budget for all nations and found. Could we ask whether this is significant? Yes, when we keep in mind the two types of significance. Clearly, this is not a sample correlation and sample significance is meaningless. But, we can assess the likelihood of this being a chance correlation between the two sets of magnitudes for all nations, as described in Section 9. Fortunately, both types of significance can be assessed using the same probability table, such as Table 9.

As the data depart from these assumptions, the tables of probabilities for the correlation are less applicable. Both types of significance described here assume a normal distribution for both variables, i. When sampling significance is of concern, the observations are assumed drawn from a bivariate normal population. That is, were the frequencies of observation plotted for both variables for the population, then they would be distributed in the shape of a bell placed on the middle of a plane, with the lower flanges widening out and merging into the flat surface.

By virtue of these distributional requirements, the assessment of significance also demands that the data be interval. For data that do not meet this requirement, one must use a different type of correlation coefficient, of which the next chapter will give examples.

Statisticians have formulated a systematic design, called "tests of hypotheses," for making a decision to reject or accept statistical hypotheses. Most elementary statistics textbooks have a chapter or so dealing with this topic. The product moment correlation is the most widely used coefficient and for many scientists the only one.

Indeed, most computer programs computing correlations employ the product moment without so informing their users in the program write-up.

In linear regression, we look for the line that best fits the data. However, while this seems straightforward, we need to figure out what we mean by 'best', and that means we must define what it would be for a line to be good, or for one line to be better than another, etc.

Specifically, we must stipulate a loss function. A loss function gives us a way to say how 'bad' something is, and thus, when we minimize that, we make our line as 'good' as possible, or find the 'best' line. Traditionally, when we conduct a regression analysis, we find estimates of the slope and intercept so as to minimize the sum of squared errors, defined as follows:

$$\mathrm{SSE} = \sum_{i=1}^{N} \bigl(y_i - (\hat\beta_0 + \hat\beta_1 x_i)\bigr)^2$$

In terms of a scatterplot, this minimizes the squared vertical distances between the points and the line. One could instead regress x on y, minimizing the squared horizontal distances. This sounds very similar, but is not quite the same thing. The way to recognize this is to do it both ways, and then algebraically convert one set of parameter estimates into the terms of the other.

Comparing the first model with the rearranged version of the second model, it becomes easy to see that they are not the same. The important thing to realize is that which of your two variables you cast into either role does make a difference in the results.
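The comparison can be sketched in a few lines of Python. The data values and the names `fit_line`, `a_yx`, `b_yx`, `a_xy`, and `b_xy` are made up for illustration. The code fits y on x, fits x on y, and then rearranges the second model into the form y = intercept + slope·x so the two can be compared directly.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data, not taken from the text.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 2.8, 4.6, 4.1, 6.3]

a_yx, b_yx = fit_line(x, y)  # first model:  y = a_yx + b_yx * x
a_xy, b_xy = fit_line(y, x)  # second model: x = a_xy + b_xy * y

# Rearrange the second model into the terms of the first:
# x = a_xy + b_xy * y  =>  y = -a_xy / b_xy + (1 / b_xy) * x
a_rearranged = -a_xy / b_xy
b_rearranged = 1 / b_xy

print(f"y on x:            y = {a_yx:.3f} + {b_yx:.3f} x")
print(f"x on y, rewritten: y = {a_rearranged:.3f} + {b_rearranged:.3f} x")
```

The two slopes coincide only when the points fall exactly on a line. Otherwise the product `b_yx * b_xy` equals the squared correlation, which is less than 1, so the rearranged line is steeper than the direct one.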


The convention that is built into statistical software packages is that x is your independent variable and y the dependent variable. But what is the difference? In a controlled laboratory situation, the independent variable is the one the experimenter controls, and the dependent is the variable of interest that is measured for different values of the independent variable.

If you are looking at solubility as a function of temperature, temperature would be the independent variable and the mass per unit volume of the material dissolved (the solubility) would be the dependent variable. Often there is a stated or unstated assumption that the dependent variable is controlled by the independent one.

So a change in temperature causes solubility to change. A change in solubility is not thought to cause a change in temperature. Note that you could theoretically use solubility to measure temperature, and so the roles of independent vs. dependent variable could be swapped.

Which variable is dependent and which is independent depends on context. Those used to working in the luxury of a controlled laboratory situation may be forgiven for thinking that there is always a clear-cut distinction between dependent and independent variables. The real world is more complex. For example, one can investigate whether there is a statistical relationship between Si and K content in a suite of volcanic rocks. Which one is the dependent and which one is the independent variable?

Typically Si is taken as the x coordinate, but why? Or perhaps one is looking at the relationship between two different contaminants in an aquifer. This is not a controlled situation. Again, context is the guide. Very often it is helpful to assign the variable you want to predict, or know less about, as the dependent variable. For example, if one contaminant was a derivative of the other and an inverse relationship could be expected in any water sample, then the derivative compound would be the dependent variable.

If one contaminant was much easier to measure and, on the basis of a relationship between the two, was to be used as a proxy indicator of the other, then the proxy would be the independent variable. The most important thing is to have thought this out before proceeding with the analysis or presenting your results.

Establishing a mathematical relationship does not mean you have established a causal relationship. The results of a linear regression are often termed the best-fit line. What does this mean? If you imagine a regression line (the plot of a linear equation) and the scatter plot of points that produced it, and then imagine the vertical lines (the y distances) between each point and the regression line, you have one image of goodness of fit.

The smaller those distances, the better the fit. Combine those into an aggregate length (least-squares routines sum the squares of the distances). This length is a measure of how vertically close the y values are to the regression line. In a perfect fit there would be no difference, with the points plotting right on the line, and the aggregate length would be 0. Different regression lines for the same data produce different aggregate lengths. The statistical routine in Excel and other statistical packages computes the line of minimum deviation, of minimum aggregate length: the one that the points are, in aggregate, closest to.
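The idea of a minimum aggregate distance can be checked numerically. The sketch below assumes the aggregate is the sum of squared vertical distances, as in ordinary least squares, and uses made-up data. The names `sum_squared_errors` and `least_squares` are illustrative. It fits the least-squares line and confirms that nudging the intercept or slope in any direction increases the aggregate.

```python
def sum_squared_errors(xs, ys, intercept, slope):
    """Aggregate of squared vertical distances from the points to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form least-squares (intercept, slope) for y given x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 2.8, 4.6, 4.1, 6.3]

intercept, slope = least_squares(x, y)
best = sum_squared_errors(x, y, intercept, slope)

# Try nearby lines: every one of them should fit worse than the best-fit line.
nearby = [
    sum_squared_errors(x, y, intercept + d_int, slope + d_slope)
    for d_int in (-0.5, 0.0, 0.5)
    for d_slope in (-0.2, 0.0, 0.2)
    if (d_int, d_slope) != (0.0, 0.0)
]

print(f"best-fit aggregate: {best:.4f}")
print(f"smallest aggregate among nearby lines: {min(nearby):.4f}")
```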

Which variable is assigned to x and which to y does make a difference. As an experiment, you can put in some real data and run the linear regression both ways (interchanging x and y as independent and dependent variables). The two fitted lines will generally differ.

Why? Simply because in one case you are minimizing the variation for y (the conventional case), and in the other you are minimizing the variation for x. In a regression with y as the dependent variable, given x, the aggregate over all the data points of distance A (the vertical distance from a point to the line) is minimized in the best-fit routine.

If you reverse the roles of x and y, then distance B (the horizontal distance) is minimized instead, and hence you can get a different answer. It is possible to minimize C also.

Natural system feedback loops provide an interesting difficulty because there is not a one-way causal relationship. Instead, both variables are interdependent. Snow cover and local air temperature could be one example. Of course, the air temperature determines how much snow melts or doesn't, but the albedo of the snow also affects local air temperature. Try to think of others. What to do in this situation? There is a type of linear regression that, instead of seeking to minimize the error of line fit only in the y variable, minimizes the error in both x and y.

This is described in your Swan and Sandilands reference, and is referred to as structural regression. As you might guess, it is more difficult, and we won't treat it here. However, you should look into it when the occasion arises. In any case, in your work you should clearly state which is your independent and which is your dependent variable (has this been stressed enough?).

The reader can agree or disagree with your call and take it from there.

This will be a group project. Get into groups of 3. Think of some potential relationship between two earth science variables that could be interesting to explore. It could be, for example, between average annual rainfall and sedimentation rate in a lake. For that relationship, address the following questions:

• Why would you expect a relationship?
• Which one should be the independent and which one should be the dependent variable?
• Where or how might you obtain the needed data?
• What type of relationship might you expect?