
The correlation coefficient is a value that can vary from +1 to -1. With a complete positive correlation the coefficient equals plus 1 (as the value of one variable increases, the value of the other variable increases), and with a complete negative correlation it equals minus 1 (indicating an inverse relationship: as the values of one variable increase, the values of the other decrease).

Ex 1:

A scatterplot of shyness against depression. As you can see, the points (subjects) are not scattered randomly but line up around a single line, and, looking at this line, we can say that the more pronounced a person's shyness, the more depressed the person is; i.e., these phenomena are interconnected.

Ex 2: Graph for shyness and sociability. We see that as shyness increases, sociability decreases. Their correlation coefficient is -0.43. Thus, a correlation coefficient between 0 and 1 indicates a directly proportional relationship (the more ... the more ...), and a coefficient between -1 and 0 indicates an inversely proportional relationship (the more ... the less ...).

If the correlation coefficient is 0, the two variables are uncorrelated: there is no linear relationship between them (which, as discussed below, does not by itself prove full independence).

Correlation is a relationship in which the impact of individual factors appears only as a trend (on average) under mass observation of actual data. Examples of correlation dependence are the dependence between the size of a bank's assets and the amount of the bank's profit, or between the growth of labor productivity and the length of service of employees.

Two systems of classification of correlations according to their strength are used: general and particular.

The general classification of correlations: 1) strong, or close, with a correlation coefficient r > 0.70; 2) medium at 0.50 < r < 0.69; 3) moderate at 0.30 < r < 0.49; 4) weak at 0.20 < r < 0.29; 5) very weak at r < 0.19. The particular classification rests not on the absolute value of the coefficient but on its level of statistical significance; in practice it is usually more valuable to identify a strong correlation, and not just a correlation of a high level of significance.

The following table lists the names of the correlation coefficients for different types of scales.

Scale of variable 1        Scale of variable 2        Coefficient
Dichotomous (1/0)          Dichotomous (1/0)          Pearson's association coefficient; Pearson's four-cell contingency coefficient
Dichotomous (1/0)          Rank (ordinal)             Rank-biserial correlation
Dichotomous (1/0)          Interval or absolute       Biserial correlation
Rank (ordinal)             Rank (ordinal)             Spearman's or Kendall's rank correlation coefficient
Rank (ordinal)             Interval or absolute       The interval values are converted into ranks and a rank coefficient is used
Interval or absolute       Interval or absolute       Pearson correlation coefficient (linear correlation coefficient)

At r=0 there is no linear correlation. In this case, the group means of the variables coincide with their general means, and the regression lines are parallel to the coordinate axes.

Equality r=0 indicates only the absence of a linear correlation dependence (uncorrelated variables), not the absence of any correlation, much less of any statistical dependence.

Sometimes the conclusion that there is no correlation is more important than the presence of a strong correlation. A zero correlation of two variables may indicate that there is no influence of one variable on the other, provided that we trust the results of the measurements.

In SPSS: 11.3.2 Correlation coefficients

Until now, we have established only the very fact of a statistical relationship between two features. Next, we will find out what conclusions can be drawn about the strength or weakness of this dependence, as well as about its form and direction. Criteria for quantifying the relationship between variables are called correlation coefficients or measures of association. Two variables are positively correlated if there is a direct, unidirectional relationship between them: small values of one variable correspond to small values of the other variable, and large values to large ones. Two variables are negatively correlated if there is an inverse, multidirectional relationship between them: small values of one variable correspond to large values of the other variable, and vice versa. The values of correlation coefficients always lie in the range from -1 to +1.

Spearman's coefficient is used as the correlation coefficient between variables belonging to an ordinal scale, while Pearson's correlation coefficient (product-moment correlation) is used for variables belonging to an interval scale. Note that every dichotomous variable, that is, a variable belonging to the nominal scale and having two categories, can be treated as ordinal.

First, we will check if there is a correlation between the sex and psyche variables from the studium.sav file. In doing so, we take into account that the dichotomous variable sex can be considered an ordinal variable. Do the following:

Select from the command menu: Analyze > Descriptive Statistics > Crosstabs... (contingency tables)

· Move the variable sex to a list of rows and the variable psyche to a list of columns.

· Click the Statistics... button. In the Crosstabs: Statistics dialog, check the Correlations box. Confirm your choice with the Continue button.

· In the Crosstabs dialog, suppress the display of the tables themselves by checking the Suppress tables checkbox. Click the OK button.

The Spearman and Pearson correlation coefficients will be calculated, and their significance will be tested:
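(The numeric SPSS output is not reproduced here.) As a rough cross-check outside SPSS, the same pair of coefficients and their significance tests can be computed in Python; this is a sketch with hypothetical data, since the studium.sav file is not available:

    # Pearson and Spearman correlations with p-values, as in the SPSS
    # Crosstabs "Correlations" output; the data below are invented.
    import pandas as pd
    from scipy import stats

    df = pd.DataFrame({
        "sex":    [0, 1, 0, 1, 1, 0, 1, 0, 0, 1],   # dichotomous, treated as ordinal
        "psyche": [1, 3, 2, 4, 3, 1, 4, 2, 2, 3],   # ordinal codes
    })

    r_p, p_p = stats.pearsonr(df["sex"], df["psyche"])
    r_s, p_s = stats.spearmanr(df["sex"], df["psyche"])
    print(f"Pearson  r = {r_p:.3f}, p = {p_p:.3f}")
    print(f"Spearman r = {r_s:.3f}, p = {p_s:.3f}")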


Task number 10 Correlation analysis

The concept of correlation

Correlation, or the correlation coefficient, is a statistical indicator of the probabilistic relationship between two variables measured on quantitative scales. In contrast to a functional connection, in which each value of one variable corresponds to a strictly defined value of the other variable, a probabilistic connection is characterized by the fact that each value of one variable corresponds to a set of values of the other variable. An example of a probabilistic relationship is the relationship between the height and weight of people: people of different weights can have the same height, and vice versa.

The correlation coefficient is a value between -1 and +1 and is denoted by the letter r. If the value is closer to 1 in absolute terms, a strong connection is present; if it is closer to 0, a weak one. A correlation below 0.2 in absolute value is considered weak; above 0.5, high. If the correlation coefficient is negative, there is an inverse relationship: the higher the value of one variable, the lower the value of the other.

Depending on the accepted values ​​of the coefficient r, different types of correlation can be distinguished:

Strict positive correlation corresponds to the value r = 1. The term "strict" means that the value of one variable is uniquely determined by the values of the other variable, and the term "positive" means that as the value of one variable increases, the value of the other variable also increases.

Strict correlation is a mathematical abstraction and almost never occurs in real research.

Positive correlation corresponds to values 0 < r < 1.

Lack of correlation corresponds to the value r = 0. A zero correlation coefficient indicates that the values of the variables are not linearly related to each other.

The absence of correlation is what the null hypothesis in correlation analysis states: H0: r_xy = 0.

Negative correlation corresponds to values -1 < r < 0.

Strict negative correlation corresponds to the value r = -1. Like a strict positive correlation, it is an abstraction and does not occur in practical research.

Table 1. Types of correlation and their definitions

The method of calculating the correlation coefficient depends on the type of scale on which the values ​​of the variable are measured.

The r-Pearson correlation coefficient is the basic one and can be used for variables measured on interval and partially ordered scales, the distribution of values over which corresponds to the normal distribution (product-moment correlation). The Pearson correlation coefficient gives fairly accurate results in cases of moderately non-normal distributions as well.

For distributions that are not normal, it is preferable to use the Spearman and Kendall rank correlation coefficients. They are called rank coefficients because the program pre-ranks the correlated variables.

The SPSS program calculates the r-Spearman correlation as follows: first the variables are converted to ranks, and then the Pearson formula is applied to the ranks.
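That relationship is easy to verify directly; a minimal sketch in Python (the six data points are invented):

    # Spearman's rho equals Pearson's r computed on the ranks of the data.
    import numpy as np
    from scipy import stats

    x = np.array([2.0, 4.5, 3.1, 7.8, 5.2, 6.6])
    y = np.array([1.2, 3.3, 2.8, 6.9, 4.1, 5.5])

    rho_direct, _ = stats.spearmanr(x, y)
    rho_via_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
    print(rho_direct, rho_via_ranks)  # identical up to floating-point error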

The correlation proposed by M. Kendall is based on the idea that the direction of the connection can be judged by comparing subjects in pairs: if for a pair of subjects the change in X coincides in direction with the change in Y, this indicates a positive relationship; if it does not coincide, a negative one. This coefficient is used mainly by psychologists working with small samples. Since sociologists work with large data arrays, it is difficult to sort through all the pairs and identify the difference in the relative frequencies of concordant and inverted pairs in the sample. The most commonly used coefficient is Pearson's.

Since the r-Pearson correlation coefficient is the basic one and can be used (with some error depending on the type of scale and the degree of non-normality of the distribution) for all variables measured on quantitative scales, we will consider examples of its use and compare the results obtained with the results of measurements using other correlation coefficients.

The formula for calculating the r-Pearson coefficient:

r_xy = Σ (x_i - x̄)(y_i - ȳ) / ((N - 1) · σ_x · σ_y)

where: x_i, y_i are the values of the two variables;

x̄, ȳ are the mean values of the two variables;

σ_x, σ_y are the standard deviations;

N is the number of observations.
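A direct transcription of this formula in Python (a sketch; with the sample standard deviations in the denominator it coincides with what numpy.corrcoef returns):

    import numpy as np

    def pearson_r(x, y):
        x, y = np.asarray(x, float), np.asarray(y, float)
        n = len(x)
        sx, sy = x.std(ddof=1), y.std(ddof=1)   # standard deviations with N - 1
        return ((x - x.mean()) * (y - y.mean())).sum() / ((n - 1) * sx * sy)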

Pair correlations

For example, suppose we would like to find out how answers about various traditional values in students' ideas about an ideal place of work relate to one another (variables a9.1, a9.3, a9.5, a9.7), and then do the same for liberal values (a9.2, a9.4, a9.6, a9.8). These variables are measured on 5-point ordinal scales.

We use the procedure Analyze > Correlate > Bivariate (in the translated build: "Analysis", "Correlations", "Paired"). By default, the Pearson coefficient is selected in the dialog box, and we use it.

The tested variables are transferred to the selection window: a9.1, a9.3, a9.5, a9.7

By pressing OK, we get the calculation:

[SPSS output: a 4 x 4 correlation matrix of the variables a9.1.t "How important is it to have enough time for family and personal life?", a9.3.t "How important is it to not be afraid of losing your job?", a9.5.t "How important is it to have a boss who will consult with you when making a decision?", and a9.7.t "How important is it to work in a well-coordinated team and feel part of it?". Each cell reports the Pearson correlation and its 2-sided significance; ** marks correlations significant at the 0.01 level (2-sided). The numeric values of the matrix are not reproduced here.]

Partial correlations:

First, let's compute the pairwise correlation between the two variables c8 and c12:

[SPSS output: the pairwise Pearson correlation between c8 "Feel close to those who live near you, neighbors" and c12 "Feel close to their family", with its 2-sided significance; ** marks significance at the 0.01 level (2-sided). The pairwise coefficient is 0.120 (see below).]

Then we use the partial correlation procedure: Analyze > Correlate > Partial (in the translated build: "Analysis", "Correlations", "Partial").

Suppose that the value "It is important to independently determine and change the order of your work" is, relative to the indicated variables, the decisive factor under whose influence the previously identified relationship will disappear or turn out to be insignificant.

[SPSS output: the partial correlation between c8 "Feel close to those who live near you, neighbors" and c12 "Feel close to their family" with the control variable held constant; the table also lists c16 "Feel close to people who have the same wealth as you". Correlations and 2-sided significance values are shown; the partial coefficient is 0.102 (see below).]

As can be seen from the table, under the influence of the control variable the relationship decreased slightly: from 0.120 to 0.102. It remains sufficiently high and allows one to reject the null hypothesis with practically zero error probability.
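A first-order partial correlation can also be obtained directly from the three pairwise coefficients; a sketch of the standard formula (the variable names are placeholders):

    # Partial correlation of x and y controlling for z, from pairwise Pearson
    # coefficients: r_xy.z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2))
    import math

    def partial_r(r_xy, r_xz, r_yz):
        return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))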

Correlation coefficient

The most accurate way to determine the closeness and nature of a correlation is to find the correlation coefficient. The correlation coefficient is a number determined by formula (32):

r_xy = Σ (x_i - x̄)(y_i - ȳ) / √( Σ (x_i - x̄)² · Σ (y_i - ȳ)² )    (32)

where r_xy is the correlation coefficient;

x_i are the values of the first feature;

y_i are the values of the second feature;

x̄ is the arithmetic mean of the values of the first feature;

ȳ is the arithmetic mean of the values of the second feature.

To use formula (32), we construct a table that provides the necessary sequence of steps in preparing the numbers for the numerator and denominator of the correlation coefficient.

As can be seen from formula (32), the sequence of actions is as follows: we find the arithmetic means of both features, x̄ and ȳ; we find the deviation of each value from its mean, (x_i - x̄) and (y_i - ȳ); then we find their products (x_i - x̄)(y_i - ȳ), whose sum gives the numerator of the correlation coefficient. For the denominator, we square the deviations (x_i - x̄) and (y_i - ȳ), find their sums, and extract the square root of the product of those sums.

Thus, for Example 31, finding the correlation coefficient in accordance with formula (32) can be represented as follows (Table 50).

The resulting number of the correlation coefficient makes it possible to establish the presence, closeness and nature of the relationship.

1. If the correlation coefficient is zero, there is no relationship between the features.

2. If the correlation coefficient is equal to one, the relationship between the features is so great that it turns into a functional one.

3. The absolute value of the correlation coefficient does not go beyond the interval from zero to one: 0 ≤ |r_xy| ≤ 1.

This makes it possible to focus on the tightness of the connection: the closer the coefficient is to zero, the weaker the connection, and the closer to unity, the closer the connection.

4. The sign of the correlation coefficient "plus" means direct correlation, the sign "minus" means the opposite.

Table 50

x_i      y_i      (x_i - x̄)  (y_i - ȳ)  (x_i - x̄)(y_i - ȳ)  (x_i - x̄)²  (y_i - ȳ)²
14.00    12.10    -1.70      -2.30      +3.91               2.89        5.29
14.20    13.80    -1.50      -0.60      +0.90               2.25        0.36
14.90    14.20    -0.80      -0.20      +0.16               0.64        0.04
15.40    13.00    -0.30      -1.40      +0.42               0.09        1.96
16.00    14.60    +0.30      +0.20      +0.06               0.09        0.04
17.20    15.90    +1.50      +1.50      +2.25               2.25        2.25
18.10    17.40    +2.40      +2.00      +4.80               5.76        4.00
Σ 109.80 101.00                         12.50               13.97       13.94


r_xy = 12.50 / √(13.97 · 13.94) = 12.50 / 13.96 ≈ +0.9

Thus, the correlation coefficient calculated in Example 31, r_xy = +0.9, allows us to draw the following conclusions: there is a correlation between the muscle strength of the right and left hands of the schoolchildren studied (the coefficient r_xy = +0.9 differs from zero); the relationship is very close (the coefficient is close to unity); and the correlation is direct (the coefficient is positive), i.e., as the muscle strength of one hand increases, the strength of the other hand increases.
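The hand calculation in Table 50 rounds the means to x̄ = 15.7 and ȳ = 14.4; recomputing from the raw values at full precision gives a close but not identical coefficient. A quick check in Python:

    import numpy as np

    x = np.array([14.0, 14.2, 14.9, 15.4, 16.0, 17.2, 18.1])
    y = np.array([12.1, 13.8, 14.2, 13.0, 14.6, 15.9, 17.4])
    print(np.corrcoef(x, y)[0, 1])  # ~0.9, as in the hand calculation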

When calculating the correlation coefficient and using its properties, it should be taken into account that the conclusions give correct results when the features are normally distributed and when the relationship between a large number of values ​​of both features is considered.

In the considered example 31, only 7 values ​​of both features were analyzed, which, of course, is not enough for such studies. We remind here again that the examples, in this book in general and in this chapter in particular, are in the nature of illustrating methods, and not a detailed presentation of any scientific experiments. As a result, a small number of feature values ​​are considered, measurements are rounded - all this is done in order not to obscure the idea of ​​the method with cumbersome calculations.

Particular attention should be paid to the essence of the relationship under consideration. The correlation coefficient cannot lead to correct research results if the analysis of the relationship between features is carried out formally. Let's return to Example 31: both features considered were values of the muscle strength of the right and left hands. But imagine that by feature x_i in Example 31 (14.0; 14.2; 14.9 ... 18.1) we mean the length of randomly caught fish in centimeters, and by feature y_i (12.1; 13.8; 14.2 ... 17.4) the weight of instruments in the laboratory in kilograms. Formally applying the computational apparatus to find the correlation coefficient and again obtaining r_xy = +0.9, we would have to conclude that there is a close, direct relationship between the length of the fish and the weight of the instruments. The absurdity of such a conclusion is obvious.

To avoid a formal approach to using the correlation coefficient, one should use any other method - mathematical, logical, experimental, theoretical - to identify the possibility of a correlation between signs, that is, to detect the organic unity of signs. Only then can one begin to use correlation analysis and establish the magnitude and nature of the relationship.

In mathematical statistics there is also the concept of multiple correlation: the relationship between three or more features. In these cases, a multiple correlation coefficient is used, composed of the pairwise correlation coefficients described above.

For example, the multiple correlation coefficient of three features x_i, y_i, z_i is:

R_x.yz = √( (r_xy² + r_xz² - 2·r_xy·r_xz·r_yz) / (1 - r_yz²) )

where R_x.yz is the multiple correlation coefficient expressing how feature x_i depends on features y_i and z_i;

r_xy is the correlation coefficient between features x_i and y_i;

r_xz is the correlation coefficient between features x_i and z_i;

r_yz is the correlation coefficient between features y_i and z_i.
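The same formula as a small Python helper (a sketch; the argument names mirror the notation above):

    import math

    def multiple_r(r_xy, r_xz, r_yz):
        # multiple correlation of x on y and z from pairwise coefficients
        return math.sqrt((r_xy**2 + r_xz**2 - 2 * r_xy * r_xz * r_yz)
                         / (1 - r_yz**2))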


Correlation analysis

Correlation is a statistical relationship between two or more random variables (or variables that can be considered as such with some acceptable degree of accuracy), in which changes in one or more of these quantities are accompanied by a systematic change in the other quantity or quantities. The correlation coefficient serves as a mathematical measure of the correlation of two random variables.

Correlation can be positive or negative (and a statistical relationship may also be absent, as for independent random variables). Negative correlation is a correlation in which an increase in one variable is associated with a decrease in another variable, and the correlation coefficient is negative. Positive correlation is a correlation in which an increase in one variable is associated with an increase in another variable, and the correlation coefficient is positive.

Autocorrelation is a statistical relationship between random variables from the same series taken with a shift; for example, for a random process, with a shift in time.

The method of processing statistical data, which consists in studying the coefficients (correlations) between variables, is called correlation analysis.

Correlation coefficient

The correlation coefficient, or pair correlation coefficient, is, in probability theory and statistics, an indicator of the character of the joint change of two random variables. It is denoted by the Latin letter R and can take values between -1 and +1. If its absolute value is closer to 1, a strong connection is present (with a correlation coefficient equal to one in absolute value, one speaks of a functional connection); if it is closer to 0, a weak one.

Pearson correlation coefficient

For metric quantities, the Pearson correlation coefficient is used, the exact formula of which was introduced by Francis Galton.

Let X, Y be two random variables defined on the same probability space. Then their correlation coefficient is given by the formula

R(X, Y) = cov(X, Y) / √( D[X] · D[Y] ),

where cov denotes the covariance and D the variance, or, equivalently,

R(X, Y) = ( M[XY] - M[X]·M[Y] ) / √( (M[X²] - M[X]²) · (M[Y²] - M[Y]²) ),

where the symbol M denotes the mathematical expectation.

To graphically represent such a relationship, you can use a rectangular coordinate system with axes that correspond to both variables. Each pair of values ​​is marked with a specific symbol. Such a plot is called a "scatterplot".

The method of calculating the correlation coefficient depends on the type of scale to which the variables belong. To measure variables on interval and quantitative scales, the Pearson correlation coefficient (product-moment correlation) is used. If at least one of the two variables has an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. In the case when one of the two variables is dichotomous, the point-biserial correlation is used, and if both variables are dichotomous, the four-field correlation. The calculation of the correlation coefficient between two non-dichotomous variables makes sense only if the relationship between them is linear (unidirectional).

Kendall correlation coefficient

Used to measure the degree of mutual disorder (discordance) between two rankings.

Spearman's correlation coefficient

Properties of the correlation coefficient

  • Cauchy-Bunyakovsky inequality: if the covariance cov(X, Y) is taken as the scalar product of two random variables, then the norm of a random variable X equals √D[X], and the Cauchy-Bunyakovsky inequality gives |cov(X, Y)| ≤ √(D[X] · D[Y]), i.e. |R(X, Y)| ≤ 1.
  • |R(X, Y)| = 1 exactly when Y depends on X linearly, Y = kX + b; moreover, in this case the signs of R(X, Y) and k coincide.

Correlation analysis

Correlation analysis is a method of processing statistical data that consists in studying the coefficients (correlations) between variables. In this case, the correlation coefficients between one pair or multiple pairs of features are compared to establish statistical relationships between them.

The goal of correlation analysis is to provide some information about one variable with the help of another variable. In cases where it is possible to achieve this goal, the variables are said to correlate. In the most general form, accepting the hypothesis of the presence of a correlation means that a change in the value of variable A will occur together with a proportional change in the value of B: if both variables increase, the correlation is positive; if one variable increases while the other decreases, the correlation is negative.

Correlation reflects only the linear dependence of quantities, but does not reflect their functional connectedness. For example, if we calculate the correlation coefficient between the quantities A = sin(x) and B = cos(x), it will be close to zero; i.e., there is no linear dependence between the quantities. Meanwhile, A and B are obviously related functionally by the law sin²(x) + cos²(x) = 1.
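This is easy to see numerically; a minimal sketch (x is sampled over one full period):

    import numpy as np

    x = np.linspace(0, 2 * np.pi, 1000)
    a, b = np.sin(x), np.cos(x)
    print(np.corrcoef(a, b)[0, 1])        # ~0: no linear dependence
    print(np.allclose(a**2 + b**2, 1.0))  # True: sin^2(x) + cos^2(x) = 1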

Limitations of correlation analysis



[Figure: plots of distributions of pairs (x, y) with the corresponding correlation coefficient for each. The correlation coefficient reflects a linear relationship (top row), but does not describe its slope (middle row), and is not at all suitable for describing complex, non-linear relationships (bottom row).]
  1. Application is possible only if there is a sufficient number of cases to study: depending on the type of correlation coefficient, the required number ranges from 25 to 100 pairs of observations.
  2. The second limitation follows from the hypothesis of correlation analysis, which assumes a linear dependence of the variables. In many cases, when it is reliably known that a dependence exists, correlation analysis may not give results simply because the dependence is non-linear (expressed, for example, as a parabola).
  3. By itself, the fact of a correlation between two variables gives no grounds to assert which of them precedes or causes the changes, or that the variables are causally related to each other at all; the observed correlation may, for example, be due to the action of a third factor.

Application area

This method of processing statistical data is very popular in economics and social sciences (in particular, in psychology and sociology), although the scope of application of correlation coefficients is extensive: quality control of industrial products, metallurgy, agricultural chemistry, hydrobiology, biometrics, and others.

The popularity of the method is due to two things: correlation coefficients are relatively easy to calculate, and their application requires no special mathematical training. Combined with ease of interpretation, this simplicity of application has led to the coefficient's widespread use in statistical data analysis.

Spurious correlation

The often tempting simplicity of a correlation study encourages the researcher to draw false intuitive conclusions about the presence of a causal relationship between pairs of traits, while the correlation coefficients establish only statistical relationships.

In the modern quantitative methodology of the social sciences there has, in fact, been an abandonment of attempts to establish causal relationships between observed variables by empirical methods. Therefore, when researchers in the social sciences talk about establishing relationships between the variables they study, either a general theoretical assumption or a statistical dependence is implied.

See also

  • Autocorrelation function
  • Cross-correlation function
  • Covariance
  • Determination coefficient
  • Regression analysis


The correlation coefficient reflects the degree of relationship between two indicators. It always takes a value between -1 and 1. If the coefficient is near 0, the variables are said to be unrelated.

If the value is close to one (from 0.9, for example), then there is a strong direct relationship between the observed objects. If the coefficient is close to the other extreme point of the range (-1), then there is a strong inverse relationship between the variables. When the value is somewhere in the middle from 0 to 1 or from 0 to -1, then we are talking about a weak relationship (forward or reverse). This relationship is usually not taken into account: it is considered that it does not exist.

Calculation of the correlation coefficient in Excel

Consider, for example, methods for calculating the correlation coefficient, features of the direct and inverse relationship between variables.

Values ​​of indicators x and y:

Y is the independent variable, x is the dependent variable. It is necessary to find the strength (strong/weak) and the direction (direct/inverse) of the relationship between them. The formula for the correlation coefficient looks like this:

r = Σ (x - x̄)(y - ȳ) / √( Σ (x - x̄)² · Σ (y - ȳ)² )

To simplify its understanding, we will break it down into several simple elements.

There is a strong direct relationship between the variables.

The built-in CORREL function avoids these cumbersome calculations. Let's calculate the pair correlation coefficient in Excel using it: call the Function Wizard, find CORREL, and supply as the function's arguments the array of y values and the array of x values:

Let's show the values ​​of the variables on the chart:


There is a strong relationship between y and x, because the lines run almost parallel to each other. The relationship is direct: as y increases, x increases; as y decreases, x decreases.



Matrix of Pairwise Correlation Coefficients in Excel

The correlation matrix is ​​a table, at the intersection of rows and columns of which there are correlation coefficients between the corresponding values. It makes sense to build it for several variables.

The matrix of correlation coefficients in Excel is built using the "Correlation" tool from the "Data Analysis" package.


A strong direct relationship was found between the values of y and x1. There is a strong inverse relationship between x1 and x2. There is practically no relationship with the values in the x3 column.
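Outside Excel, the same kind of matrix can be produced with pandas; a sketch on synthetic data chosen to reproduce the pattern just described (the column names y, x1, x2, x3 follow the example):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=50)
    df = pd.DataFrame({
        "y":  2 * x1 + rng.normal(scale=0.1, size=50),   # strong direct link with x1
        "x1": x1,
        "x2": -x1 + rng.normal(scale=0.1, size=50),      # strong inverse link with x1
        "x3": rng.normal(size=50),                       # unrelated column
    })
    print(df.corr())  # matrix of pairwise Pearson coefficients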


A task:
There is a related sample of 26 pairs of values ​​(x k ,y k ):

k 1 2 3 4 5 6 7 8 9 10
x k 25.20000 26.40000 26.00000 25.80000 24.90000 25.70000 25.70000 25.70000 26.10000 25.80000
y k 30.80000 29.40000 30.20000 30.50000 31.40000 30.30000 30.40000 30.50000 29.90000 30.40000

k 11 12 13 14 15 16 17 18 19 20
x k 25.90000 26.20000 25.60000 25.40000 26.60000 26.20000 26.00000 22.10000 25.90000 25.80000
y k 30.30000 30.50000 30.60000 31.00000 29.60000 30.40000 30.70000 31.60000 30.50000 30.60000

k 21 22 23 24 25 26
x k 25.90000 26.30000 26.10000 26.00000 26.40000 25.80000
y k 30.70000 30.10000 30.60000 30.50000 30.70000 30.80000

It is required to calculate/build:
- correlation coefficient;
- test the hypothesis of the dependence of random variables X and Y, at a significance level α = 0.05;
- coefficients of the linear regression equation;
- scatter diagram (correlation field) and regression line graph;

SOLUTION:

1. Calculate the correlation coefficient.

The correlation coefficient is an indicator of the mutual probabilistic influence of two random variables. The correlation coefficient R can take values from -1 to +1. If its absolute value is closer to 1, this is evidence of a strong relationship between the quantities; if it is closer to 0, this indicates a weak connection or its absence. If the absolute value of R equals one, we can speak of a functional relationship between the quantities, that is, one quantity can be expressed in terms of the other using a mathematical function.


You can calculate the correlation coefficient through the covariance:

R_x,y = cov(X, Y) / (S_x · S_y)    (1.1)

where cov(X, Y) = (1/n) · Σ_{k=1..n} (x_k - M_x)·(y_k - M_y),

or according to the formula

R_x,y = (M_xy - M_x · M_y) / (S_x · S_y)    (1.4), where:

M_x = (1/n) · Σ_{k=1..n} x_k,   M_y = (1/n) · Σ_{k=1..n} y_k,   M_xy = (1/n) · Σ_{k=1..n} x_k·y_k    (1.5)

S_x² = (1/n) · Σ_{k=1..n} x_k² - M_x²,   S_y² = (1/n) · Σ_{k=1..n} y_k² - M_y²    (1.6)

In practice, formula (1.4) is more often used to calculate the correlation coefficient, since it requires less computation. However, if the covariance cov(X, Y) has been calculated beforehand, it is more advantageous to use formula (1.1), because in addition to the value of the covariance itself, the results of the intermediate calculations can be reused.

1.1. We calculate the correlation coefficient using formula (1.4); for this we calculate the values x_k², y_k², and x_k·y_k and enter them in Table 1.

Table 1


k    x_k    y_k    x_k²    y_k²    x_k·y_k
1 25.2 30.8 635.04000 948.64000 776.16000
2 26.4 29.4 696.96000 864.36000 776.16000
3 26.0 30.2 676.00000 912.04000 785.20000
4 25.8 30.5 665.64000 930.25000 786.90000
5 24.9 31.4 620.01000 985.96000 781.86000
6 25.7 30.3 660.49000 918.09000 778.71000
7 25.7 30.4 660.49000 924.16000 781.28000
8 25.7 30.5 660.49000 930.25000 783.85000
9 26.1 29.9 681.21000 894.01000 780.39000
10 25.8 30.4 665.64000 924.16000 784.32000
11 25.9 30.3 670.81000 918.09000 784.77000
12 26.2 30.5 686.44000 930.25000 799.10000
13 25.6 30.6 655.36000 936.36000 783.36000
14 25.4 31 645.16000 961.00000 787.40000
15 26.6 29.6 707.56000 876.16000 787.36000
16 26.2 30.4 686.44000 924.16000 796.48000
17 26 30.7 676.00000 942.49000 798.20000
18 22.1 31.6 488.41000 998.56000 698.36000
19 25.9 30.5 670.81000 930.25000 789.95000
20 25.8 30.6 665.64000 936.36000 789.48000
21 25.9 30.7 670.81000 942.49000 795.13000
22 26.3 30.1 691.69000 906.01000 791.63000
23 26.1 30.6 681.21000 936.36000 798.66000
24 26 30.5 676.00000 930.25000 793.00000
25 26.4 30.7 696.96000 942.49000 810.48000
26 25.8 30.8 665.64000 948.64000 794.64000


1.2. We calculate M_x by formula (1.5).

1.2.1. We add sequentially all the elements of the sequence x_k:

x_1 + x_2 + ... + x_26 = 25.20000 + 26.40000 + ... + 25.80000 = 669.50000

1.2.2. Divide the resulting sum by the number of sample elements:

669.50000 / 26 = 25.75000

M_x = 25.75000

1.3. Similarly, we calculate M y.

1.3.1. Let's add all the elements in sequence y k

y 1 + y 2 + … + y 26 = 30.80000 + 29.40000 + ... + 30.80000 = 793.000000

1.3.2. Divide the resulting sum by the number of sample elements

793.00000 / 26 = 30.50000

M y = 30.500000

1.4. Similarly, we calculate M xy.

1.4.1. We add sequentially all the elements of the 6th column of table 1

776.16000 + 776.16000 + ... + 794.64000 = 20412.830000

1.4.2. Divide the resulting sum by the number of elements

20412.83000 / 26 = 785.10885

M xy = 785.108846

1.5. Calculate the value of S_x² using formula (1.6).

1.5.1. We add sequentially all the elements of the 4th column of table 1

635.04000 + 696.96000 + ... + 665.64000 = 17256.910000

1.5.2. Divide the resulting sum by the number of elements

17256.91000 / 26 = 663.72731

1.5.3. Subtracting the square of M_x from the last number, we get the value of S_x²:

S_x² = 663.72731 - 25.75000² = 663.72731 - 663.06250 = 0.66481

1.6. Calculate the value of S_y² by formula (1.6).

1.6.1. We add sequentially all the elements of the 5th column of table 1

948.64000 + 864.36000 + ... + 948.64000 = 24191.840000

1.6.2. Divide the resulting sum by the number of elements

24191.84000 / 26 = 930.45538

1.6.3. Subtracting the square of M_y from the last number, we get the value of S_y²:

S_y² = 930.45538 - 30.50000² = 930.45538 - 930.25000 = 0.20538

1.7. Let us calculate the product of S_x² and S_y²:

S_x² · S_y² = 0.66481 · 0.20538 = 0.136541

1.8. Extracting the square root of the last number, we get the value of S_x·S_y:

S_x·S_y = 0.36951

1.9. Calculate the value of the correlation coefficient according to formula (1.4):

R = (785.10885 - 25.75000 · 30.50000) / 0.36951 = (785.10885 - 785.37500) / 0.36951 = -0.72028

ANSWER: Rx,y = -0.720279
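As a sanity check, the same answer can be obtained in one line from the raw data; a sketch in Python with the 26 pairs from the task:

    import numpy as np

    x = np.array([25.2, 26.4, 26.0, 25.8, 24.9, 25.7, 25.7, 25.7, 26.1, 25.8,
                  25.9, 26.2, 25.6, 25.4, 26.6, 26.2, 26.0, 22.1, 25.9, 25.8,
                  25.9, 26.3, 26.1, 26.0, 26.4, 25.8])
    y = np.array([30.8, 29.4, 30.2, 30.5, 31.4, 30.3, 30.4, 30.5, 29.9, 30.4,
                  30.3, 30.5, 30.6, 31.0, 29.6, 30.4, 30.7, 31.6, 30.5, 30.6,
                  30.7, 30.1, 30.6, 30.5, 30.7, 30.8])
    print(np.corrcoef(x, y)[0, 1])  # ~ -0.7203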

2. We check the significance of the correlation coefficient (we check the dependence hypothesis).

Since the estimate of the correlation coefficient is calculated on a finite sample, and therefore may deviate from its general value, it is necessary to check the significance of the correlation coefficient. The check is made using the t-criterion:

t = R_x,y · √(n - 2) / √(1 - R²_x,y)    (2.1)

The statistic t follows Student's t-distribution, and from the table of the t-distribution it is necessary to find the critical value of the criterion (t_cr.α) at the given significance level α. If the absolute value of t calculated by formula (2.1) turns out to be less than t_cr.α, then there is no dependence between the random variables X and Y. Otherwise, the experimental data do not contradict the hypothesis of the dependence of the random variables.


2.1. Calculating the value of the t-criterion by formula (2.1), we get:

t = -0.72028 · √(26 - 2) / √(1 - (-0.72028)²) = -5.08680

2.2. Let us determine the critical value of the parameter t_cr.α from the table of the t-distribution.

The desired value of t_cr.α is located at the intersection of the row corresponding to the number of degrees of freedom and the column corresponding to the given significance level α.
In our case, the number of degrees of freedom is n - 2 = 26 - 2 = 24 and α = 0.05, which corresponds to the critical value of the criterion t_cr.α = 2.064 (see Table 2).

Table 2. t-distribution

Number of degrees of freedom (n - 2) | α = 0.1 | α = 0.05 | α = 0.02 | α = 0.01 | α = 0.002 | α = 0.001
1 6.314 12.706 31.821 63.657 318.31 636.62
2 2.920 4.303 6.965 9.925 22.327 31.598
3 2.353 3.182 4.541 5.841 10.214 12.924
4 2.132 2.776 3.747 4.604 7.173 8.610
5 2.015 2.571 3.365 4.032 5.893 6.869
6 1.943 2.447 3.143 3.707 5.208 5.959
7 1.895 2.365 2.998 3.499 4.785 5.408
8 1.860 2.306 2.896 3.355 4.501 5.041
9 1.833 2.262 2.821 3.250 4.297 4.781
10 1.812 2.228 2.764 3.169 4.144 4.587
11 1.796 2.201 2.718 3.106 4.025 4.437
12 1.782 2.179 2.681 3.055 3.930 4.318
13 1.771 2.160 2.650 3.012 3.852 4.221
14 1.761 2.145 2.624 2.977 3.787 4.140
15 1.753 2.131 2.602 2.947 3.733 4.073
16 1.746 2.120 2.583 2.921 3.686 4.015
17 1.740 2.110 2.567 2.898 3.646 3.965
18 1.734 2.101 2.552 2.878 3.610 3.922
19 1.729 2.093 2.539 2.861 3.579 3.883
20 1.725 2.086 2.528 2.845 3.552 3.850
21 1.721 2.080 2.518 2.831 3.527 3.819
22 1.717 2.074 2.508 2.819 3.505 3.792
23 1.714 2.069 2.500 2.807 3.485 3.767
24 1.711 2.064 2.492 2.797 3.467 3.745
25 1.708 2.060 2.485 2.787 3.450 3.725
26 1.706 2.056 2.479 2.779 3.435 3.707
27 1.703 2.052 2.473 2.771 3.421 3.690
28 1.701 2.048 2.467 2.763 3.408 3.674
29 1.699 2.045 2.462 2.756 3.396 3.659
30 1.697 2.042 2.457 2.750 3.385 3.646
40 1.684 2.021 2.423 2.704 3.307 3.551
60 1.671 2.000 2.390 2.660 3.232 3.460
120 1.658 1.980 2.358 2.617 3.160 3.373
∞ 1.645 1.960 2.326 2.576 3.090 3.291


2.3. Let's compare the absolute value of the t-criterion and t_cr.α.

The absolute value of the t-criterion is not less than the critical value (t = 5.08680 > t_cr.α = 2.064); therefore, with probability 0.95 (1 - α), the experimental data do not contradict the hypothesis of the dependence of the random variables X and Y.
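The same check in Python, assuming the values computed above (scipy supplies the critical value instead of Table 2):

    import math
    from scipy import stats

    r, n = -0.72028, 26
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # formula (2.1): ~ -5.0868
    t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # two-sided, alpha = 0.05: ~2.064
    print(abs(t) > t_crit)  # True: the no-correlation hypothesis is rejected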

3. We calculate the coefficients of the linear regression equation.

The linear regression equation is the equation of a straight line that approximates (approximately describes) the relationship between the random variables X and Y. If we assume that X is the free variable and Y depends on X, then the regression equation is written as follows:

Y = a + b·X    (3.1), where:

b = R_x,y · σ_y / σ_x = R_x,y · S_y / S_x    (3.2)

a = M_y - b · M_x    (3.3)

The coefficient b calculated by formula (3.2) is called the linear regression coefficient; in some sources a is called the constant regression coefficient and b the regression coefficient on the variable.

The prediction errors of Y for a given value of X are calculated by the formulas:

σ_y/x = S_y · √(1 - R²_x,y)    (3.4)

δ_y/x = (σ_y/x / M_y) · 100%    (3.5)

The value σ_y/x (formula 3.4) is also called the residual standard deviation; it characterizes the departure of Y from the regression line described by equation (3.1) at a fixed (given) value of X.

3.1. We calculate the ratio S_y² / S_x² = 0.20538 / 0.66481 = 0.30894.

3.2. Extracting the square root of the last number, we get: S_y / S_x = 0.55582

3.3. Calculate the coefficient b by formula (3.2):

b = -0.72028 · 0.55582 = -0.40035

3.4. Calculate the coefficient a by formula (3.3):

a = 30.50000 - (-0.40035 · 25.75000) = 40.80894

Thus, the fitted regression equation is: Y = 40.80894 - 0.40035·X    (3.6)

3.5. Estimate the errors of the regression equation.

3.5.1. We extract the square root of S_y² and get S_y = 0.45319; then, by formula (3.4),

σ_y/x = 0.45319 · √(1 - (-0.72028)²) = 0.31437

3.5.2. Let us calculate the relative error by formula (3.5):

δ_y/x = (0.31437 / 30.50000) · 100% = 1.03073%
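The whole of step 3 condensed into a few lines of Python, assuming the summary statistics computed earlier:

    import math

    R, Mx, My = -0.72028, 25.75, 30.5
    Sx2, Sy2 = 0.66481, 0.20538

    b = R * math.sqrt(Sy2 / Sx2)                     # -0.40035, formula (3.2)
    a = My - b * Mx                                  # 40.80894, formula (3.3)
    sigma_yx = math.sqrt(Sy2) * math.sqrt(1 - R**2)  # 0.31437, formula (3.4)
    delta = sigma_yx / My * 100                      # ~1.031 %, formula (3.5)
    print(a, b, sigma_yx, delta)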

4. We build a scatterplot (correlation field) and a graph of the regression line.

A scatterplot is a graphic representation of the corresponding pairs (x k , y k ) as points in a plane, in rectangular coordinates with the X and Y axes. The correlation field is one of the graphical representations of a linked (paired) sample. In the same coordinate system, the graph of the regression line is also plotted. The scales and starting points on the axes should be chosen carefully so that the diagram is as clear as possible.

4.1. We find the minimum and maximum elements of the sample X: these are the 18th and 15th elements, respectively, x_min = 22.10000 and x_max = 26.60000.

4.2. We find the minimum and maximum elements of the sample Y: these are the 2nd and 18th elements, respectively, y_min = 29.40000 and y_max = 31.60000.

4.3. On the abscissa axis, we select the starting point just to the left of the point x_18 = 22.10000, and a scale such that the point x_15 = 26.60000 fits on the axis and the other points are clearly distinguishable.

4.4. On the ordinate axis, we select the starting point just below the point y_2 = 29.40000, and a scale such that the point y_18 = 31.60000 fits on the axis and the other points are clearly distinguishable.

4.5. On the abscissa axis we place the values ​​x k , and on the ordinate axis we place the values ​​y k .

4.6. We put points (x 1, y 1), (x 2, y 2), ..., (x 26, y 26) on the coordinate plane. We get a scatterplot (correlation field), shown in the figure below.

4.7. Let's draw a regression line.

To do this, we find two different points with coordinates (x_r1, y_r1) and (x_r2, y_r2) satisfying equation (3.6), plot them on the coordinate plane, and draw a line through them. Let's take x_min = 22.10000 as the abscissa of the first point; substituting x_min into equation (3.6), we get the ordinate of the first point. Thus, we have a point with coordinates (22.10000, 31.96127). Similarly, we obtain the coordinates of the second point by setting x_max = 26.60000 as the abscissa: the second point is (26.60000, 30.15970).

The regression line is shown in the figure below in red

Please note that the regression line always passes through the point of the average values ​​of X and Y, i.e. with coordinates (M x , M y).
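A sketch of the same plot in matplotlib, reusing the task data and the fitted coefficients:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([25.2, 26.4, 26.0, 25.8, 24.9, 25.7, 25.7, 25.7, 26.1, 25.8,
                  25.9, 26.2, 25.6, 25.4, 26.6, 26.2, 26.0, 22.1, 25.9, 25.8,
                  25.9, 26.3, 26.1, 26.0, 26.4, 25.8])
    y = np.array([30.8, 29.4, 30.2, 30.5, 31.4, 30.3, 30.4, 30.5, 29.9, 30.4,
                  30.3, 30.5, 30.6, 31.0, 29.6, 30.4, 30.7, 31.6, 30.5, 30.6,
                  30.7, 30.1, 30.6, 30.5, 30.7, 30.8])
    a, b = 40.80894, -0.40035                 # coefficients from step 3

    plt.scatter(x, y, label="observations")   # the correlation field
    xs = np.array([x.min(), x.max()])
    plt.plot(xs, a + b * xs, color="red", label="regression line")
    plt.xlabel("x"); plt.ylabel("y"); plt.legend()
    plt.show()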

COURSE WORK

Topic: Correlation analysis

Introduction

1. Correlation analysis

1.1 The concept of correlation

1.2 General classification of correlations

1.3 Correlation fields and the purpose of their construction

1.4 Stages of correlation analysis

1.5 Correlation coefficients

1.6 Normalized Bravais-Pearson correlation coefficient

1.7 Spearman's rank correlation coefficient

1.8 Basic properties of correlation coefficients

1.9 Checking the significance of correlation coefficients

1.10 Critical values ​​of the pair correlation coefficient

2. Planning a multivariate experiment

2.1 Condition of the problem

2.2 Determination of the center of the plan (main level) and the level of variation of factors

2.3 Building a planning matrix

2.4 Checking the homogeneity of the dispersion and the equal accuracy of measurements in different series

2.5 Coefficients of the regression equation

2.6 Reproducibility dispersion

2.7 Checking the significance of the coefficients of the regression equation

2.8 Checking the adequacy of the regression equation

Conclusion

Bibliography

INTRODUCTION

Experiment planning is a mathematical-statistical discipline that studies the methods of rational organization of experimental research - from the optimal choice of the factors under study and the determination of the actual plan of the experiment in accordance with its purpose to methods for analyzing the results. The beginning of experiment planning was laid by the works of the English statistician R. Fisher (1935), who emphasized that rational experiment planning gives no less significant gain in the accuracy of estimates than optimal processing of measurement results. In the 60s of the 20th century, a modern theory of experiment planning emerged. Its methods are closely related to the theory of approximation of functions and mathematical programming. Optimal plans are constructed and their properties are investigated for a wide class of models.

Experiment planning - the choice of an experiment plan that meets the specified requirements, a set of actions aimed at developing an experimentation strategy (from obtaining a priori information to obtaining a workable mathematical model or determining optimal conditions). This is a purposeful control of the experiment, implemented in conditions of incomplete knowledge of the mechanism of the phenomenon under study.

In the process of measurements, subsequent data processing, as well as formalization of the results in the form of a mathematical model, errors occur and part of the information contained in the original data is lost. The use of experiment planning methods makes it possible to determine the error of the mathematical model and judge its adequacy. If the accuracy of the model is insufficient, then the use of experiment planning methods makes it possible to modernize the mathematical model with additional experiments without losing previous information and at minimal cost.

The purpose of experiment planning is to find such conditions and rules for conducting experiments under which it is possible to obtain reliable and reliable information about the object with the least labor costs, and also to present this information in a compact and convenient form with quantification accuracy.

Among the main planning methods used at different stages of research are:

Planning a screening experiment, the main meaning of which is the selection of a group of significant factors from the totality of factors that are subject to further detailed study;

Designing an experiment for analysis of variance, i.e. drawing up plans for objects with qualitative factors;

Planning a regression experiment that allows you to obtain regression models (polynomial and others);

Planning an extreme experiment, in which the main task is the experimental optimization of the object of study;

Planning in the study of dynamic processes, etc.

The purpose of studying the discipline is to prepare students for production and technical activities in the specialty using the methods of planning theory and modern information technologies.

Objectives of the discipline: to study modern methods of planning, organizing, and optimizing scientific and industrial experiments, of conducting experiments, and of processing the results.

1. CORRELATION ANALYSIS

1.1 The concept of correlation

The researcher is often interested in how two or more variables are related to each other in one or more of the studied samples. For example, can height affect a person's weight, or can pressure affect product quality?

This kind of relationship between variables is called a correlation, or a correlational relationship. A correlation is a consistent change in two features, reflecting the fact that the variability of one feature is in line with the variability of the other.

It is known, for example, that on average there is a positive relationship between people's height and their weight, such that the greater the height, the greater the weight. However, there are exceptions, when relatively short people are overweight and, conversely, asthenic people of tall stature are light. The reason for such exceptions is that each biological, physiological, or psychological trait is determined by the influence of many factors: environmental, genetic, social, ecological, and so on.

Correlations are probabilistic changes that can only be studied on representative samples by the methods of mathematical statistics. Both terms, correlation and correlation dependence, are often used interchangeably. Dependence implies influence; connection means any coordinated changes, which can be explained by hundreds of reasons. Correlations cannot be taken as evidence of a causal relationship; they only indicate that changes in one feature are, as a rule, accompanied by certain changes in another.

Correlation dependence means that the values of one feature change the probability of occurrence of different values of another feature.

The task of correlation analysis is reduced to establishing the direction (positive or negative) and the form (linear, non-linear) of the relationship between varying features, measuring its tightness, and, finally, checking the significance level of the obtained correlation coefficients.

Correlations differ in form, direction, and degree (strength).

The shape of the correlation can be rectilinear or curvilinear. For example, the relationship between the number of training sessions on a simulator and the number of correctly solved problems in the control session can be rectilinear. The relationship between the level of motivation and the effectiveness of task performance, for example, can be curvilinear (Figure 1): as motivation increases, the effectiveness of the task first increases, then the optimal level of motivation is reached, which corresponds to the maximum effectiveness; a further increase in motivation is accompanied by a decrease in effectiveness.

Figure 1 - The relationship between the effectiveness of problem solving and the strength of the motivational tendency

In direction, the correlation can be positive ("direct") and negative ("reverse"). With a positive straight-line correlation, higher values ​​of one attribute correspond to higher values ​​of another, and lower values ​​of one attribute correspond to low values ​​of another (Figure 2). With a negative correlation, the ratios are reversed (Figure 3). With a positive correlation, the correlation coefficient has a positive sign, with a negative correlation - a negative sign.

Figure 2 - Direct correlation

Figure 3 - Inverse correlation


Figure 4 - No correlation

The degree, strength or tightness of the correlation is determined by the value of the correlation coefficient. The strength of the connection does not depend on its direction and is determined by the absolute value of the correlation coefficient.

1.2 General classification of correlations

Depending on the correlation coefficient, the following correlations are distinguished (a small code sketch follows the list):

Strong, or close, with correlation coefficient r > 0.70;

Medium (at 0.50 < r < 0.69);

Moderate (at 0.30 < r < 0.49);

Weak (at 0.20 < r < 0.29);

Very weak (at r < 0.19).
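A small Python helper encoding this classification of |r| (boundary values are assigned to the stronger band, an assumption since the source leaves the band boundaries unstated):

    def strength(r: float) -> str:
        a = abs(r)
        if a >= 0.70:
            return "strong"
        if a >= 0.50:
            return "medium"
        if a >= 0.30:
            return "moderate"
        if a >= 0.20:
            return "weak"
        return "very weak"

    print(strength(-0.43))  # "moderate", as in the shyness/sociability example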

1.3 Correlation fields and the purpose of their construction

Correlation is studied on the basis of experimental data, which are the measured values ​​(x i , y i) of two features. If there is little experimental data, then the two-dimensional empirical distribution is represented as a double series of x i and y i values. In this case, the correlation between features can be described in different ways. The correspondence between an argument and a function can be given by a table, formula, graph, etc.

Correlation analysis, like other statistical methods, is based on the use of probabilistic models that describe the behavior of the studied features in a certain general population, from which the experimental values ​​x i and y i are obtained. When the correlation between quantitative characteristics, the values ​​of which can be accurately measured in units of metric scales (meters, seconds, kilograms, etc.), is investigated, the model of a two-dimensional normally distributed general population is very often adopted. Such a model displays the relationship between the variables x i and y i graphically as a locus of points in a rectangular coordinate system. This graphical dependence is also called a scatterplot or correlation field.
This model of a two-dimensional normal distribution (correlation field) allows a visual graphical interpretation of the correlation coefficient, because the joint distribution depends on five parameters: μ_x, μ_y, the mean values (mathematical expectations); σ_x, σ_y, the standard deviations of the random variables X and Y; and ρ, the correlation coefficient, which is a measure of the relationship between the random variables X and Y.
If ρ = 0, then the values x_i, y_i obtained from a two-dimensional normal population lie on the graph in x, y coordinates within an area bounded by a circle (Figure 5, a). In this case, there is no correlation between the random variables X and Y, and they are called uncorrelated. For a two-dimensional normal distribution, uncorrelatedness simultaneously means independence of the random variables X and Y.