Preamble
Correlation analysis is a classic building block of many algorithms and modeling tasks. It can express relationships and trends between features. There are three common correlation coefficients: the Pearson correlation coefficient, the Spearman correlation coefficient, and Kendall's tau-b rank correlation coefficient. Each has its own uses and scenarios. I will cover the principles, algorithms, and code for all three in my column. The mathematical modeling column already has seven or eight articles on traditional machine learning prediction algorithms, dimensionality reduction algorithms, time series forecasting algorithms, and weighting algorithms; interested readers are welcome to take a look.
I. Definitions
Definition of the Kendall coefficient: take n statistical objects of the same kind and order them by one attribute; their order by a second attribute will usually differ. The number of concordant (same-order) pairs minus the number of discordant (opposite-order) pairs, divided by the total number of pairs, n*(n-1)/2, is defined as the Kendall coefficient.
Like Spearman's rank correlation, Kendall's correlation is a rank correlation coefficient: it assesses the strength and direction of the association between two random variables based on the ranks of the data. The variables analyzed should be ordered categorical variables, such as rank, age group, or obesity class (severely obese, moderately obese, mildly obese, not obese).
The difference is that Spearman's correlation assesses the association from rank differences (for example, if Ming's history score ranks 10th in the class and his English score ranks 4th, then in a Spearman analysis of the class's history and English scores Ming contributes a rank difference of 10 - 4 = 6), while Kendall's correlation assesses it from the relationships between pairs of observations, each of which is classified as a concordant pair or a discordant pair.
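Written as a formula, the definition above reads as follows. The symbols P (number of concordant pairs) and Q (number of discordant pairs) are the ones used in the worked example later in this section; the symbol \tau for the coefficient is a notational assumption.

```latex
\tau = \frac{P - Q}{\tfrac{1}{2}\, n (n - 1)}
```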
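To make the idea of concordant and discordant pairs concrete, here is a minimal sketch of how a single pair of observations could be classified. The function name `classify_pair` and the classmates' ranks are hypothetical and used only for illustration; only Ming's ranks (10, 4) come from the example above.

```python
def classify_pair(a, b):
    """Classify two observations a = (x_a, y_a) and b = (x_b, y_b).

    The pair is concordant when both attributes order the two observations
    the same way, discordant when they order them in opposite ways, and
    tied when the observations share a value on at least one attribute.
    """
    product = (a[0] - b[0]) * (a[1] - b[1])
    if product > 0:
        return "concordant"
    if product < 0:
        return "discordant"
    return "tied"


ming = (10, 4)          # (history rank, English rank) from the text
classmate_1 = (7, 2)    # hypothetical: ahead of Ming in both subjects
classmate_2 = (12, 1)   # hypothetical: behind in history, ahead in English

print(classify_pair(ming, classmate_1))  # concordant
print(classify_pair(ming, classmate_2))  # discordant
```

Kendall's coefficient only needs this sign for every pair, whereas Spearman's coefficient uses the size of each rank difference.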
The Kendall correlation coefficient is calculated as follows.
Suppose we rank a group of 8 people by height and by weight, where person A is the tallest and the third heaviest, and so on (these are the same ranks used as dat1 and dat2 in the code in Section III):
Person:          A  B  C  D  E  F  G  H
Rank by height:  1  2  3  4  5  6  7  8
Rank by weight:  3  4  1  2  5  7  8  6
Note that A is the tallest but has a weight rank of 3, so A is heavier than the people with weight ranks 4, 5, 6, 7, and 8 and contributes 5 concordant pairs, i.e., AB, AE, AF, AG, and AH. Similarly, B, C, D, E, F, G, and H contribute 4, 5, 4, 3, 1, 0, and 0 concordant pairs, respectively. The number of concordant pairs is therefore
P = 5 + 4 + 5 + 4 + 3 + 1 + 0 + 0 = 22.
The number of discordant pairs is Q = 28 - 22 = 6 (the total number of pairs, 8*7/2 = 28, minus the concordant pairs).
Thus tau = (22 - 6)/28 ≈ 0.57. This result indicates fairly strong agreement between the two rankings, as expected. The association between two rankings can therefore be measured objectively with the Kendall tau coefficient, which has the following properties:
- If the agreement between the two rankings is perfect (i.e., the two rankings are the same), the value of the coefficient is one.
- If the divergence between the two rankings is perfect (i.e., one ranking is opposite to the other), the coefficient has the value -1.
- If X and Y are independent, then we expect the coefficient to be approximately zero.
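The worked example and the first two properties above can be checked with a short brute-force sketch. It assumes there are no tied ranks and takes the height and weight ranks from the `dat1` and `dat2` arrays in the code in Section III; the function name `kendall_tau_a` is an illustrative choice, not a library function.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Return (P, Q, tau) by checking every pair; assumes no tied ranks."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return concordant, discordant, (concordant - discordant) / total_pairs


height_rank = [1, 2, 3, 4, 5, 6, 7, 8]
weight_rank = [3, 4, 1, 2, 5, 7, 8, 6]

p, q, tau = kendall_tau_a(height_rank, weight_rank)
print(p, q, round(tau, 2))  # 22 6 0.57

# Identical rankings give 1, fully reversed rankings give -1.
print(kendall_tau_a(height_rank, height_rank)[2])        # 1.0
print(kendall_tau_a(height_rank, height_rank[::-1])[2])  # -1.0
```

Checking every pair costs O(n^2) comparisons, which is fine for small samples like this one.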
II. Conditions of use
Before applying Kendall's correlation analysis, first check that the data satisfy the following basic assumptions; the results of the analysis are only valid when they are met.
- Variable data is either ordinal or continuous. Ordinal scales are often used to measure non-numerical concepts in numerical terms, such as satisfaction, happiness, and so on, as well as for things like rankings in grades and competitions. Continuous scales are self-explanatory: temperature, weight, income, etc. are all (either strictly or approximately) continuous scales.
- The data of the two variables should follow a monotonic relationship. In short, if the other variable tends to increase as one variable increases, the relationship is positive; if the other tends to decrease, the relationship is negative. Of course, this monotonic relationship is a statistical one, or a trend, rather than strictly monotonic. This is shown below. Both the left and center graphs show an approximately monotonic relationship, while the right graph does not, because the left and right halves of the right graph have opposite trends.
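Why the monotonic assumption matters can be illustrated with a small sketch. The data here are synthetic and chosen purely for illustration: a cubic curve (monotonic) and a parabola (two halves with opposite trends). The helper `tau_a` is a hypothetical brute-force implementation that scores tied pairs as zero.

```python
from itertools import combinations


def tau_a(x, y):
    """Concordant minus discordant pairs, divided by the number of pairs."""
    pairs = list(combinations(zip(x, y), 2))
    score = 0
    for (x1, y1), (x2, y2) in pairs:
        sign = (x1 - x2) * (y1 - y2)
        score += (sign > 0) - (sign < 0)  # +1, -1, or 0 for a tied pair
    return score / len(pairs)


x = list(range(-5, 6))
monotonic = [v ** 3 for v in x]  # always increasing with x
u_shaped = [v ** 2 for v in x]   # decreasing, then increasing

print(tau_a(x, monotonic))  # 1.0
print(tau_a(x, u_shaped))   # 0.0
```

The parabola is a perfectly deterministic relationship, yet its coefficient is zero because the concordant pairs in one half cancel the discordant pairs in the other. A coefficient near zero therefore does not by itself show that two variables are unrelated.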
III. Calculation formula and code examples
There are two formulas for calculating Kendall's coefficient: one is called Tau-a and the other Tau-b. Tau-a is the plain definition from Section I; the difference between the two is that Tau-b can also handle identical values, i.e., tied ranks.
Tau-a
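    # Kendall's tau with SciPy. kendalltau computes Tau-b by default,
    # which equals Tau-a when the data contain no ties, as is the case here.
    from scipy.stats import kendalltau
    import numpy as np
    import matplotlib.pyplot as plt

    # Height and weight ranks of the 8 people from Section I
    dat1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
    dat2 = np.array([3, 4, 1, 2, 5, 7, 8, 6])

    # Scatter plot of one ranking against the other
    fig, ax = plt.subplots()
    ax.scatter(dat1, dat2)

    # Returns the coefficient (about 0.571 here) and a p-value
    kendalltau(dat1, dat2)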
Tau-b
The Tau-a calculation above assumes that there are no tied ranks in the raw data. When tied ranks do exist, the Tau-b formula, which corrects the denominator for ties, gives a more accurate analysis.
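As a rough sketch of the tie correction, the standard Tau-b formula is tau_b = (P - Q) / sqrt((n0 - n1) * (n0 - n2)), where n0 = n*(n-1)/2, n1 is the number of pairs tied on the first variable, and n2 is the number of pairs tied on the second. The function name `kendall_tau_b` and the sample data below are hypothetical, and the sketch assumes that neither variable is constant (otherwise the denominator is zero).

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Brute-force Tau-b; assumes neither x nor y is constant."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    n0 = n * (n - 1) // 2
    # A group of t equal values produces t*(t-1)/2 tied pairs.
    n1 = sum(t * (t - 1) // 2 for t in Counter(x).values())
    n2 = sum(t * (t - 1) // 2 for t in Counter(y).values())
    return (concordant - discordant) / sqrt((n0 - n1) * (n0 - n2))


x = [1, 2, 3, 4, 5]
y = [1, 2, 2, 3, 4]  # hypothetical data with one tie in y

print(round(kendall_tau_b(x, y), 3))  # 0.949
```

For the same data the uncorrected ratio (P - Q) / n0 is 9/10 = 0.9; the tie correction shrinks the denominator, so Tau-b comes out slightly larger. Without ties, n1 = n2 = 0 and the two formulas coincide.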
The calling code is the same as in the Tau-a example; only the underlying formula differs, so I will not expand on the specifics here. For more on Kendall correlation in Python, please see my other related articles!