In newer versions of pandas, the qcut() function has a duplicates parameter, which solves the problem of too many duplicate values causing errors in equal-frequency binning.
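As a minimal sketch of that behaviour (the sample data is made up for illustration), the default setting duplicates='raise' makes qcut() fail with a ValueError when several quantile edges coincide, while duplicates='drop' lets the call go through:

```python
import pandas as pd

# Made-up sample: six of the ten values are identical, so the 0%, 25% and 50%
# quantiles are all 0 and the computed bin edges are not unique.
s = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

try:
    pd.qcut(s, 4)  # default is duplicates='raise'
    raised = False
except ValueError as err:
    raised = True
    print("qcut failed:", err)

# With duplicates='drop' the repeated edges are merged and the call succeeds.
binned = pd.qcut(s, 4, duplicates='drop')
print(binned.value_counts())
```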
For older versions of pandas, the following function offers a workaround:
import pandas as pd

def pct_rank_qcut(series, n):
    '''
    series: column to be binned
    n: number of bins
    '''
    # Bin edges as fractions: 0, 1/n, ..., 1
    edges = pd.Series([i / n for i in range(n + 1)])
    # (edges >= x) is a False/True sequence; argmax() returns the position of the first True
    func = lambda x: (edges >= x).argmax()
    # rank() can give problematic results when the data type is object, so cast to float first.
    # rank(pct=True) gives the percentile rank of each value; the result is its bin number (1 to n).
    return series.astype(float).rank(pct=True).apply(func)
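To make the rank-based idea concrete, here is a small worked sketch with made-up data. It assumes the edges run 0, 1/n, ..., 1, so that bins are numbered 1 to n, and it uses a NumPy array for the edges; both choices are assumptions of this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 20, 20, 30, 40, 50, 60])  # made-up sample with ties
n = 4                                            # number of bins

# Percentile rank of each value; tied values share the average rank.
pct = s.rank(pct=True)

# Bin edges as fractions: 0, 0.25, 0.5, 0.75, 1.0
edges = np.array([i / n for i in range(n + 1)])

# For each percentile rank, take the position of the first edge that is >= it.
bins = pct.apply(lambda x: int((edges >= x).argmax()))

print(pd.DataFrame({"value": s, "pct_rank": pct, "bin": bins}))
```

The three tied values of 20 share one percentile rank (0.375) and therefore land in the same bin, so the bins are only approximately equal in size when ties are present.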
Supplementary material: Python data discretization (equal-width and equal-frequency)
When processing data, we often need to discretize continuous variables. The most common approaches are equal-width discretization and equal-frequency discretization. The concepts themselves are not discussed here; only Python implementations are given for reference.
1. Equal-width discretization
Use the cut() function in pandas to split each feature into bins:
import numpy as np
import pandas as pd

# Discretization: equal width
# Datas: samples * features (2-D NumPy array)
def Discretization_EqualWidth(K, Datas, FeatureNumber):
    DisDatas = np.zeros_like(Datas)
    for i in range(FeatureNumber):
        # Split feature i into K equal-width bins labeled 1..K
        DisOneFeature = pd.cut(Datas[:, i], K, labels=range(1, K + 1))
        DisDatas[:, i] = DisOneFeature
    return DisDatas
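A short usage sketch of cut() on a single made-up feature may help show what equal width means in practice. The values and K below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 1.0, 2.0, 3.0, 10.0])  # made-up feature with one large value
K = 2                                          # number of bins

# cut() divides the range [min, max] into K intervals of equal width,
# here roughly [0, 5] and (5, 10], and labels them 1..K.
codes = np.asarray(pd.cut(values, K, labels=list(range(1, K + 1))))
print(codes)
```

Four of the five values fall into the first bin. Equal-width bins can be very uneven in size, which is the situation equal-frequency discretization is meant to address.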
2. Equal-frequency discretization
pandas provides qcut(), but its bin boundaries are prone to duplicate values. If duplicates='drop' is set to remove the duplicated boundaries, the number of bins can easily end up smaller than the number requested, so qcut() is not used here.
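The shortfall in the number of bins can be reproduced with a few lines. This is only a sketch; the sample data and K are made up.

```python
import pandas as pd

K = 5  # requested number of bins
# Made-up sample dominated by a single repeated value.
s = pd.Series([1] * 8 + [2, 3, 4, 5])

binned = pd.qcut(s, K, duplicates='drop')
n_bins = len(binned.cat.categories)
print(f"requested {K} bins, got {n_bins}")
```

Because most quantile edges collapse onto the repeated value, dropping the duplicates leaves fewer intervals than requested.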
import numpy as np
import pandas as pd

# Discretization: equal frequency
# vector: a single feature (pandas Series)
def Rank_qcut(vector, K):
    quantile = np.array([float(i) / K for i in range(K + 1)])  # quantile edges: K+1 values
    # Position of the first edge that is >= the percentile rank, i.e. the bin number (1..K)
    funBounder = lambda x: (quantile >= x).argmax()
    return vector.rank(pct=True).apply(funBounder)

# Discretization: equal frequency
# Datas: samples * features (2-D NumPy array)
def Discretization_EqualFrequency(K, Datas, FeatureNumber):
    DisDatas = np.zeros_like(Datas)
    for i in range(FeatureNumber):
        DisOneFeature = Rank_qcut(pd.Series(Datas[:, i]), K)
        DisDatas[:, i] = DisOneFeature
    return DisDatas
That is everything I have to share on working around the qcut problem in Python-based equal-frequency binning. I hope it serves as a useful reference, and thank you for your support.