SoFunction
Updated on 2024-10-30

A Python solution to the equal-frequency binning (qcut) problem

In newer versions of pandas, the pd.qcut() function has a duplicates parameter, which solves the problem of too many duplicate values causing errors in equal-frequency binning.
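As a minimal sketch of that behaviour, the snippet below uses a made-up Series (an assumption for illustration) in which most values are identical, so several quantile edges coincide. It only shows that the plain call fails and that the call with duplicates="drop" succeeds.

```python
import pandas as pd

# Illustrative data (an assumption): six identical values make
# several quantile edges equal to 0.
s = pd.Series([0, 0, 0, 0, 0, 0, 1, 2, 3, 4])

# Without `duplicates`, pandas rejects the repeated bin edges.
try:
    pd.qcut(s, 4)
    raised = False
except ValueError:
    raised = True

# With duplicates="drop", repeated edges are merged and the call succeeds.
binned = pd.qcut(s, 4, duplicates="drop")

print("plain qcut raised ValueError:", raised)
print(binned.value_counts().sort_index())
```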

For older versions of pandas, here is a workaround:

import pandas as pd

def pct_rank_qcut(series, n):
  '''
  series: the column (pandas Series) to bin
  n: number of bins
  '''
  # Bin edges as percentages: 0, 1/n, ..., 1. The upper edge 1 is included
  # so that the highest ranks fall into bin n instead of wrapping to 0.
  edges = pd.Series([i/n for i in range(n + 1)])
  # (edges >= x) is a False/True series; argmax() returns the position of the first True
  func = lambda x: (edges >= x).argmax()
  # rank() can give problematic results when the dtype is object, so cast to float first.
  # rank(pct=1) gives the percentile rank of each value; the result is its bin number (1..n).
  return series.astype(float).rank(pct=1).apply(func)
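To make the rank-to-bin mapping concrete, here is a small worked sketch. The toy Series and n = 4 are assumptions chosen so that every percentile rank is an exact fraction. It uses n + 1 edges (0 through 1), so bin numbers run from 1 to n.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption): three tied values followed by distinct ones.
s = pd.Series([10, 10, 10, 20, 30, 40, 50, 60])
n = 4  # number of bins

# Percentile rank in (0, 1]; tied values share their average rank.
pct = s.rank(pct=True)

# Edges 0, 0.25, 0.5, 0.75, 1.0
edges = np.arange(n + 1) / n

# Bin number = position of the first edge that is >= the percentile rank.
bins = pct.apply(lambda x: int((edges >= x).argmax()))

print(pd.DataFrame({"value": s, "pct": pct, "bin": bins}))
```

The three tied values all receive the same percentile rank (0.25) and therefore land in the same bin. Ties are never split, so bin sizes are only approximately equal when the data contains many duplicates.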

Supplementary extension: Python data discretization: equal width and equal frequency

When processing data, we often need to discretize continuous variables. The most common methods are equal-width discretization and equal-frequency discretization. We do not discuss the concepts of discretization here and only give a Python implementation for reference.

1. Equal-width discretization

Bin the data using the cut() function in pandas:

import numpy as np
import pandas as pd
 
# Discretization: equal width
# Datas: 2-D array of shape (samples, features)
def Discretization_EqualWidth(K, Datas, FeatureNumber):
  DisDatas = np.zeros_like(Datas)
  for i in range(FeatureNumber):
    # Split column i into K equal-width bins labelled 1..K
    DisOneFeature = pd.cut(Datas[:, i], K, labels=range(1, K+1))
    DisDatas[:, i] = DisOneFeature
  return DisDatas
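For a quick look at what pd.cut() does on a single feature, the following sketch bins ten made-up values (an assumption for illustration) into K = 3 equal-width intervals. It uses labels=False, which returns integer codes starting at 0, and adds 1 to get labels 1..K. retbins=True also returns the computed edges.

```python
import numpy as np
import pandas as pd

# Illustrative values (an assumption): evenly spaced from 1 to 10.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
K = 3

# labels=False gives codes 0..K-1; retbins=True also returns the K+1 edges.
codes, edges = pd.cut(values, K, labels=False, retbins=True)
codes = codes + 1  # shift to labels 1..K

print(edges)  # lowest edge is nudged slightly below the minimum so it is included
print(codes)
```

Because the intervals are closed on the right, the values 4 and 7 fall into the lower of their two neighbouring bins. The bins therefore hold 4, 3 and 3 values, even though each interval has the same width.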

2. Equal-frequency discretization

pandas provides qcut(), but its bin edges are prone to duplicate values. If you set duplicates='drop' to remove the duplicated edges, you can easily end up with fewer bins than requested, so qcut() is not used here.
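The shortfall in the number of bins can be checked directly. In this sketch the data and K = 4 are assumptions picked to make the effect obvious. retbins=True returns the edges that remain after duplicates are dropped.

```python
import pandas as pd

# Illustrative data (an assumption): eight of the ten values are 0.
s = pd.Series([0] * 8 + [1, 2])
K = 4  # number of bins requested

binned, edges = pd.qcut(s, K, duplicates="drop", retbins=True)
n_bins = len(edges) - 1  # number of bins actually produced

print("requested:", K, "obtained:", n_bins)
```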

import numpy as np
import pandas as pd

# Discretization: equal frequency
# vector: a single feature (1-D array or Series)
def Rank_qcut(vector, K):
  quantile = np.array([float(i) / K for i in range(K + 1)]) # Quantile edges: K+1 values
  # Index of the first quantile edge that is >= the percentile rank x
  funBounder = lambda x: (quantile >= x).argmax()
  return pd.Series(vector).rank(pct=True).apply(funBounder)
 
# Discretization: equal frequency
# Datas: 2-D array of shape (samples, features)
def Discretization_EqualFrequency(K, Datas, FeatureNumber):
  DisDatas = np.zeros_like(Datas)
  for i in range(FeatureNumber):
    # Bin column i by percentile rank into labels 1..K
    DisOneFeature = Rank_qcut(pd.Series(Datas[:, i]), K)
    DisDatas[:, i] = DisOneFeature
  return DisDatas
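The same rank-then-lookup idea can also be written without a per-element lambda. The sketch below is an alternative, not the code above: rank_bins is a hypothetical helper name, and it swaps the argmax lookup for np.searchsorted with side="left", which likewise returns the first index whose edge is >= the percentile rank. It assumes the input has no missing values.

```python
import numpy as np
import pandas as pd

def rank_bins(data, K):
    """Hypothetical helper: equal-frequency labels 1..K for each column of a 2-D array."""
    # Column-wise percentile ranks in (0, 1]; ties share their average rank.
    pct = pd.DataFrame(data).rank(pct=True).to_numpy()
    edges = np.arange(K + 1) / K  # 0, 1/K, ..., 1
    # side="left": first index i such that edges[i] >= pct
    return np.searchsorted(edges, pct, side="left")

# Illustrative data (an assumption): 4 samples, 2 features.
data = np.array([[1.0, 5.0],
                 [2.0, 5.0],
                 [3.0, 5.0],
                 [4.0, 9.0]])

result = rank_bins(data, 2)
print(result)
```

In the second column the three tied values share one percentile rank (0.5) and all land in bin 1, which mirrors how the lambda-based version treats ties.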

That is everything I have to share on solving the equal-frequency binning qcut problem in Python. I hope it serves as a useful reference, and thank you for your support.