
An introduction to constructing a new feature from two DataFrame columns by multiplication

Suppose we want to construct a new feature b.

The goal is to mark the rows of a whose values lie strictly between 4 and 6: True if a value falls in that range, otherwise False.

Then the code is as follows

import pandas as pd

lists = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
# b is True where 4 < a < 6, False otherwise
lists['b'] = (lists['a'] < 6).mul(lists['a'] > 4)
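The same mask can also be written with the & operator; this is just an equivalent alternative to the .mul form above, using the same lists frame:

lists['b'] = (lists['a'] > 4) & (lists['a'] < 6)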

Addendum: multiplying two DataFrame columns and writing the result to a new column

Look at the code.

df["new"]=df3["rate"]*df3["duration"]

new is the name of the new column.

rate and duration are the columns to be multiplied.

Addition, subtraction, multiplication and division all apply!
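For example, assuming the same df3 with numeric rate and duration columns (the column names are only taken from the snippet above), the other operators follow the same pattern:

df3["sum"] = df3["rate"] + df3["duration"]
df3["diff"] = df3["rate"] - df3["duration"]
df3["ratio"] = df3["rate"] / df3["duration"]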

Supplement: operations for deriving new features from a DataFrame

1. Derive a new feature, in one-hot form, from the values of a column

# Derive each value of the LBL1 feature as a new one-hot feature
piao = df_train_log.LBL1.value_counts().index
# First construct a temporary df keyed by USRID
df_tmp = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID})
# Initialise all new feature columns to 0
for i in piao:
    df_tmp['PIAO_' + i] = 0
# The raw log has multiple records per USRID, so group by USRID and
# set the corresponding feature to 1 for each LBL1 value that appears
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = t.USRID.value_counts().index[0]
    tmp_list = t.LBL1.value_counts().index
    for j in tmp_list:
        df_tmp.loc[df_tmp.USRID == id, 'PIAO_' + j] = 1
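Under the same assumptions (a df_train_log frame with USRID and LBL1 columns), a more vectorized sketch of the same one-hot derivation can use pd.get_dummies; this is an equivalent alternative, not the original code:

one_hot = pd.get_dummies(df_train_log[['USRID', 'LBL1']], columns=['LBL1'], prefix='PIAO')
# One row per USRID; 1 if that LBL1 value appears in any of the user's records
df_tmp_alt = one_hot.groupby('USRID').max().reset_index()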

2. Group statistics: for each USRID, select the value of a variable with the highest number of occurrences

group = df_train_log.groupby(['USRID'])
lt = []
list_max_lbl1 = []
list_max_lbl2 = []
list_max_lbl3 = []
for k in group.groups.keys():
    t = group.get_group(k)
    # Find the most frequent item of each column with value_counts
    argmx = t['EVT_LBL'].value_counts().idxmax()
    lbl1_max = t['LBL1'].value_counts().idxmax()
    lbl2_max = t['LBL2'].value_counts().idxmax()
    lbl3_max = t['LBL3'].value_counts().idxmax()
    list_max_lbl1.append(lbl1_max)
    list_max_lbl2.append(lbl2_max)
    list_max_lbl3.append(lbl3_max)
    # Keep only the row with the most frequent EVT_LBL
    c = t[t['EVT_LBL'] == argmx].drop_duplicates('EVT_LBL')
    # Collect it into the list
    lt.append(c)
# Construct a new df with one row per USRID
df_train_log_new = pd.concat(lt)
# Three more features: the most frequent item of LBL1-LBL3 for each user
df_train_log_new['LBL1_MAX'] = list_max_lbl1
df_train_log_new['LBL2_MAX'] = list_max_lbl2
df_train_log_new['LBL3_MAX'] = list_max_lbl3
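A more compact sketch of the same statistic (assuming the same df_train_log with USRID and LBL1-LBL3 columns) uses groupby with an aggregation function; again, this is an alternative formulation, not the original code:

mode_per_user = (df_train_log
                 .groupby('USRID')[['LBL1', 'LBL2', 'LBL3']]
                 .agg(lambda s: s.value_counts().idxmax())
                 .rename(columns={'LBL1': 'LBL1_MAX', 'LBL2': 'LBL2_MAX', 'LBL3': 'LBL3_MAX'})
                 .reset_index())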

3. Derive one-hot features for whether activity happened on particular days of the week

# Create a temporary df; the Wednesday, Saturday and Sunday flags all default to 0
df_day = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID})
df_day['weekday_3'] = 0
df_day['weekday_6'] = 0
df_day['weekday_7'] = 0
# Group statistics: set the flag to 1 if the user has a record on that weekday
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = t.USRID.value_counts().index[0]
    tmp_list = t.occ_dayofweek.value_counts().index
    for j in tmp_list:
        if j == 3:
            df_day.loc[df_day.USRID == id, 'weekday_3'] = 1
        elif j == 6:
            df_day.loc[df_day.USRID == id, 'weekday_6'] = 1
        elif j == 7:
            df_day.loc[df_day.USRID == id, 'weekday_7'] = 1
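The same flags can also be sketched with pd.crosstab, assuming occ_dayofweek stores the weekday as numbers that include 3, 6 and 7; this is an alternative formulation, not the original code:

flags = pd.crosstab(df_train_log.USRID, df_train_log.occ_dayofweek)
flags = (flags > 0).astype(int).reindex(columns=[3, 6, 7], fill_value=0)
flags = flags.rename(columns={3: 'weekday_3', 6: 'weekday_6', 7: 'weekday_7'}).reset_index()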

4. Compute how many seconds in total the user stays in the app, and on how many days the user opened it

import time
import datetime

# First convert the date to a Unix timestamp and store it as a new feature
tmp_list = []
for i in df_train_log.OCC_TIM:
    d = datetime.datetime.strptime(str(i), "%Y-%m-%d %H:%M:%S")
    evt_time = time.mktime(d.timetuple())
    tmp_list.append(evt_time)
df_train_log['time'] = tmp_list
# Each row's timestamp minus the previous row's gives the gap between consecutive events (the dwell time)
df_train_log['diff_time'] = df_train_log.time - df_train_log.time.shift(1)
# Construct a new DataFrame and group to get the number of days the app was viewed
df_time = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID})
# Number of distinct days with activity
df_time['days'] = 0
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = set(t.USRID).pop()
    df_time.loc[df_time.USRID == id, 'days'] = len(t.occ_day.value_counts().index)
# Remove anomalous gaps (e.g. a difference spanning two days is clearly wrong) and NA values
df_train_log = df_train_log[(df_train_log.diff_time > 0) & (df_train_log.diff_time < 8000)]
# Cumulative dwell time per user
group_stayTime = df_train_log['diff_time'].groupby(df_train_log['USRID']).sum()
# Create a new df from the grouped sums
df_tmp = pd.DataFrame({'USRID': list(group_stayTime.index), 'stay_time': list(group_stayTime.values)})
# Merge into one df; users with no valid dwell time get 0 after the merge
df = pd.merge(df_time, df_tmp, on=['USRID'], how='left')
df.fillna(0, inplace=True)
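A more pandas-native sketch of the same computation, assuming OCC_TIM parses with pd.to_datetime and occ_day is the calendar day, could look like this; it is an alternative, not the original code:

df_train_log['time'] = pd.to_datetime(df_train_log.OCC_TIM).astype('int64') // 10**9
# Per-user gap between consecutive events, so the diff never crosses users
df_train_log['diff_time'] = df_train_log.groupby('USRID')['time'].diff()
days_per_user = df_train_log.groupby('USRID')['occ_day'].nunique()
valid = df_train_log[(df_train_log.diff_time > 0) & (df_train_log.diff_time < 8000)]
stay_per_user = valid.groupby('USRID')['diff_time'].sum()
df_alt = pd.DataFrame({'USRID': days_per_user.index,
                       'days': days_per_user.values,
                       'stay_time': stay_per_user.reindex(days_per_user.index, fill_value=0).values})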

The above is based on my personal experience; I hope it gives you a useful reference, and I hope you will continue to support us.