Suppose we want to construct a new feature b from an existing column a. The goal is to flag the rows of a whose values lie strictly between 4 and 6: True if a row matches, False otherwise.
The code is as follows:
import pandas as pd

lists = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
# .mul() multiplies the two boolean masks element-wise, which acts as a
# logical AND; (lists['a'] < 6) & (lists['a'] > 4) is equivalent
lists['b'] = (lists['a'] < 6).mul(lists['a'] > 4)
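The same masking idea can be checked end to end with a small self-contained sketch (the data here is the toy column from above; the `&` operator is used in place of `.mul()`):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
# element-wise AND of the two boolean Series
df['b'] = (df['a'] < 6) & (df['a'] > 4)
# with integer values, only a == 5 lies strictly between 4 and 6
print(df[df['b']])
```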
Supplement: multiplying two DataFrame columns, with the result stored as a new column. The code:
df3["new"] = df3["rate"] * df3["duration"]
new is the name of the new column; rate and duration are the columns being multiplied. Addition, subtraction, and division all work the same way.
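A minimal runnable sketch of this column arithmetic, using made-up rate/duration values (the data itself is hypothetical):

```python
import pandas as pd

df3 = pd.DataFrame({'rate': [2.0, 3.0, 4.0], 'duration': [10, 20, 30]})
df3['new'] = df3['rate'] * df3['duration']    # multiplication
df3['total'] = df3['rate'] + df3['duration']  # addition works the same way
df3['ratio'] = df3['duration'] / df3['rate']  # as does division
print(df3)
```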
Supplement: deriving new features from a DataFrame
1. Derive a new feature, in one-hot form, from the values of a column in the DataFrame.
# Derive the values of the LBL1 feature as new one-hot features
piao = df_train_log.LBL1.value_counts().index
# First construct a temporary df, one row per USRID
df_tmp = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID.values})
# Initialize all new feature columns to 0
for i in piao:
    df_tmp['PIAO_' + i] = 0
# Group by USRID (the raw log has multiple records per USRID) and set
# the corresponding feature to 1 when the user has that LBL1 value
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = t.USRID.value_counts().index[0]
    tmp_list = t.LBL1.value_counts().index
    for j in tmp_list:
        df_tmp.loc[df_tmp.USRID == id, 'PIAO_' + j] = 1
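The per-user loop above can also be replaced by a vectorized one-liner: one-hot encode the column with pd.get_dummies, then take the per-user maximum. A sketch on a toy log (the column names mirror df_train_log/LBL1; the data is made up):

```python
import pandas as pd

# toy log: multiple rows per USRID, one categorical column
log = pd.DataFrame({'USRID': [1, 1, 2, 2, 2],
                    'LBL1': ['x', 'y', 'x', 'x', 'z']})

# one-hot encode LBL1, then take the per-user max: the indicator is 1
# if the user ever had that value, 0 otherwise
onehot = pd.get_dummies(log['LBL1'], prefix='PIAO').astype(int)
df_tmp = (pd.concat([log[['USRID']], onehot], axis=1)
            .groupby('USRID').max().reset_index())
print(df_tmp)
```

This avoids the nested Python loops entirely, which matters once the log has millions of rows.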
2. Grouping statistics: for each USRID, select the value of a variable that occurs most often.
group = df_train_log.groupby(['USRID'])
lt = []
list_max_lbl1 = []
list_max_lbl2 = []
list_max_lbl3 = []
for k in group.groups.keys():
    t = group.get_group(k)
    # value_counts sorts by frequency, so index[0] is the most frequent item
    argmx = t['EVT_LBL'].value_counts().index[0]
    lbl1_max = t['LBL1'].value_counts().index[0]
    lbl2_max = t['LBL2'].value_counts().index[0]
    lbl3_max = t['LBL3'].value_counts().index[0]
    list_max_lbl1.append(lbl1_max)
    list_max_lbl2.append(lbl2_max)
    list_max_lbl3.append(lbl3_max)
    # Keep only the most frequent item
    c = t[t['EVT_LBL'] == argmx].drop_duplicates('EVT_LBL')
    # Append it to the list
    lt.append(c)
# Construct a new df
df_train_log_new = pd.concat(lt)
# Three more features: the most frequent items for LBL1-LBL3 respectively
df_train_log_new['LBL1_MAX'] = list_max_lbl1
df_train_log_new['LBL2_MAX'] = list_max_lbl2
df_train_log_new['LBL3_MAX'] = list_max_lbl3
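The "most frequent item per group" step can also be expressed directly with groupby/agg, without get_group. A sketch on toy data (column names mirror the example above):

```python
import pandas as pd

log = pd.DataFrame({'USRID': [1, 1, 1, 2, 2],
                    'LBL1': ['a', 'a', 'b', 'c', 'c']})

# value_counts sorts by count descending, so the first index entry
# is the most frequent value within each group
lbl1_max = log.groupby('USRID')['LBL1'].agg(lambda s: s.value_counts().index[0])
print(lbl1_max)
```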
3. Derive one-hot features for whether activity happened on particular days of the week.
# Create a temporary df; Wednesday, Saturday and Sunday all default to 0
df_day = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID.values})
df_day['weekday_3'] = 0
df_day['weekday_6'] = 0
df_day['weekday_7'] = 0
# Grouping statistics: set to 1 if the day occurs for the user, otherwise leave 0
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = t.USRID.value_counts().index[0]
    tmp_list = t.occ_dayofweek.value_counts().index
    for j in tmp_list:
        if j == 3:
            df_day.loc[df_day.USRID == id, 'weekday_3'] = 1
        elif j == 6:
            df_day.loc[df_day.USRID == id, 'weekday_6'] = 1
        elif j == 7:
            df_day.loc[df_day.USRID == id, 'weekday_7'] = 1
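pd.crosstab gives the same user-by-weekday indicator table without any explicit loop. A sketch on toy data (the occ_dayofweek column is assumed to hold weekday numbers as above):

```python
import pandas as pd

log = pd.DataFrame({'USRID': [1, 1, 2],
                    'occ_dayofweek': [3, 6, 7]})

# crosstab counts visits per (user, weekday); clip to 0/1 for a pure indicator,
# and reindex so weekdays 3, 6, 7 are always present even if unseen
days = pd.crosstab(log['USRID'], log['occ_dayofweek']).clip(upper=1)
days = (days.reindex(columns=[3, 6, 7], fill_value=0)
            .add_prefix('weekday_').reset_index())
print(days)
```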
4. Compute how many seconds the user stays in the app in total, and on how many distinct days the user opened the app.
import time
from datetime import datetime

# First convert the date to a Unix timestamp and store it as a new feature
tmp_list = []
for i in df_train_log.OCC_TIM:
    d = datetime.strptime(str(i), "%Y-%m-%d %H:%M:%S")
    evt_time = time.mktime(d.timetuple())
    tmp_list.append(evt_time)
df_train_log['time'] = tmp_list
# Subtract the previous row from each row to get the per-event dwell time
df_train_log['diff_time'] = df_train_log.time - df_train_log.time.shift(1)
# Construct a new DataFrame; group to get the number of days the app was viewed
df_time = pd.DataFrame({'USRID': df_train_log.drop_duplicates('USRID').USRID.values})
df_time['days'] = 0
group = df_train_log.groupby(['USRID'])
for k in group.groups.keys():
    t = group.get_group(k)
    id = set(t.USRID).pop()
    df_time.loc[df_time.USRID == id, 'days'] = len(t.occ_day.value_counts().index)
# Remove anomalous time differences (a gap spanning days is clearly not a
# dwell time) as well as NAs
df_train_log = df_train_log[(df_train_log.diff_time > 0) & (df_train_log.diff_time < 8000)]
# Cumulative dwell time per user
group_stayTime = df_train_log['diff_time'].groupby(df_train_log['USRID']).sum()
# Create a new df from the grouped result
df_tmp = pd.DataFrame({'USRID': list(group_stayTime.index),
                       'stay_time': list(group_stayTime.values)})
# Merge into a new df; after the merge, fill missing dwell times with 0
df = pd.merge(df_time, df_tmp, on=['USRID'], how='left')
df.fillna(0, inplace=True)
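The strptime/mktime loop can be replaced by pd.to_datetime plus Series.diff. A self-contained sketch with made-up timestamps (the OCC_TIM format matches the example above; the >0 / <8000 filter also discards diffs that cross a user boundary, since those gaps are large):

```python
import pandas as pd

log = pd.DataFrame({
    'USRID': [1, 1, 1, 2, 2],
    'OCC_TIM': ['2018-03-01 10:00:00', '2018-03-01 10:00:30',
                '2018-03-02 09:00:00', '2018-03-05 08:00:00',
                '2018-03-05 08:01:40'],
})

# vectorized parse and row-to-row difference in seconds
ts = pd.to_datetime(log['OCC_TIM'])
log['diff_time'] = ts.diff().dt.total_seconds()

# drop the NA first row and anomalous gaps, then sum per user
valid = log[(log['diff_time'] > 0) & (log['diff_time'] < 8000)]
stay = valid.groupby('USRID')['diff_time'].sum()
print(stay)
```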
The above reflects my personal experience; I hope it serves as a useful reference, and I appreciate your continued support.