SoFunction
Updated on 2024-10-30

The perfect combination of Python and Excel: a summary of common operations

Preface

In the past, "business analysis" meant Business Analysis, and the tool was Excel. As data volumes grew beyond what Excel can cope with (Excel supports at most 1,048,576 rows), people turned to analytical tools such as Python and R, and the corresponding term became Business Analytics.

In fact, the guiding principle for Python and Excel is the same (Don't Repeat Yourself): replace mechanical operations and pure manual labor with more convenient operations whenever possible.

You can't do data analysis in Python without the famous pandas package. After many iterations and optimizations, the pandas ecosystem is now quite complete. The official website even gives a comparison between pandas and other analysis tools:

This article mainly uses pandas, with plotly as the plotting library. The Excel features covered are:

  • Interaction between Python and Excel
  • vlookup function
  • pivot table
  • drawings

If I come across more Excel features worth covering, I'll come back and keep this post updated.

Before you start, first load the pandas package as per usual:

import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 20)
pd.set_option('display.float_format', lambda x: '%.2f' % x)  # suppress scientific notation

Interaction between Python and Excel

The four pandas I/O functions most commonly used with Excel are read_csv / read_excel / to_csv / to_excel. Each has its own set of parameters, so you can customize exactly how data is read and exported.

Let's say you want to read the upper left part of such a table:

You can use pd.read_excel("", header=1, nrows=17, usecols=range(4)) (older pandas accepted usecols=3 for the first four columns; recent versions expect a list, a range, or an Excel-style column string) to get:

df
Out[]: 
   Work number          Name  Gender          Sector
0        A0001     Zhang Wei    Male     Engineering
1        A0002  Wang Xiuying  Female       Personnel
2        A0003     Wang Fang  Female  Administration
3        A0004    Zheng Yong    Male          Market
4        A0005      Zhang Li  Female             R&D
5        A0006      Wang Yan  Female       Logistics
6        A0007       Li Yong    Male          Market
7        A0008       Li Juan  Female     Engineering
8        A0009    Zhang Jing  Female       Personnel
9        A0010      Wang Lei    Male  Administration
10       A0011         Li Na  Female          Market
11       A0012    Liu Shiwen  Female             R&D
12       A0013     Wang Gang    Male       Logistics
13       A0014       Ye Qian  Female       Logistics
14       A0015    Jin Wenwen  Female          Market
15       A0016  Wang Chaojie    Male     Engineering
16       A0017        Li Jun    Male       Personnel

The output functions work the same way: which columns to write, whether to keep the index, and where to put the header can all be controlled through parameters.
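As a small sketch of that control (using to_csv here so it runs without an Excel writer installed; to_excel takes the same columns / index / header parameters):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'Name': ['Zhang Wei', 'Wang Fang'],
                   'Gender': ['Male', 'Female'],
                   'Sector': ['Engineering', 'Administration']})

buf = StringIO()
# write only two columns, drop the index, keep the header row
df.to_csv(buf, columns=['Name', 'Gender'], index=False, header=True)
print(buf.getvalue().splitlines())
# -> ['Name,Gender', 'Zhang Wei,Male', 'Wang Fang,Female']
```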

vlookup function

VLOOKUP is known as one of Excel's killer features and has a wide range of uses. The following examples come from a Douban post, "The 10 most common uses of the VLOOKUP function - how many do you know?"

Case 1

Question: cells A3:B7 hold a letter-grade lookup table: scores below 60 are grade E, 60-69 grade D, 70-79 grade C, 80-89 grade B, and 90 or above grade A. Columns D:G hold the Chinese test scores of Class 1, Grade 2. How do we return each student's letter grade based on their score?

Method: enter =VLOOKUP(G3, $A$3:$B$7, 2) in cells H3:H13.

python implementation:

df = pd.read_excel("", sheet_name=0)
def grade_to_point(x):
    if x >= 90:
        return 'A'
    elif x >= 80:
        return 'B'
    elif x >= 70:
        return 'C'
    elif x >= 60:
        return 'D'
    else:
        return 'E'
df['Grade'] = df['Languages'].apply(grade_to_point)
df
 
Out[]: 
    Student number          Name  Gender  Languages Grade
0              101   Wang Xiaoli  Female         69     D
1              102   Wang Baoqin    Male         85     B
2              103   Yang Yuping  Female         49     E
3              104  Tian Dongkai  Female         90     A
4              105  Chen Xuejiao  Female         73     C
5              106 Yang Jianfeng    Male         42     E
6              107    Lai Meijia  Female         79     C
7              108    Zhang Xing    Male         91     A
8              109    Ma Jinchun  Female         48     E
9              110 Wei Chuanjuan  Female        100     A
10             111  Wang Bingyan  Female         64     D
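The if/elif chain above can also be vectorized with pd.cut. A minimal sketch on made-up scores (the bin edges mirror the grade table):

```python
import pandas as pd

df = pd.DataFrame({'Languages': [69, 85, 49, 90, 73]})
# right-open bins: [0,60) -> E, [60,70) -> D, [70,80) -> C, [80,90) -> B, [90,101) -> A
df['Grade'] = pd.cut(df['Languages'], bins=[0, 60, 70, 80, 90, 101],
                     right=False, labels=['E', 'D', 'C', 'B', 'A'])
print(df['Grade'].tolist())  # -> ['D', 'B', 'E', 'A', 'C']
```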

Case 2

Question: how do we look up, in Sheet1, the monthly depreciation amount for each number from the depreciation schedule? (Cross-sheet lookup)

Method: enter =VLOOKUP(A2, 'Depreciation schedule'!$A$2:$G$12, 7, 0) in cells C2:C4 of Sheet1.

python implementation: just join the two tables on the number column using merge.

df1 = pd.read_excel("", sheet_name='Schedule of depreciation')
df2 = pd.read_excel("", sheet_name=1)  # Sheet1, referenced here by position
df2.merge(df1[['Number', 'Monthly depreciation']], how='left', on='Number')
Out[]: 
  Number     Asset name  Monthly depreciation
0  YT001  electric door                  1399
1  YT005  Santana sedan                  1147
2  YT008        printer                    51
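The key detail of merge here is how='left': every row of the lookup table is kept, and numbers without a match come back as NaN rather than raising an error. A minimal sketch on made-up numbers:

```python
import pandas as pd

schedule = pd.DataFrame({'Number': ['YT001', 'YT005'],
                         'Monthly depreciation': [1399, 1147]})
lookup = pd.DataFrame({'Number': ['YT001', 'YT005', 'YT099']})  # YT099 has no match

out = lookup.merge(schedule, how='left', on='Number')
print(out['Monthly depreciation'].tolist())  # -> [1399.0, 1147.0, nan]
```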

Case 3

Question: similar to Case 2, but this time a fuzzy (wildcard) lookup is needed.

Method: enter the formula =VLOOKUP(A2&"*", 'Depreciation schedule'!$B$2:$G$12, 6, 0) in cells B2:B7.

python implementation: this one is a bit trickier than the last and needs a few pandas tricks.

df1 = pd.read_excel("", sheet_name='Schedule of depreciation')
df3 = pd.read_excel("", sheet_name=3)  # table of abbreviated asset names
df3['Monthly depreciation'] = 0
for i in range(len(df3)):
    # keep the rows of df1 whose full asset name contains the abbreviation
    match = df1[df1['Asset name'].map(lambda x: df3['Asset name'][i] in x)]
    df3.loc[i, 'Monthly depreciation'] = match['Monthly depreciation'].values[0]
 
df3
Out[]: 
   Asset name  Monthly depreciation
0    electric                  1399
1       truck                  2438
2          HP                   132
3   crosslink                 10133
4     Santana                  1147
5    magnolia                   230
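A tidier way to express the same substring lookup, assuming the same two-table layout, is str.contains (all names and numbers here are made up for illustration):

```python
import pandas as pd

full = pd.DataFrame({'Asset name': ['electric door', 'HP printer', 'Santana sedan'],
                     'Monthly depreciation': [1399, 132, 1147]})
abbrev = pd.DataFrame({'Asset name': ['electric', 'Santana']})

def lookup(short):
    # first full name containing the abbreviation (regex disabled for literal matching)
    hit = full.loc[full['Asset name'].str.contains(short, regex=False),
                   'Monthly depreciation']
    return hit.iloc[0] if len(hit) else None

abbrev['Monthly depreciation'] = abbrev['Asset name'].map(lookup)
print(abbrev['Monthly depreciation'].tolist())  # -> [1399, 1147]
```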

Case 4

Question: when entering data in Excel, the user wants to type only a key field and have the remaining fields of the record filled in automatically, to improve efficiency. For example, entering an employee's work number should display that employee's basic information; entering a material number should display the material's name, unit price, and so on.

The data source is a table of basic information for all employees of a company. In the "March 2010 Employee Leave Statistics" worksheet, when a work number is entered in column A, how do we automatically fill in the corresponding employee's name, ID number, department, position, hire date, and so on?

Method: use VLOOKUP + MATCH. In the "March 2010 Employee Leave Statistics" worksheet, select cells B3:F8, enter a formula along the lines of =IF($A3="","",VLOOKUP($A3,'Employee basic information'!$A:$H,MATCH(B$2,'Employee basic information'!$2:$2,0),0)), and press Ctrl+Enter to fill the whole range.

python implementation: the Excel method above is very flexible, but the pandas idea and operation are simpler and more convenient!

df4 = pd.read_excel("", sheet_name='Employee Basic Information Form')
df5 = pd.read_excel("", sheet_name='Leave statistics')
df5.merge(df4[['Work number', 'Name', 'Sector', 'Position', 'Date of entry on duty']], on='Work number')
Out[]: 
  Work number           Name          Sector        Position Date of entry on duty
0       A0004  Gong Mengjuan       Logistics      Supervisor            2006-11-20
1       A0003       Zhao Min  Administration           Clerk            2007-02-16
2       A0005     Huang Ling             R&D        Employee            2009-01-14
3       A0007       Wang Wei       Personnel         Manager            2006-07-24
4       A0016   Zhang Junbao          Market        Employee            2007-08-14
5       A0017         Qin Yu       Personnel  Deputy manager            2008-03-06

Case 5

Question: use VLOOKUP for batch lookup. VLOOKUP normally finds only the first match, so how do we find them all? As shown below, how do we list all of Zhang Yi's purchases?

Method: in cells C9:C11 enter the array formula =VLOOKUP(B$9&ROW(A1),IF({1,0},$B$2:$B$6&COUNTIF(INDIRECT("b2:b"&ROW($2:$6)),B$9),$C$2:$C$6),2,) and press Shift+Ctrl+Enter.

python implementation: VLOOKUP has two shortcomings (or characteristics, if you like): the lookup value must sit in the first column of the range, and only the first match is returned, with the rest ignored even when they match. In Excel these can be worked around by cleverly combining the IF and INDIRECT functions, but pandas does it more directly.

df6 = pd.read_excel("", sheet_name='Consumption')
df6[df6['Name'] == 'Zhang Yi'][['Name', 'Consumption']]
Out[]: 
       Name  Consumption
0  Zhang Yi          100
2  Zhang Yi          300
4  Zhang Yi         1000
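Incidentally, the COUNTIF trick in the Excel formula numbers each duplicate (Zhang Yi 1, Zhang Yi 2, ...). If you ever need that numbering in pandas, groupby().cumcount() produces it; a sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Zhang Yi', 'Li Er', 'Zhang Yi', 'Wang San', 'Zhang Yi'],
                   'Consumption': [100, 200, 300, 400, 1000]})
# occurrence number of each name, mirroring COUNTIF over an expanding range
df['n'] = df.groupby('Name').cumcount() + 1
zy = df.loc[df['Name'] == 'Zhang Yi', ['n', 'Consumption']]
print(zy.values.tolist())  # -> [[1, 100], [2, 300], [3, 1000]]
```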

pivot table

Pivot tables are another Excel godsend: essentially, a series of tables reorganized and aggregated. The case used here comes from a Zhihu answer to "What are Excel pivot tables useful for?" (/question/22484899/answer/39933218).

Problem: summarize total sales and total costs for each region and each month, and compute the profit at the same time.

With Excel's pivot table, the final result looks like this:

python implementation: for a grouping task like this, the first thing that comes to mind is pandas' groupby. The code is simple to write; the idea is to express in code the same steps just clicked through in Excel:

df = pd.read_excel('', sheet_name='Sales statistics table')
df['Order month'] = df['Order Date'].apply(lambda x: x.month)
df2 = df.groupby(['Order month', 'Region'])[['Sales', 'Costs']].agg('sum')
df2['Profit'] = df2['Sales'] - df2['Costs']
df2
 
Out[]: 
                          Sales       Costs     Profit
Order month Region
1           Nanjing    134313.61    94967.84   39345.77
            Changshu   177531.47   163220.07   14311.40
            Wuxi       316418.09   231822.28   84595.81
            Kunshan    159183.35   145403.32   13780.03
            Suzhou     287253.99   238812.03   48441.96
2           Nanjing    187129.13   138530.42   48598.71
            Changshu   154442.74   126834.37   27608.37
            Wuxi       464012.20   376134.98   87877.22
            Kunshan    102324.46    86244.52   16079.94
            Suzhou     105940.34    91419.54   14520.80
                             ...         ...        ...
11          Nanjing    286329.88   221687.11   64642.77
            Changshu  2118503.54  1840868.53  277635.01
            Wuxi       633915.41   536866.77   97048.64
            Kunshan    351023.24   342420.18    8603.06
            Suzhou    1269351.39  1144809.83  124541.56
12          Nanjing    894522.06   808959.32   85562.74
            Changshu   324454.49   262918.81   61535.68
            Wuxi      1040127.19   856816.72  183310.48
            Kunshan   1096212.75   951652.87  144559.87
            Suzhou     347939.30   302154.25   45785.05

[60 rows x 3 columns]

This can also be done using the pivot_table function in pandas:

df3 = pd.pivot_table(df, values=['Sales', 'Costs'], index=['Order month', 'Region'], aggfunc='sum')
df3['Profit'] = df3['Sales'] - df3['Costs']
df3
 
Out[]: 
                          Costs       Sales     Profit
Order month Region
1           Nanjing     94967.84   134313.61   39345.77
            Changshu   163220.07   177531.47   14311.40
            Wuxi       231822.28   316418.09   84595.81
            Kunshan    145403.32   159183.35   13780.03
            Suzhou     238812.03   287253.99   48441.96
2           Nanjing    138530.42   187129.13   48598.71
            Changshu   126834.37   154442.74   27608.37
            Wuxi       376134.98   464012.20   87877.22
            Kunshan     86244.52   102324.46   16079.94
            Suzhou      91419.54   105940.34   14520.80
                             ...         ...        ...
11          Nanjing    221687.11   286329.88   64642.77
            Changshu  1840868.53  2118503.54  277635.01
            Wuxi       536866.77   633915.41   97048.64
            Kunshan    342420.18   351023.24    8603.06
            Suzhou    1144809.83  1269351.39  124541.56
12          Nanjing    808959.32   894522.06   85562.74
            Changshu   262918.81   324454.49   61535.68
            Wuxi       856816.72  1040127.19  183310.48
            Kunshan    951652.87  1096212.75  144559.87
            Suzhou     302154.25   347939.30   45785.05

[60 rows x 3 columns]

The index / columns / values parameters of pandas' pivot_table correspond to the rows / columns / values areas of an Excel pivot table (pivot tables have more settings than just these three, of course, but they are the core).
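One of those extra settings worth knowing: margins=True adds Excel-style grand totals. A toy sketch:

```python
import pandas as pd

df = pd.DataFrame({'Region': ['A', 'A', 'B', 'B'],
                   'Sales': [10, 20, 30, 40]})
pt = pd.pivot_table(df, values='Sales', index='Region',
                    aggfunc='sum', margins=True, margins_name='Total')
print(pt['Sales'].tolist())  # -> [30, 70, 100]
```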

But I personally prefer groupby because it is very fast. In a Kaggle competition I had a table of borrowers' behavior records, about 27 million rows; groupby computed several aggregation functions over it in just a few seconds.

groupby is very comprehensive, with many built-in aggregation functions that cover most basic needs; if you need something else, you can combine it with apply and a lambda.

However, the official pandas documentation notes that apply after groupby is slow: agg is optimized internally, so it is very fast, while apply is not. The recommendation is to look for another way first and fall back to apply only when nothing else works.
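To make the distinction concrete: the two calls below return the same numbers, but only the first goes through the optimized aggregation path (a toy sketch):

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})
fast = df.groupby('g')['v'].agg('sum')                # optimized built-in aggregation
slow = df.groupby('g')['v'].apply(lambda s: s.sum())  # same result, unoptimized path
print(fast.tolist(), slow.tolist())  # -> [3, 3] [3, 3]
```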

When I played that competition, I used groupby's apply to generate a new variable, writing something like ins['weight'] = ins[['SK_ID_PREV', 'DAYS_ENTRY_PAYMENT']].groupby('SK_ID_PREV').apply(lambda x: 1 - abs(x) / abs(x).max()).iloc[:, 1]. On 10 million rows it ran for more than ten minutes, and the wait wore me out.
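A slow group apply like that can often be rewritten with transform, which broadcasts a per-group statistic back to the original rows and is usually much faster. A sketch with made-up values (the column names are borrowed from the example above):

```python
import pandas as pd

ins = pd.DataFrame({'SK_ID_PREV': [1, 1, 2],
                    'DAYS_ENTRY_PAYMENT': [-10, -20, -5]})
# per-group max of |days|, broadcast back to every row of the group
m = ins.groupby('SK_ID_PREV')['DAYS_ENTRY_PAYMENT'].transform(lambda s: s.abs().max())
ins['weight'] = 1 - ins['DAYS_ENTRY_PAYMENT'].abs() / m
print(ins['weight'].tolist())  # -> [0.5, 0.0, 0.0]
```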

drawings

Excel charts are interactive and support simple manipulations of the figure, so the Python visualization library used here is plotly. The examples come from assignments in my development economics class this semester: the charts were originally drawn in Excel, and here they are redrawn in Python. Before starting, load the plotly package:

import plotly.offline as off
import plotly.graph_objs as go
off.init_notebook_mode()

bar chart

There were a lot of bar charts drawn in Excel at the time; one of them was:

Here's a drawing using plotly:

df = pd.read_excel("", sheet_name='Enrolment in tertiary education')
trace1 = go.Bar(
        x=df['The State'],
        y=df[1995],
        name='1995',
        opacity=0.6,
        marker=dict(
                color='powderblue'
                )
        )

trace2 = go.Bar(
        x=df['The State'],
        y=df[2005],
        name='2005',
        opacity=0.6,
        marker=dict(
                color='aliceblue'
                )
        )

trace3 = go.Bar(
        x=df['The State'],
        y=df[2014],
        name='2014',
        opacity=0.6,
        marker=dict(
                color='royalblue'
                )
        )
layout = go.Layout(barmode='group')
data = [trace1, trace2, trace3]
fig = go.Figure(data=data, layout=layout)
off.iplot(fig)

radar chart

Drawn in Excel:

Drawn in python:

df = pd.read_excel('', sheet_name='Political governance')
# assumes the first column holds the series names and the rest are indicators in [0, 1]
theta = df.columns.tolist()
theta.remove(theta[0])          # the remaining column names become the radar axes
names = df.iloc[:, 0]           # series names from the first column
values = np.array(df.iloc[:, 1:])

trace1 = go.Scatterpolar(
        r=values[0],
        theta=theta,
        name=names[0]
        )

trace2 = go.Scatterpolar(
        r=values[1],
        theta=theta,
        name=names[1]
        )

trace3 = go.Scatterpolar(
        r=values[2],
        theta=theta,
        name=names[2]
        )

trace4 = go.Scatterpolar(
        r=values[3],
        theta=theta,
        name=names[3]
        )

data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
        polar=dict(
                radialaxis=dict(
                        visible=True,
                        range=[0, 1]
                        )
                ),
        showlegend=True
        )
fig = go.Figure(data=data, layout=layout)
off.iplot(fig)

This one is quite a bit more fiddly to draw than in Excel.

Overall, Excel is easiest for simple, basic charts, while Python is more appropriate for advanced charts or ones that need heavy customization.

That concludes this summary of common operations combining Python and Excel. For more on using Python together with Excel, search my previous posts or browse the related articles below. I hope you will keep supporting me!