preamble
In the past, the corresponding English word for business analysis is Business Analysis, we use the analysis tool is Excel, and then the amount of data, Excel can not cope with it (Excel maximum number of rows to support 1048576 rows), people began to turn to python and R, such as analytical tools, this time the corresponding word for business analysis is Business Analytics.
In fact, the guidelines for using python and Excel are the same [We don't repeat ourselves], both of which are to replace mechanical operations and pure manual labor with more convenient operations whenever possible.
You can't do data analysis with python without the famous pandas package. After many versions of iterative optimization, the pandas ecosystem is now quite complete.The official website also gives a comparison between it and other analysis tools:
In this paper, the main use is also pandas, drawing with the library is plotly, the realization of the common functions of Excel have:
- Interaction between Python and Excel
- vlookup function
- pivot table
- drawings
I'll be back in the future to keep updating and adding to it if I uncover more Excel features.
Before you start, first load the pandas package as per usual:
import numpy as np import pandas as pd pd.set_option('max_columns', 10) pd.set_option('max_rows', 20) pd.set_option('display.float_format', lambda x: '%.2f' % x) # Prohibition of scientific notation
Interaction between Python and Excel
pandas in the most commonly used and Excel I/O related to the four functions are read_csv/ read_excel/ to_csv/ to_excel, they have specific parameter settings, you can customize the desired effect of reading and exporting.
Let's say you want to read the upper left part of such a table:
You can use pd.read_excel("", header=1, nrows=17, usecols=3) to return the results:
df
Out[]:
Work number Name Gender Sector
0 A0001 Zhang Wei Male Engineering
1 A0002 Wang Xiuying Female Personnel
2 A0003 Wang Fang Female Administration
3 A0004 Zheng Yong male Market
4 A0005 Zhang Li Female Research and development
5 A0006 Wang Yan Female Logistics
6 A0007 Li Yong male Market
7 A0008 Li Juan Female Engineering
8 A0009 Zhang Jing Female Personnel
9 A0010 Wang Lei Male Administrative
10 A0011 Li Na Female Market
11 A0012 Liu Shiwen Female R&D
12 A0013 Wang Gang Male Logistics
13 A0014 Ye Qian Female Logistics
14 A0015 Jin Wenwen Female Market
15 A0016 Wang Chaojie Male Engineering
16 A0017 Li Jun Male Personnel
Output function is the same, how many columns to use, to index or not, how to put the title, can be controlled.
vlookup function
vlookup is known as one of the artifacts in Excel, a wide range of uses, the following example from Douban, VLOOKUP function of the most commonly used 10 kinds of use, you will be a few kinds?
Case 1
Question:The A3:B7 cell area is a letter grade lookup table, indicating that a score of less than 60 is grade E, 60 to 69 is grade D, 70 to 79 is grade C, 80 to 89 is grade B, and 90 or more is grade A. How do I return the letter grades of my second grade 1 class based on their language scores, as shown in the table in column D:G for the second grade 1 class language test scores?
Methods:Enter =VLOOKUP (G3, 3:7, 2) in the H3:H13 cell area
python implementation:
df = pd.read_excel("", sheet_name=0) def grade_to_point(x): if x >= 90: return 'A' elif x >= 80: return 'B' elif x >= 70: return 'C' elif x >= 60: return 'D' else: return 'E' df['Grade'] = df['Languages'].apply(grade_to_point) df Out[]: student number name and surname distinguishing between the sexes multilingualism hierarchy 0 101 Wang Xiaoli (1906-1978), Mao *'s second wife women 69 D 1 102 Wang Baoqin male 85 B 2 103 Yang Yuping women 49 E 3 104 Tien Tung-kai (1932-), Chinese-American physicist, astronomer and mathematician women 90 A 4 105 Chen Xuejiao women 73 C 5 106 Yang Jianfeng (1952-), *-US novelist male 42 E 6 107 Lai Mui Ka women 79 C 7 108 Zhang Xing (1935-1985), journalist and novelist male 91 A 8 109 Ma Jinchun (1944-), Chinese politician women 48 E 9 110 Wei Chuanjuan (1989-), Chinese athlete, lady high jumper women 100 A 10 111 Wang Bingyan (1902-1992), Chinese novelist women 64 D
Case 2
Question:How to find the monthly depreciation amount under the corresponding number in the depreciation schedule inside Sheet1? (Cross Sheet Search)
Methods:Enter =VLOOKUP(A2, Depreciation Schedule!AG$12, 7, 0) in cells C2:C4 inside Sheet1.
python implementation:Just join the two tables by number using merge
df1 = pd.read_excel("", sheet_name='Schedule of depreciation') df2 = pd.read_excel("", sheet_name=1) # Sheet 1 in the title (df1[['Number', 'Monthly depreciation']], how='left', on='Number') Out[]: serial number Asset name Monthly depreciation 0 YT001 electric door 1399 1 YT005 Santana sedan (automobile) 1147 2 YT008 printer 51
Case 3
Question:Similar to Case 2, but this time you need to use approximate lookups
Methods:Enter the formula = VLOOKUP(A2& "*", Depreciation Schedule!2:12, 6, 0) in the B2:B7 area.
python implementation:This one is a bit more tricky than the last one and requires some pandas usage skills
df1 = pd.read_excel("", sheet_name='Schedule of depreciation') df3 = pd.read_excel("", sheet_name=3) # Tables containing abbreviations of asset names df3['Monthly depreciation'] = 0 for i in range(len(df3['Asset name'])): df3['Monthly depreciation'][i] = df1[df1['Asset name'].map(lambda x:df3['Asset name'][i] in x)]['Monthly depreciation'] df3 Out[]: Asset name Monthly depreciation 0 electric powered 1399 1 trucks 2438 2 Hewlett-Packard 132 3 crosslink 10133 4 Santana (name) 1147 5 lily magnolia (Magnolia liliflora) 230
Case 4
Question:When entering data information in Excel, in order to improve efficiency, the user wants to enter the keywords of the data, automatically display the rest of the record information, for example, enter the employee's work number automatically display the employee's letter of life, enter the material number can automatically display the name of the material, the unit price and so on.
As shown in the figure for a unit of all the employees of the basic information of the data source table, in the "March 2010 Employee Leave Statistics" worksheet, when entering the employee's work number in column A, how to achieve the corresponding employee's name, ID card number, department, position, date of entry and other information is automatically entered?
Methods:use VLOOKUP + MATCH function, in the "March 2010 Employee Leave Statistics" worksheet, select the B3: F8 cell area, enter the following formula = IF (A3, the basic information of the employee! H, MATCH (B basic information of employees 2: $ 2, 0), 0)), press 【Ctrl +Enter] key combination to end.
python implementation:The Excel method above is very flexible, but the idea and operation of pandas is more simple and convenient!
df4 = pd.read_excel("", sheet_name='Employee Basic Information Form') df5 = pd.read_excel("", sheet_name='Leave statistics') (df4[['Work number', 'Name', 'Sector', 'Position', 'Date of entry on duty']], on='Work number') Out[]: job title name and surname sectoral duties Entry date 0 A0004 Gong Mengjuan logistics supervisors 2006-11-20 1 A0003 Zhao Min executive (adjective) clerk 2007-02-16 2 A0005 Huang Ling (1875-1968), overlord of the Ming dynasty research and development hiring out 2009-01-14 3 A0007 Wang Wei (701-761), Tang Dynasty poet the facts of life managers 2006-07-24 4 A0016 Zhang Junbao market (also in abstract) hiring out 2007-08-14 5 A0017 Qin Yu (1919-1992), educator and prolific writer the facts of life 副managers 2008-03-06
Case 5
Question:VLOOKUP function to achieve batch find, VLOOKUP function in general can only find a, then multiple should be how to find it? As shown below, how to list all the consumption of Zhang Yi?
Methods:Inside cell C9: C11 enter the formula = VLOOKUP (B $ 9 & ROW (A1), IF ({1,0}, $B $ 2: $B $ 6 & COUNTIF (INDIRECT ("b2: b" & ROW ($ 2: $ 6))), B $ 9), $C $ 2: $C $ 6), 2,), press SHIFT +CTRL+ENTER to end.
python implementation: vlookup function has two shortcomings (or characteristics of it), one is to find the value must be in the region of the first column, the other is to find only one value, the remaining even if the match is not going to look for the remaining, these two points can be flexibly applied if and indirect functions to solve, but pandas can be done more straightforward some.
df6 = pd.read_excel("", sheet_name='Consumption') df6[df6['Name'] == 'Zhang Yi'][['Name', 'Consumption']] Out[]: name and surname spending 0 Zhang Yi (1958-1981), student held up as a martyr after he died saving an old peasant from a famine 100 2 Zhang Yi (1958-1981), student held up as a martyr after he died saving an old peasant from a famine 300 4 Zhang Yi (1958-1981), student held up as a martyr after he died saving an old peasant from a famine 1000
pivot table
Pivot tables are another Excel godsend, essentially a series of tables reorganized and consolidated. The case used here is from Knowledge, Excel PivotWhat is the use of the view meter:(/question/22484899/answer/39933218 )
Problem: Need to summarize sales and cost totals for each region, for each month, and calculate the profit at the same time.
Through the operation of Excel's pivot table finally realized the following effect:
python implementation:For such a grouping of tasks, the first thing that comes to mind is the pandas groupby, the code is also simple to write, the idea is that the operation of the Excel point of the mouse just now reflected in the code command:
df = pd.read_excel('', sheet_name='Sales statistics table') df['Order month'] = df['Order Date'].apply(lambda x:) df2 = (['Order month', 'Region'])[['Sales', 'Costs']].agg('sum') df2['Profit'] = df2['Sales'] - df2['Costs'] df2 Out[]: total income from sales (manufacturing, production etc) costs margins Ordering month region 1 capital of China at different historical periods 134313.61 94967.84 39345.77 Changshu 177531.47 163220.07 14311.40 Wuxi prefecture level city in Jiangsu 316418.09 231822.28 84595.81 Kunshan 159183.35 145403.32 13780.03 Suzhou prefecture level city in Jiangsu 287253.99 238812.03 48441.96 2 capital of China at different historical periods 187129.13 138530.42 48598.71 Changshu 154442.74 126834.37 27608.37 Wuxi prefecture level city in Jiangsu 464012.20 376134.98 87877.22 Kunshan 102324.46 86244.52 16079.94 Suzhou prefecture level city in Jiangsu 105940.34 91419.54 14520.80 ... ... ... 11 capital of China at different historical periods 286329.88 221687.11 64642.77 Changshu 2118503.54 1840868.53 277635.01 Wuxi prefecture level city in Jiangsu 633915.41 536866.77 97048.64 Kunshan 351023.24 342420.18 8603.06 Suzhou prefecture level city in Jiangsu 1269351.39 1144809.83 124541.56 12 capital of China at different historical periods 894522.06 808959.32 85562.74 Changshu 324454.49 262918.81 61535.68 Wuxi prefecture level city in Jiangsu 1040127.19 856816.72 183310.48 Kunshan 1096212.75 951652.87 144559.87 Suzhou prefecture level city in Jiangsu 347939.30 302154.25 45785.05 [60 rows x 3 columns]
This can also be done using the pivot_table function in pandas:
df3 = pd.pivot_table(df, values=['Sales', 'Costs'], index=['Order month', 'Region'] , aggfunc='sum') df3['Profit'] = df3['Sales'] - df3['Costs'] df3 Out[]: (manufacturing, production etc) costs total income from sales margins Ordering month region 1 capital of China at different historical periods 94967.84 134313.61 39345.77 Changshu 163220.07 177531.47 14311.40 Wuxi prefecture level city in Jiangsu 231822.28 316418.09 84595.81 Kunshan 145403.32 159183.35 13780.03 Suzhou prefecture level city in Jiangsu 238812.03 287253.99 48441.96 2 capital of China at different historical periods 138530.42 187129.13 48598.71 Changshu 126834.37 154442.74 27608.37 Wuxi prefecture level city in Jiangsu 376134.98 464012.20 87877.22 Kunshan 86244.52 102324.46 16079.94 Suzhou prefecture level city in Jiangsu 91419.54 105940.34 14520.80 ... ... ... 11 capital of China at different historical periods 221687.11 286329.88 64642.77 Changshu 1840868.53 2118503.54 277635.01 Wuxi prefecture level city in Jiangsu 536866.77 633915.41 97048.64 Kunshan 342420.18 351023.24 8603.06 Suzhou prefecture level city in Jiangsu 1144809.83 1269351.39 124541.56 12 capital of China at different historical periods 808959.32 894522.06 85562.74 Changshu 262918.81 324454.49 61535.68 Wuxi prefecture level city in Jiangsu 856816.72 1040127.19 183310.48 Kunshan 951652.87 1096212.75 144559.87 Suzhou prefecture level city in Jiangsu 302154.25 347939.30 45785.05 [60 rows x 3 columns]
The parameters index/ columns/ values of pivot_table in pandas correspond to the parameters in Excel (of course, I'm not saying this because it's not just rows/columns/values in a pivot table, but also what else.)
But I personally prefer to use groupby because it is very fast. When I was playing a kaggle competition, there was a table with behavioral information of lenders, about 27 million rows, and I used groupby to calculate a few aggregation functions, and it was done in a few seconds.
groupby is very comprehensive, built-in a lot ofaggregate function, to meet most of the basic needs, if you need some other functions, you can with the use ofapply andlambda.
However, the official documentation of pandas said, after groupby use apply is very slow, aggregate internal optimization, so very fast, apply is not optimized, so it is recommended to think of other ways to have a problem first, really can not be used again when apply.
When I played the tournament, I used groupby's apply in order to generate a new variable, and wrote this: ins['weight'] = ins[['SK_ID_PREV', 'DAYS_ENTRY_PAYMENT']].groupby('SK_ID_PREV').apply(lambda x: 1-abs(x)/().abs()).iloc[:,1], 10 million rows of data, counted for more than ten minutes, and waited me out.
drawings
Because Excel draws diagrams that can interact and perform some simple operations on the diagrams, the python visualization library used here is plotly, and the case will use the assignments from my development economics class this semester, when the diagrams were all drawn in Excel, and now they are drawn again in python. Before you start, first load the plotly package.
import as off import plotly.graph_objs as go off.init_notebook_mode()
histogram
There were a lot of bar graphs drawn in Excel at the time, and one of them was
Here's a drawing using plotly:
df = pd.read_excel("", sheet_name='Enrolment in tertiary education') trace1 = ( x=df['The State'], y=df[1995], name='1995', opacity=0.6, marker=dict( color='powderblue' ) ) trace2 = ( x=df['The State'], y=df[2005], name='2005', opacity=0.6, marker=dict( color='aliceblue', ) ) trace3 = ( x=df['The State'], y=df[2014], name='2014', opacity=0.6, marker=dict( color='royalblue' ) ) layout = (barmode='group') data = [trace1, trace2, trace3] fig = (data, layout) (fig)
radar chart
Drawn in Excel:
Drawn in python:
df = pd.read_excel('', sheet_name='Political governance') theta = () (theta[0]) names = df[''] = [:,0] df = (df) trace1 = ( r=df[0], theta=theta, name=names[0] ) trace2 = ( r=df[1], theta=theta, name=names[1] ) trace3 = ( r=df[2], theta=theta, name=names[2] ) trace4 = ( r=df[3], theta=theta, name=names[3] ) data = [trace1, trace2, trace3, trace4] layout = ( polar=dict( radialaxis=dict( visible=True, range=[0,1] ) ), showlegend=True ) fig = (data, layout) (fig)
It's a lot trickier to draw than Excel.
Overall, it's easiest to use Excel if you're drawing simple, basic shapes, and more appropriate to use python if you're drawing more advanced ones or shapes that require more customization.
This article on the perfect combination of Python and Excel commonly used case summary of the operation of the article is introduced to this, more related to the combination of Python and Excel content, please search for my previous posts or continue to browse the following related articles I hope you will support me in the future!