I. Creating the two main pandas data structures
No. | Method | Description |
---|---|---|
1 | pd.Series(object, index=[ ]) | Creates a Series. The object can be a list, an ndarray, a dictionary, or a row or column of a DataFrame. |
2 | pd.DataFrame(data, columns=[ ], index=[ ]) | Creates a DataFrame. columns and index specify the column and row labels, in the order given. |
Example: creating a data table with pandas:
import numpy as np
import pandas as pd
# Missing prices are stored as np.nan; the untidy city strings are kept on purpose.
df = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006],
                   "date": pd.date_range('20130102', periods=6),
                   "city": ['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                   "age": [23, 44, 54, 32, 34, 32],
                   "category": ['100-A', '100-B', '110-A', '110-C', '210-A', '130-F'],
                   "price": [1200, np.nan, 2133, 5433, np.nan, 4432]},
                  columns=['id', 'date', 'city', 'category', 'age', 'price'])
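The two constructors in the table can be sketched on a smaller scale as follows. The labels and values here are made up for illustration; only `pd.Series` and `pd.DataFrame` themselves come from the table above.

```python
import numpy as np
import pandas as pd

# A Series built from a list with explicit labels (labels are illustrative).
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A Series built from a dictionary: the keys become the index.
s_from_dict = pd.Series({"x": 1.5, "y": 2.5})

# A DataFrame built from a dictionary of columns.
# `columns` fixes the column order, `index` supplies the row labels.
small = pd.DataFrame(
    {"id": [1001, 1002, 1003], "price": [1200, np.nan, 2133]},
    columns=["price", "id"],
    index=["r1", "r2", "r3"],
)

# A single column of a DataFrame is itself a Series.
price = small["price"]
print(type(price).__name__)   # Series
print(list(small.columns))    # ['price', 'id']
```

Note that the column order follows `columns`, not the order of the dictionary keys.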
II. Common DataFrame methods
No. | Method | Description |
---|---|---|
1 | df.head() | Returns the first 5 rows of the data (the default) |
2 | df.tail() | Returns the last 5 rows of the data (the default) |
3 | pandas.qcut() | Quantile-based discretization: splits a variable into equal-sized buckets based on rank or on sample quantiles |
4 | pandas.cut() | Bins values into discrete intervals, given either explicit bin edges or a number of equal-width bins |
5 | pandas.date_range() | Returns a time index |
6 | df.apply() | Applies a function along the given axis |
7 | Series.value_counts() | Returns the count of each distinct value |
8 | df.reset_index() | Resets the index to a new one starting from 0. With drop=True the original index is discarded; otherwise it is kept as a column. Often used with groupby() |
Example: resetting the index
df_inner.reset_index()
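A minimal sketch of why `reset_index` is useful, together with `apply`, `value_counts` and `head` from the table. The small DataFrame below is an assumption made for illustration and is not the `df_inner` table used in the example.

```python
import pandas as pd

df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"],
                   "age": [23, 44, 54, 32]})

# Filtering keeps the original labels, so the index is no longer 0..n-1.
older = df[df["age"] > 30]
print(list(older.index))            # [1, 2, 3]

# drop=True discards the old index instead of keeping it as a column.
renumbered = older.reset_index(drop=True)
print(list(renumbered.index))       # [0, 1, 2]

# After groupby, reset_index turns the group keys back into a regular column.
mean_age = df.groupby("city")["age"].mean().reset_index()
print(list(mean_age.columns))       # ['city', 'age']

# apply runs a function on each column (axis=0 is the default).
span = df[["age"]].apply(lambda col: col.max() - col.min())
print(span["age"])                  # 31

# value_counts counts each distinct value; head(n) returns the first n rows.
counts = df["city"].value_counts()
print(counts["beijing"])            # 2
print(len(df.head(2)))              # 2
```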
III. Data indexing
No. | Method | Description |
---|---|---|
1 | .values | Converts a DataFrame to a two-dimensional ndarray |
2 | .append(idx) | Concatenates another Index object, producing a new Index object |
3 | .insert(loc, e) | Inserts an element at position loc |
4 | .delete(loc) | Deletes the element at position loc |
5 | .union(idx) | Computes the union |
6 | .intersection(idx) | Computes the intersection |
7 | .difference(idx) | Computes the set difference, producing a new Index object |
8 | .reindex(index, columns, fill_value, method, limit, copy) | Changes or rearranges the index of a Series or DataFrame. Returns a new object and introduces missing values for any index label that does not currently exist. |
9 | .drop() | Removes the specified row or column labels from a Series or DataFrame. |
10 | .loc[row labels, column labels] | Selects data by label; the first value is the row label and the second is the column label. |
11 | .iloc[row position, column position] | Selects data by integer position, counting from 0. |
Example: Extracting a single row of values by index
df_inner.loc[3]
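The Index operations in the table behave like set operations that return new objects. The sketch below uses two made-up indexes, `a` and `b`, to show this, and then `reindex` and `drop` on a small Series.

```python
import pandas as pd

a = pd.Index(["a", "b", "c"])
b = pd.Index(["b", "c", "d"])

print(list(a.union(b)))          # ['a', 'b', 'c', 'd']
print(list(a.intersection(b)))   # ['b', 'c']
print(list(a.difference(b)))     # ['a']

# Index objects are immutable: insert and delete return new Index objects.
inserted = a.insert(1, "z")
deleted = a.delete(0)
print(list(inserted))            # ['a', 'z', 'b', 'c']
print(list(deleted))             # ['b', 'c']

s = pd.Series([1.0, 2.0, 3.0], index=a)

# reindex returns a new object; labels missing from the original become NaN
# unless fill_value is given.
r = s.reindex(["c", "a", "d"], fill_value=0.0)
print(r.tolist())                # [3.0, 1.0, 0.0]

# drop removes the given labels and also returns a new object.
dropped = s.drop("b")
print(list(dropped.index))       # ['a', 'c']
```

In every case the original `a` and `s` are left unchanged.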
IV. DataFrame methods for selecting and rearranging data
No. | Method | Description |
---|---|---|
1 | df[val] | Selects a single column or a group of columns from a DataFrame. Convenient special cases: a Boolean array (filters rows), a slice (slices rows), or a Boolean DataFrame (sets values based on a condition) |
2 | df.loc[val] | Selects a single row or a group of rows by label |
3 | df.loc[:, val] | Selects a single column or a subset of columns by label |
4 | df.loc[val1, val2] | Selects both rows and columns by label |
5 | df.iloc[where] | Selects a single row or a subset of rows by integer position |
6 | df.iloc[:, where] | Selects a single column or a subset of columns by integer position |
7 | df.iloc[where_i, where_j] | Selects both rows and columns by integer position |
8 | df.at[label_i, label_j] | Selects a single scalar by row and column label |
9 | df.iat[i, j] | Selects a single scalar by row and column position (integers) |
10 | reindex | Selects rows or columns by label |
11 | get_value | Gets a single value by row and column label (deprecated in pandas 0.21 and removed in 1.0; use .at or .iat instead) |
12 | set_value | Sets a single value by row and column label (deprecated in pandas 0.21 and removed in 1.0; use .at or .iat instead) |
Example: using iloc to extract a region of data by position
df_inner.iloc[:3,:2]
With iloc, the numbers before and after the colon are positions rather than index labels. Positions start at 0, so this selects the first three rows and the first two columns.
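The difference between label-based and position-based selection is easiest to see when the row labels are not 0, 1, 2, and so on. The following sketch assumes a small table whose row labels are id values; the data is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["beijing", "shanghai", "guangzhou", "shenzhen"],
     "age": [23, 44, 54, 32],
     "price": [1200, 3299, 2133, 5433]},
    index=[1001, 1002, 1003, 1004],   # id values used as row labels
)

# loc selects by label: both endpoints of a label slice are included.
by_label = df.loc[1001:1003, ["city", "age"]]
print(by_label.shape)            # (3, 2)

# iloc selects by integer position: the stop position is excluded.
by_pos = df.iloc[:3, :2]
print(by_pos.equals(by_label))   # True

# at / iat fetch a single scalar by label or by position.
print(df.at[1002, "age"])        # 44
print(df.iat[1, 1])              # 44

# A Boolean array passed to df[...] filters rows.
over_40 = df[df["age"] > 40]
print(list(over_40.index))       # [1002, 1003]
```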
V. Sorting
No. | Function | Description |
---|---|---|
1 | .sort_index(axis=0, ascending=True) | Sorts by the index of the specified axis |
2 | Series.sort_values(axis=0, ascending=True) | Sorts by value; a Series can only be sorted along axis 0. |
3 | DataFrame.sort_values(by, axis=0, ascending=True) | Sorts by value; by is a label or a list of labels on the axis to sort by. |
Example: sorting by the index
df_inner.sort_index()
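A short sketch of the three sorting calls from the table, on a made-up DataFrame whose index is deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 23, 34], "price": [3000, 5433, 2133]},
                  index=[2, 0, 1])

# Sort rows by index label, ascending or descending.
print(list(df.sort_index().index))                  # [0, 1, 2]
print(list(df.sort_index(ascending=False).index))   # [2, 1, 0]

# Sort a single Series by its values.
print(df["price"].sort_values().tolist())           # [2133, 3000, 5433]

# Sort by age ascending, breaking ties by price descending.
ordered = df.sort_values(by=["age", "price"], ascending=[True, False])
print(list(ordered.index))                          # [0, 2, 1]
```

All of these return new objects and leave `df` in its original order.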
VI. Correlation analysis and statistical analysis
No. | Method | Description |
---|---|---|
1 | .idxmin() | Returns the index label of the minimum value |
2 | .idxmax() | Returns the index label of the maximum value |
3 | .argmin() | Returns the integer position of the minimum value |
4 | .argmax() | Returns the integer position of the maximum value |
5 | .describe() | Computes several summary statistics for each column, giving a quick statistical overview of the data |
6 | .sum() | Calculates the sum of each column |
7 | .count() | Counts the non-NaN values |
8 | .mean() | Calculates the arithmetic mean of the data |
9 | .median() | Calculates the median |
10 | .var() | Calculates the variance of the data |
11 | .std() | Calculates the standard deviation of the data |
12 | .corr() | Calculates the correlation coefficient matrix |
13 | .cov() | Calculates the covariance matrix |
14 | .corrwith() | Calculates the correlation between the columns or rows of a DataFrame and another Series or DataFrame |
15 | .min() | Calculates the minimum value of the data |
16 | .max() | Calculates the maximum value of the data |
17 | .diff() | Calculates first-order differences; useful for time series |
18 | .mode() | Calculates the mode, returning the most frequent value(s) |
19 | .mean() | Calculates the mean value |
20 | .quantile() | Calculates quantiles (q between 0 and 1) |
21 | .isin() | Tests membership in a set of values in a vectorized way; can be used to filter subsets of a Series or of DataFrame columns |
22 | .unique() | Returns an array of the unique values in a Series |
23 | .value_counts() | Calculates how often each value occurs in a Series |
Example: checking whether each value in the city column is beijing
df_inner['city'].isin(['beijing'])
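Two points from the table are easy to get wrong: `idxmax` returns a label while `argmax` returns a position, and `isin` compares values exactly. The sketch below uses invented data with untidy city names to show both. Normalizing the text with `str.strip()` and `str.lower()` before calling `isin` is one possible approach, not necessarily the one the article used to prepare `df_inner`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"city": ["Beijing ", "SH", " guangzhou ", "BEIJING "],
     "age": [23, 44, 54, 32],
     "price": [1200.0, np.nan, 2133.0, 5433.0]},
    index=["a", "b", "c", "d"],
)

# idxmax returns the index label; argmax returns the integer position.
print(df["age"].idxmax())    # c
print(df["age"].argmax())    # 2

# count skips NaN, and mean ignores NaN by default.
print(df["price"].count())   # 3
print(df["price"].mean())    # 2922.0

# corr returns a square matrix with one row and column per numeric column.
corr = df[["age", "price"]].corr()
print(corr.shape)            # (2, 2)

# isin is an exact match, so the raw strings do not match 'beijing'.
raw_match = df["city"].isin(["beijing"])
print(raw_match.tolist())    # [False, False, False, False]

# After stripping whitespace and lowercasing, both Beijing rows match.
city = df["city"].str.strip().str.lower()
clean_match = city.isin(["beijing"])
print(clean_match.tolist())  # [True, False, False, True]
```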
VII. Grouping methods
No. | Method | Description |
---|---|---|
1 | df.groupby() | Grouping function |
2 | pd.cut() | Divides the values being analyzed into intervals according to a numerical indicator, so that each interval can be studied separately to reveal underlying relationships and patterns. |
Example: using groupby
group_by_name = df.groupby('name')
print(type(group_by_name))
The output is:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
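A minimal sketch of grouping and binning together. The DataFrame, the bin edges and the bin labels are all assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", "shenzhen"],
                   "age": [23, 44, 54, 32],
                   "price": [1200, 3299, 2133, 5433]})

# groupby is lazy: it returns a GroupBy object, not a result.
grouped = df.groupby("city")
print(type(grouped).__name__)         # DataFrameGroupBy

# An aggregation such as sum() produces one value per group.
total = grouped["price"].sum()
print(total["beijing"])               # 3333

# cut assigns each age to an interval; intervals are right-closed by default,
# so the bins here are (20, 30], (30, 40] and (40, 60].
bins = [20, 30, 40, 60]
labels = ["20s", "30s", "40+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)
print(df["age_group"].tolist())       # ['20s', '40+', '40+', '30s']

# The binned column can itself be used as a grouping key.
by_age = df.groupby("age_group", observed=True)["price"].mean()
print(by_age["40+"])                  # 2716.0
```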
VIII. Methods for reading and writing text-format data
No. | Method | Description |
---|---|---|
1 | read_csv | Loads delimited data from a file, URL, or file-like object. The default delimiter is a comma |
2 | read_table | Loads delimited data from a file, URL, or file-like object. The default delimiter is a tab ('\t') |
3 | read_fwf | Reads data in fixed-width column format (i.e., no delimiters) |
4 | read_clipboard | Reads data from the clipboard; the clipboard version of read_table. Useful when converting tables from web pages. |
5 | read_excel | Reads tabular data from an Excel XLS or XLSX file |
6 | read_hdf | Reads HDF5 files written by pandas |
7 | read_html | Reads all tables in an HTML document |
8 | read_json | Reads data from a JSON string |
9 | read_msgpack | Reads pandas data encoded in the MessagePack binary format (removed in pandas 1.0) |
10 | read_pickle | Reads any object stored in the Python pickle format |
11 | read_sas | Reads SAS datasets stored in one of the SAS system's own storage formats |
12 | read_sql | Reads the results of a SQL query into a pandas DataFrame |
13 | read_stata | Reads a dataset in Stata file format |
14 | read_feather | Reads the Feather binary file format |
Example: Importing a CSV or xlsx file
df = pd.read_csv('', header=1)
df = pd.read_excel('')
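Because the file names in the example are left blank, the sketch below reads from in-memory text through `io.StringIO` instead of from disk. The sample text is invented. It shows what `header=1` does: the line at position 1 is used for the column names and the line before it is skipped.

```python
import io
import pandas as pd

# Line 0 is a note, line 1 holds the column names, the rest is data.
raw = "exported 2013-01-08\nid,city,price\n1001,beijing,1200\n1002,shanghai,\n"

df = pd.read_csv(io.StringIO(raw), header=1)
print(list(df.columns))               # ['id', 'city', 'price']

# An empty field is read as a missing value (NaN).
print(df["price"].isna().tolist())    # [False, True]

# read_table works the same way but splits on tabs by default.
tsv = pd.read_table(io.StringIO("id\tcity\n1001\tbeijing\n"))
print(list(tsv.columns))              # ['id', 'city']
```

Passing a path or URL instead of the `StringIO` object reads from that location in the same way.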
IX. Dealing with missing data
No. | Method | Description |
---|---|---|
1 | .fillna(value, method, limit, inplace) | Fills in missing values |
2 | .dropna() | Drops missing data |
3 | .info() | Shows information about the data, including the name of each field, its number of non-null values, and its data type |
4 | .isnull() | Returns an object (Series or DataFrame) of the same shape with Boolean values indicating which values are missing |
Example: viewing basic information about a data table (dimensions, column names, data types, etc.)
df.info()
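The four methods in the table are typically used together: inspect, count the gaps, then either fill or drop. A minimal sketch, assuming a small table with one missing price and using the column mean as the fill value (an illustrative choice, not a recommendation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1001, 1002, 1003],
                   "price": [1200.0, np.nan, 2133.0]})

# isnull marks missing cells; summing the Booleans counts them per column.
missing = df.isnull().sum().to_dict()
print(missing)                       # {'id': 0, 'price': 1}

# fillna returns a new object unless inplace=True is passed.
filled = df.fillna({"price": df["price"].mean()})
print(filled["price"].tolist())      # [1200.0, 1666.5, 2133.0]

# dropna removes every row that contains a missing value.
complete = df.dropna()
print(len(complete))                 # 2

# info prints column names, non-null counts and dtypes.
df.info()
```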
X. Data conversion
No. | Method | Description |
---|---|---|
1 | .replace(old, new) | Replaces old values with new ones; old and new can be lists to replace several values at once. Returns a new object by default; pass inplace=True to modify the existing object in place. |
2 | .duplicated() | Determines whether each row duplicates an earlier row, returning a Boolean Series. |
3 | .drop_duplicates() | Removes duplicate rows and returns the de-duplicated DataFrame. |
Example: dropping the later occurrences of duplicate values:
df['city'].drop_duplicates()
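A short sketch of the three conversion methods on invented data. The `keep="last"` argument is shown as the counterpart to the default, which keeps the first occurrence.

```python
import pandas as pd

df = pd.DataFrame({"city": ["beijing", "sh", "beijing", "shenzhen"],
                   "age": [23, 44, 23, 32]})

# replace returns a new object; lists replace several values pairwise.
fixed = df["city"].replace(["sh"], ["shanghai"])
print(fixed.tolist())     # ['beijing', 'shanghai', 'beijing', 'shenzhen']

# duplicated flags rows that repeat an earlier row.
dup = df.duplicated()
print(dup.tolist())       # [False, False, True, False]

# drop_duplicates keeps the first occurrence by default.
first_kept = df.drop_duplicates()
print(list(first_kept.index))   # [0, 1, 3]

# keep="last" keeps the last occurrence instead.
last_kept = df["city"].drop_duplicates(keep="last")
print(list(last_kept.index))    # [1, 2, 3]
```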
This article summarizes some common pandas methods. It does not cover the basic concepts, such as what a Series or a DataFrame is, which have to be learned separately. Once those basics are clear, the methods listed here should make data processing and analysis with pandas much more comfortable.