What is statistics? Probability and mathematics. What you get by analyzing people with probability and mathematics is never an actual person. Guiding people with conclusions that are never about actual people is really a bias. In this sense, interpretation matters more than technique.
--Liu Dehuan
1. Data import
Training a machine learning model requires a large amount of data, and the most common source is historical data. Such data is usually stored in CSV files or can easily be converted to CSV, so a machine learning project typically begins by importing a CSV file.
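A CSV file is a text file whose fields are separated by commas (,). The CSV format itself defines no comment syntax, but some readers, such as numpy's loadtxt(), treat lines beginning with # as comments by default.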
The rest of this article uses the Pima Indians dataset, which was obtained from the UCI Machine Learning Repository (/ml/). It can also be downloaded from (/s/1nv2xuVpXWHC1HUdS1c5QaQ), extraction code: d4im.
Pima Indians is a classification dataset: it records medical data for Pima Indian patients together with whether each patient developed diabetes within five years.
1.1 Importing data using the Python standard library
Python's standard library provides the csv module for working with CSV files.
# Import data with the Python standard library
from csv import reader
import numpy as np

filename = 'pima_data.csv'
with open(filename, 'rt') as raw_data:
    readers = reader(raw_data, delimiter=',')
    x = list(readers)
    # Convert the rows of strings to a float array
    data = np.array(x).astype('float')
    print(data.shape)
The code is simple: it reads every row with csv.reader, converts the rows to a NumPy float array, and prints the array's shape.
Run results:
(768, 9)
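As a self-contained variant of the snippet above, the sketch below parses a small in-memory sample instead of pima_data.csv. The sample rows and the load_rows helper are hypothetical, introduced only for illustration. It also skips blank lines and lines starting with #, which csv.reader would otherwise return as ordinary rows.

```python
import io
from csv import reader

# Hypothetical three-column sample standing in for pima_data.csv
SAMPLE = """# preg,plas,class
6,148,1
1,85,0

8,183,1
"""


def load_rows(text_file):
    """Parse comma-separated numeric rows, skipping blank and '#' comment lines."""
    rows = []
    for row in reader(text_file, delimiter=','):
        if not row or row[0].lstrip().startswith('#'):
            continue  # csv.reader has no comment handling of its own
        rows.append([float(value) for value in row])
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(len(rows), len(rows[0]))  # number of rows, number of columns
```

Passing an open file object instead of io.StringIO works the same way, since csv.reader accepts any iterable of lines.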
1.2 Importing data using Numpy
Use numpy's loadtxt() function to import data. It expects a file with no header row in which every value has the same data type.
# Import data with NumPy
from numpy import loadtxt

filename = 'pima_data.csv'
with open(filename, 'rt') as raw_data:
    data = loadtxt(raw_data, delimiter=',')
    print(data.shape)
The first argument to loadtxt() is the file object (a file name also works), and the delimiter argument sets the field separator.
The output is the same as above
(768, 9)
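The behaviour of loadtxt() can be tried without the dataset file. This minimal sketch uses a made-up in-memory sample (the values and the three-column layout are assumptions for illustration) whose first line is a # comment, which loadtxt() skips by default.

```python
import io
import numpy as np

# Hypothetical sample; the first line is a comment that loadtxt skips by default
SAMPLE = """# preg,plas,mass
6,148,33.6
1,85,26.6
8,183,23.3
"""

data = np.loadtxt(io.StringIO(SAMPLE), delimiter=',')
print(data.shape)  # (rows, columns)
print(data.dtype)  # every value is parsed as the same type
```

Because the result is a single homogeneous array, integer columns such as preg are stored as floats alongside mass.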
1.3 Importing data using Pandas
To import a CSV file with Pandas, use the pandas.read_csv() function, which returns a DataFrame. Pandas is widely used in machine learning projects for data processing and preparation, so it is the recommended way to import data.
# Recommended: import data with Pandas
from pandas import read_csv

filename = 'pima_data.csv'
# Set the column names (the file has no header row)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.shape)
print(data.head(10))
Importing with Pandas lets you assign column names, which makes the data easier to interpret later. Here read_csv() is called with two arguments: the file name and, through names, a list of column names.
The printed shape is the same as above; the first 10 rows printed by head(10) are shown in section 2.1.1:
(768, 9)
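Why the names argument matters for a file without a header row can be seen in a short sketch. The in-memory sample and its three column names are assumptions made for illustration, not the real dataset.

```python
import io
from pandas import read_csv

# Hypothetical headerless sample with three columns
SAMPLE = "6,148,1\n1,85,0\n8,183,1\n"
names = ['preg', 'plas', 'class']

# Without names (and without header=None), the first data row is used as the header
without_names = read_csv(io.StringIO(SAMPLE))
# With names, every row is kept as data and the columns get readable labels
with_names = read_csv(io.StringIO(SAMPLE), names=names)

print(without_names.shape)
print(with_names.shape)
print(list(with_names.columns))
```

The first call silently loses one record to the header, which is easy to miss on a file with hundreds of rows.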
2. Data comprehension
To get more accurate results, you need to understand the characteristics of the data, its distribution, and the problem to be solved, so that suitable models can be built and optimized.
2.1 Basic Data Attributes
Simply inspecting the data is one of the most effective ways to understand it. Looking at the raw values can reveal relationships inherent in the data, and these findings help in organizing it.
2.1.1 Viewing the first 10 rows of data
The dataset used remains the Pima Indians dataset:
from pandas import read_csv

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# View the first ten rows of data
print(data.head(10))
First import the dataset with pandas, then print the result of the DataFrame's head() method to view the first 10 rows.
Output results:
   preg  plas  pres  skin  test  mass  pedi  age  class
0     6   148    72    35     0  33.6  0.63   50      1
1     1    85    66    29     0  26.6  0.35   31      0
2     8   183    64     0     0  23.3  0.67   32      1
3     1    89    66    23    94  28.1  0.17   21      0
4     0   137    40    35   168  43.1  2.29   33      1
5     5   116    74     0     0  25.6  0.20   30      0
6     3    78    50    32    88  31.0  0.25   26      1
7    10   115     0     0     0  35.3  0.13   29      0
8     2   197    70    45   543  30.5  0.16   53      1
9     8   125    96     0     0   0.0  0.23   54      1
2.1.2 Viewing data dimensions, attributes and types
# Data dimensions: the DataFrame's shape attribute gives the number of rows and columns
print(data.shape)
# Data attributes and types: the DataFrame's dtypes attribute gives the type of each column
print(data.dtypes)
Running results:
(768, 9)
preg int64
plas int64
pres int64
skin int64
test int64
mass float64
pedi float64
age int64
class int64
dtype: object
2.1.3 Viewing data descriptive statistics
View descriptive statistics with the DataFrame's describe() method. The output includes the count, mean, standard deviation, minimum, lower quartile, median, upper quartile and maximum of each attribute. (The code that reads the data is omitted from here on.)
from pandas import set_option

# Descriptive statistics via the DataFrame's describe() method:
# count, mean, standard deviation, minimum, lower quartile, median, upper quartile, maximum
set_option('display.width', 100)  # Set the display width
# Set the displayed precision to two decimal places
set_option('display.precision', 2)
print("Descriptive Analysis of Data:")
print(data.describe())
Run results:
Descriptive Analysis of Data:
         preg    plas    pres    skin    test    mass    pedi     age   class
count  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00
mean     3.85  120.89   69.11   20.54   79.80   31.99    0.47   33.24    0.35
std      3.37   31.97   19.36   15.95  115.24    7.88    0.33   11.76    0.48
min      0.00    0.00    0.00    0.00    0.00    0.00    0.08   21.00    0.00
25%      1.00   99.00   62.00    0.00    0.00   27.30    0.24   24.00    0.00
50%      3.00  117.00   72.00   23.00   30.50   32.00    0.37   29.00    0.00
75%      6.00  140.25   80.00   32.00  127.25   36.60    0.63   41.00    1.00
max     17.00  199.00  122.00   99.00  846.00   67.10    2.42   81.00    1.00
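To make the rows of the describe() output concrete, the following sketch recomputes them for one column of a small made-up DataFrame (the values are illustrative, not taken from the Pima dataset). It relies on pandas' defaults: std is the sample standard deviation (ddof=1) and the quartiles use linear interpolation.

```python
import numpy as np
import pandas as pd

# Hypothetical values for a single attribute
df = pd.DataFrame({'plas': [85.0, 89.0, 116.0, 137.0, 148.0, 183.0]})
summary = df['plas'].describe()

values = df['plas'].to_numpy()
manual = {
    'count': float(len(values)),
    'mean': values.mean(),
    'std': values.std(ddof=1),         # sample standard deviation
    'min': values.min(),
    '25%': np.percentile(values, 25),  # lower quartile
    '50%': np.percentile(values, 50),  # median
    '75%': np.percentile(values, 75),  # upper quartile
    'max': values.max(),
}

for name, value in manual.items():
    print(f"{name:>5}: describe={summary[name]:.2f} manual={value:.2f}")
```

Note that NumPy's own std() defaults to ddof=0, so calling it without ddof=1 gives a slightly smaller value than the one describe() reports.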
2.2 Data correlation and distribution analysis
2.2.1 Data correlation matrix
The correlation of data attributes describes whether two attributes vary together and in which direction. It is commonly expressed with the Pearson correlation coefficient, which lies in the range [-1, 1]: 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation. The performance of some algorithms (e.g., linear regression and logistic regression) degrades when attributes are highly correlated, so it is worth checking the correlation between attributes. Use the DataFrame's corr() method to compute the correlation matrix between data attributes.
print("Relevance of data attributes:")
print(data.corr(method='pearson'))
The results are as follows:
Relevance of data attributes:
        preg   plas   pres   skin   test   mass   pedi    age  class
preg    1.00   0.13   0.14  -0.08  -0.07   0.02  -0.03   0.54   0.22
plas    0.13   1.00   0.15   0.06   0.33   0.22   0.14   0.26   0.47
pres    0.14   0.15   1.00   0.21   0.09   0.28   0.04   0.24   0.07
skin   -0.08   0.06   0.21   1.00   0.44   0.39   0.18  -0.11   0.07
test   -0.07   0.33   0.09   0.44   1.00   0.20   0.19  -0.04   0.13
mass    0.02   0.22   0.28   0.39   0.20   1.00   0.14   0.04   0.29
pedi   -0.03   0.14   0.04   0.18   0.19   0.14   1.00   0.03   0.17
age     0.54   0.26   0.24  -0.11  -0.04   0.04   0.03   1.00   0.24
class   0.22   0.47   0.07   0.07   0.13   0.29   0.17   0.24   1.00
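Each entry of the matrix above is a Pearson coefficient: the covariance of two attributes divided by the product of their standard deviations. As a rough illustration, the sketch below computes one coefficient by hand and compares it with corr(). The two short columns are made-up values and the pearson helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values for two attributes
df = pd.DataFrame({
    'preg': [6.0, 1.0, 8.0, 1.0, 0.0, 5.0],
    'age': [50.0, 31.0, 32.0, 21.0, 33.0, 30.0],
})


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


r_manual = pearson(df['preg'].to_numpy(), df['age'].to_numpy())
r_pandas = df.corr(method='pearson').loc['preg', 'age']
print(round(r_manual, 4), round(r_pandas, 4))
```

The normalising factors (1/n or 1/(n-1)) cancel between numerator and denominator, which is why the helper can work with plain sums.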
2.2.2 Data distribution analysis
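How skewed the data is can be judged by comparing each attribute's distribution with a Gaussian distribution. Use the DataFrame's skew() method to compute the skewness of every attribute: positive values indicate a longer right tail, negative values a longer left tail, and values near 0 indicate little skew.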
print("Deviations from the Gaussian distribution of the data:")
print(data.skew())
The results are as follows:
Deviations from the Gaussian distribution of the data:
preg 0.90
plas 0.17
pres -1.84
skin 0.11
test 2.27
mass -0.43
pedi 1.92
age 1.13
class 0.64
dtype: float64
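The numbers above are sample skewness values. As a sketch of what skew() computes, the code below applies the bias-adjusted Fisher-Pearson formula, which is what pandas uses to my understanding, to a made-up series with a long right tail and compares the result with skew(). The sample_skewness helper is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values with a long right tail
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 10.0])


def sample_skewness(x):
    """Adjusted Fisher-Pearson skewness: third moment scaled by variance**1.5."""
    n = len(x)
    deviations = x - x.mean()
    m2 = (deviations ** 2).mean()  # second central moment
    m3 = (deviations ** 3).mean()  # third central moment
    g1 = m3 / m2 ** 1.5            # biased skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1  # small-sample correction


manual = sample_skewness(values.to_numpy())
print(round(manual, 4), round(values.skew(), 4))
```

The single large value 10.0 pulls the third moment upward, so the result is positive, matching the reading of test and pedi in the output above.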
3. Data visualization
The fastest and most effective way to make sense of data is through its visualization. We will use Matplotlib to visualize the data to better understand it.
3.1 Single chart
3.1.1 Histograms
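Histograms are widely used and need little introduction: they group the values of an attribute into bins and show the count in each bin, which gives a quick view of the attribute's distribution.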
from pandas import read_csv
import matplotlib.pyplot as plt

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Histogram of every attribute
data.hist()
plt.show()
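The binning behind a histogram can be inspected without drawing anything. This small sketch uses numpy's histogram() on a made-up list of ages; the values and the choice of 4 bins are assumptions for illustration (DataFrame.hist() uses 10 bins unless told otherwise).

```python
import numpy as np

# Hypothetical sample of ages
ages = np.array([21, 26, 29, 30, 31, 32, 33, 50, 53, 54])

# Split the range [min, max] into 4 equal-width bins and count values per bin
counts, edges = np.histogram(ages, bins=4)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.2f} to {right:5.2f}: {'#' * count} ({count})")
```

An empty bin, like the third one here, shows up as a gap in the plotted histogram and often hints at separate groups in the data.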
3.1.2 Density plots
A density plot shows the estimated distribution of a continuous variable. It resembles a histogram that has been smoothed into a continuous curve.
# Density plot of every attribute (requires SciPy)
data.plot(kind='density', subplots=True, layout=(3, 3), sharex=False, sharey=False)
plt.show()
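The "smoothed histogram" idea can be sketched with a Gaussian kernel density estimate written directly in NumPy. This illustrates the technique and is not the code pandas runs, since pandas delegates to SciPy and picks the bandwidth automatically. The sample values, the evaluation grid and the bandwidth of 3.0 are all assumptions.

```python
import numpy as np


def gaussian_kde_curve(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth


# Hypothetical values of one continuous attribute
samples = np.array([23.3, 25.6, 26.6, 28.1, 30.5, 31.0, 33.6, 35.3, 43.1])
grid = np.linspace(0.0, 70.0, 701)
density = gaussian_kde_curve(samples, grid, bandwidth=3.0)

# A density integrates to approximately 1 over a wide enough range
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
print(grid[density.argmax()])  # location of the highest point of the curve
```

A smaller bandwidth makes the curve follow individual samples more closely, while a larger one smooths away detail.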
3.1.3 Box plots
A box plot, also known as a box-and-whisker plot, is a statistical chart that shows the dispersion of a set of data. The box spans the lower to the upper quartile with a line at the median, and the whiskers extend toward the extremes of the data.
# Box plot of every attribute
data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False)
plt.show()
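The quantities a box plot draws can be computed directly. The sketch below assumes the common convention, Matplotlib's default as far as I know, that whiskers reach the most extreme points within 1.5 times the interquartile range of the box, with anything beyond drawn as an outlier. The sample values are made up.

```python
import numpy as np

# Hypothetical values of one attribute, including two extreme points
values = np.array([0.0, 23.3, 25.6, 26.6, 28.1, 30.5, 31.0, 33.6, 35.3, 43.1, 67.1])

# The box: lower quartile, median, upper quartile
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box decide what counts as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers)
```

Here the value 0.0 falls below the lower fence, much like the zero entries in the mass column of the dataset, which are likely missing measurements rather than real readings.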
3.2 Multiple Charts
3.2.1 Correlation matrix diagram
from pandas import read_csv
import matplotlib.pyplot as plt
import numpy as np

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Correlation matrix diagram
correlations = data.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# One tick per attribute, labelled with the attribute names
ticks = np.arange(0, 9, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()
3.2.2 Scatter Matrix Plots
from pandas import read_csv
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Scatter plot of every pair of attributes
scatter_matrix(data)
plt.show()
Summary
This article covered the preparation that precedes a machine learning project: importing data, understanding the data, and visualizing it. Three ways to import data were shown: the Python standard library, NumPy, and Pandas. Pandas is the recommended way to import CSV files. Understanding the data includes viewing its basic attributes, the correlation matrix, and the skewness of each attribute relative to a Gaussian distribution. The visualization section introduced some common plots drawn with Matplotlib.