Introduction to Python Machine Learning (II): Data Understanding

What is statistics? Probability and mathematics. The people analyzed through probability and mathematics are never real people, and guiding real people with conclusions drawn about such abstractions is itself a bias. In this sense, interpretation matters more than technique.

--Liu Dehuan

1. Data import

Training a machine learning model requires a large amount of data, and the most common source is historical data. Such data is usually stored in CSV files or can easily be converted to CSV. The first step in a machine learning project is therefore to import the CSV data file.

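A CSV file is a plain-text file whose fields are separated by commas (,). The format itself has no standard comment syntax, although some tools (for example NumPy's loadtxt()) treat lines beginning with # as comments by convention.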
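As a small illustration of the format, the sketch below parses a few comma-separated lines from an in-memory string using the standard csv module. The three rows are copied from the data preview shown later in this article and stand in for a real file; the variable names are illustrative.

```python
import csv
import io

# Sample CSV text: three rows as displayed in the data preview later in the article.
sample_text = (
    "6,148,72,35,0,33.6,0.63,50,1\n"
    "1,85,66,29,0,26.6,0.35,31,0\n"
    "8,183,64,0,0,23.3,0.67,32,1\n"
)

# csv.reader yields each line as a list of strings split on the delimiter.
rows = list(csv.reader(io.StringIO(sample_text), delimiter=","))

# Values arrive as strings; convert them explicitly when numbers are needed.
numeric_rows = [[float(value) for value in row] for row in rows]

print(len(rows), len(rows[0]))
print(rows[0][:3])
print(numeric_rows[0][:3])
```

Every field is read as a string, which is why the import examples in this article convert the data to floats afterwards.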

The rest of this article uses the Pima Indians dataset, which was obtained from the UCI Machine Learning Repository (/ml/). It can also be downloaded from the web site (/s/1nv2xuVpXWHC1HUdS1c5QaQ), extraction code: d4im.

Pima Indians is a classification dataset: it records medical measurements for Pima Indians together with whether each person developed diabetes within five years.

1.1 Importing data using the Python standard library

Python's standard library provides the csv module for working with CSV files.

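from csv import reader
import numpy as np

# Import data using the Python standard library
filename = 'pima_data.csv'
with open(filename, 'rt') as raw_data:
    readers = reader(raw_data, delimiter=',')
    # Each row is read as a list of strings
    x = list(readers)
    # Convert the rows to a NumPy array of floats
    data = np.array(x).astype('float')
    print(data.shape)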

The code is straightforward: the rows are read as lists of strings and then converted to an array of floats, whose shape is printed.

Run results:

(768, 9)

1.2 Importing data using Numpy

Use NumPy's loadtxt() function to import data. By default this function expects the file to have no header row and all values to share the same data type.

# Import data using NumPy
from numpy import loadtxt

filename = 'pima_data.csv'
with open(filename, 'rt') as raw_data:
    data = loadtxt(raw_data, delimiter=',')
    print(data.shape)

The first argument to loadtxt() is the data source (a file name or an open file object), and the delimiter argument specifies the field separator.

The output is the same as above:

(768, 9)
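If a file does contain a header row, loadtxt() can still read it by skipping that row. The following sketch assumes a hypothetical in-memory file with one header line and one comment line; skiprows and comments are existing loadtxt() parameters, while the sample text is invented for illustration.

```python
import io
import numpy as np

# Hypothetical file contents: one header line, one '#' comment line, two data rows.
sample_text = (
    "preg,plas,pres\n"
    "# rows below are illustrative\n"
    "6,148,72\n"
    "1,85,66\n"
)

# skiprows=1 drops the header line; lines starting with '#' are
# ignored by default because comments='#'.
data = np.loadtxt(io.StringIO(sample_text), delimiter=",", skiprows=1)

print(data.shape)
print(data.dtype)
```

All values end up in a single float array, which matches the requirement that every column share one data type.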

1.3 Importing data using Pandas

To import a CSV file with pandas, use the pandas.read_csv() function, which returns a DataFrame. pandas is widely used in machine learning projects for data processing and preparation, so it is the recommended way to import data.

# Recommended approach
# Use pandas to import data
from pandas import read_csv

filename = 'pima_data.csv'
# Column names to use as the file header
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
print(data.shape)
print(data.head(10))

Importing data with pandas lets you assign column names, which makes the later data-understanding steps easier. Here read_csv() is called with two arguments: the file name and names, a list of column names.

The shape printed is the same as above:

(768, 9)
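To see what the names argument does, the sketch below reads three rows (copied from the preview in section 2.1.1) from an in-memory string instead of pima_data.csv. Passing names without a header argument tells pandas that the file has no header row and supplies the column labels.

```python
import io
from pandas import read_csv

# Three rows as displayed in the data preview; they stand in for the real file.
sample_text = (
    "6,148,72,35,0,33.6,0.63,50,1\n"
    "1,85,66,29,0,26.6,0.35,31,0\n"
    "8,183,64,0,0,23.3,0.67,32,1\n"
)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# With names given and no header argument, the first line is treated as data.
data = read_csv(io.StringIO(sample_text), names=names)

print(data.shape)               # (rows, columns)
print(list(data.columns))       # labels come from names
print(data['mass'].tolist())    # columns can now be selected by label
```

Once the columns are labeled, later steps can refer to attributes such as mass or age by name rather than by position.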

2. Data understanding

To get more accurate results, you need to understand the characteristics of the data, its distribution, and the problem to be solved before building and tuning models.

2.1 Basic Data Attributes

Simply inspecting the data is one of the most effective ways to understand it better. Looking at the raw values can reveal relationships within the data, and these findings help when organizing and preparing it.

2.1.1 Viewing the first 10 rows of data

The dataset used remains the Pima Indians dataset:

from pandas import read_csv

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# View the first ten rows of data
print(data.head(10))

First import the dataset using pandas, then print the result of the DataFrame's head() method to view the first 10 rows.

Output results:

   preg  plas  pres  skin  test  mass  pedi  age  class
0     6   148    72    35     0  33.6  0.63   50      1
1     1    85    66    29     0  26.6  0.35   31      0
2     8   183    64     0     0  23.3  0.67   32      1
3     1    89    66    23    94  28.1  0.17   21      0
4     0   137    40    35   168  43.1  2.29   33      1
5     5   116    74     0     0  25.6  0.20   30      0
6     3    78    50    32    88  31.0  0.25   26      1
7    10   115     0     0     0  35.3  0.13   29      0
8     2   197    70    45   543  30.5  0.16   53      1
9     8   125    96     0     0   0.0  0.23   54      1

2.1.2 Viewing data dimensions, attributes, and types

# Data dimensions:
# the DataFrame's shape attribute gives the number of rows and columns
print(data.shape)

# Data attributes and types:
# the DataFrame's dtypes attribute gives the data type of each column
print(data.dtypes)

Running results:

(768, 9)
preg int64
plas int64
pres int64
skin int64
test int64
mass float64
pedi float64
age int64
class int64
dtype: object

2.1.3 Viewing data descriptive statistics

View descriptive statistics with the DataFrame's describe() method. The summary includes the record count, mean, standard deviation, minimum, lower quartile, median, upper quartile, and maximum. (The code that reads the data is omitted from here on.)

from pandas import set_option

# Descriptive statistics via the DataFrame's describe() method:
# record count, mean, standard deviation, minimum, lower quartile,
# median, upper quartile, maximum
# Set the display width so the table is not wrapped
set_option('display.width', 100)
# Set the number of decimal places displayed
set_option('display.precision', 2)
print("Descriptive Analysis of Data:")
print(data.describe())

Run results:

Descriptive Analysis of Data:
         preg    plas    pres    skin    test    mass    pedi     age   class
count  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00  768.00
mean     3.85  120.89   69.11   20.54   79.80   31.99    0.47   33.24    0.35
std      3.37   31.97   19.36   15.95  115.24    7.88    0.33   11.76    0.48
min      0.00    0.00    0.00    0.00    0.00    0.00    0.08   21.00    0.00
25%      1.00   99.00   62.00    0.00    0.00   27.30    0.24   24.00    0.00
50%      3.00  117.00   72.00   23.00   30.50   32.00    0.37   29.00    0.00
75%      6.00  140.25   80.00   32.00  127.25   36.60    0.63   41.00    1.00
max     17.00  199.00  122.00   99.00  846.00   67.10    2.42   81.00    1.00
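The std row reported by describe() is the sample standard deviation (divisor n - 1), and the 50% row is the median. The sketch below checks this on the ten plas values from the preview in section 2.1.1, using the standard statistics module as an independent reference; the variable names are illustrative.

```python
import statistics
import pandas as pd

# 'plas' values of the first ten rows shown earlier in the article.
plas = pd.Series([148, 85, 183, 89, 137, 116, 78, 115, 197, 125])

summary = plas.describe()

# Sample standard deviation (n - 1 in the denominator) from the standard library.
reference_std = statistics.stdev(plas.tolist())

print(summary['count'], summary['mean'])
print(round(summary['std'], 4), round(reference_std, 4))
print(summary['50%'], plas.median())
```

Because the divisor is n - 1, the value can differ slightly from a population standard deviation computed with divisor n, which matters mostly for small samples.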

2.2 Data correlation and distribution analysis

2.2.1 Data correlation matrix

The correlation between data attributes describes whether two attributes vary together and in what way. It is commonly expressed with the Pearson correlation coefficient, which lies in the range [-1, 1]: 1 and -1 indicate perfect positive and negative linear correlation, and 0 indicates no linear correlation. The performance of some algorithms (e.g., linear regression and logistic regression) degrades when attributes are highly correlated, so it is worth checking the correlations between attributes before modeling. Use the DataFrame corr() method to compute the correlation matrix of the data attributes.

print("Relevance of data attributes:")
print(data.corr(method='pearson'))

The results are as follows:

Relevance of data attributes:
        preg   plas   pres   skin   test   mass   pedi    age  class
preg    1.00   0.13   0.14  -0.08  -0.07   0.02  -0.03   0.54   0.22
plas    0.13   1.00   0.15   0.06   0.33   0.22   0.14   0.26   0.47
pres    0.14   0.15   1.00   0.21   0.09   0.28   0.04   0.24   0.07
skin   -0.08   0.06   0.21   1.00   0.44   0.39   0.18  -0.11   0.07
test   -0.07   0.33   0.09   0.44   1.00   0.20   0.19  -0.04   0.13
mass    0.02   0.22   0.28   0.39   0.20   1.00   0.14   0.04   0.29
pedi   -0.03   0.14   0.04   0.18   0.19   0.14   1.00   0.03   0.17
age     0.54   0.26   0.24  -0.11  -0.04   0.04   0.03   1.00   0.24
class   0.22   0.47   0.07   0.07   0.13   0.29   0.17   0.24   1.00
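For reference, the Pearson coefficient of two attributes is their covariance divided by the product of their standard deviations. The following pure-Python sketch implements that definition; the function name pearson and the sample lists are illustrative and not part of pandas.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))   # rises in step with x
print(pearson(x, [10, 8, 6, 4, 2]))   # falls in step with x
print(pearson(x, [2, 1, 3, 1, 2]))    # no linear trend
```

The three calls illustrate the two extremes and the middle of the range; real attributes, such as those in the matrix above, usually fall somewhere in between.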

2.2.2 Data distribution analysis

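The skew of an attribute shows how far its distribution departs from a symmetric, Gaussian-like shape. Use the DataFrame skew() method to compute the skewness of every attribute: values near 0 indicate a roughly symmetric distribution, positive values a longer right tail, and negative values a longer left tail.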

print("Deviations from the Gaussian distribution of the data:")
print(data.skew())

The results are as follows:

Deviations from the Gaussian distribution of the data:
preg 0.90
plas 0.17
pres -1.84
skin 0.11
test 2.27
mass -0.43
pedi 1.92
age 1.13
class 0.64
dtype: float64
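Skewness can be computed directly from the third central moment. The sketch below uses the adjusted Fisher-Pearson formula with a sample-size correction; that this matches the estimator behind skew() is an assumption here (pandas describes its result as unbiased skew), and the function name sample_skew and the sample lists are illustrative.

```python
import math

def sample_skew(values):
    """Adjusted Fisher-Pearson skewness (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # uncorrected skewness
    return math.sqrt(n * (n - 1)) / (n - 2) * g1    # sample-size correction

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]
left_tailed = [-10, -3, -2, -2, -1, -1]

print(sample_skew(symmetric))      # zero: symmetric around the mean
print(sample_skew(right_tailed))   # positive: long tail on the right
print(sample_skew(left_tailed))    # negative: long tail on the left
```

Read against the results above, attributes such as test and pedi have a long right tail, while pres has a long left tail.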

3. Data visualization

The fastest and most effective way to make sense of data is through its visualization. We will use Matplotlib to visualize the data to better understand it.

3.1 Single chart

3.1.1 Histograms

Histograms are widely used and need little introduction: they group the values of an attribute into bins and show how many values fall into each bin.

from pandas import read_csv
import matplotlib.pyplot as plt

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Histograms: one subplot per attribute
data.hist()
plt.show()
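What a histogram draws can be reproduced numerically: split the value range into equal-width bins and count the values in each. The sketch below does this with numpy.histogram on the ten age values from the preview in section 2.1.1; three bins are chosen only to keep the output short.

```python
import numpy as np

# 'age' values of the first ten rows shown earlier in the article.
age = np.array([50, 31, 32, 21, 33, 30, 26, 29, 53, 54])

# Three equal-width bins between the minimum and maximum value.
counts, edges = np.histogram(age, bins=3)

# Print each bin's range and the number of values that fall into it.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

The bar heights in a plotted histogram are these counts; changing the number of bins changes how much detail the chart shows.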

3.1.2 Density plots

A density plot shows the estimated distribution of a continuous variable. It resembles a histogram whose bars have been smoothed into a continuous curve.

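# Density plots: one subplot per attribute in a 3x3 grid
# (kind='density' relies on SciPy being installed)
data.plot(kind='density', subplots=True, layout=(3, 3), sharex=False, sharey=False)
plt.show()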
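The smooth line in a density plot is typically a kernel density estimate: each data point contributes a small Gaussian bump, and the bumps are averaged. The sketch below is a minimal pure-Python version under that assumption; the fixed bandwidth of 5.0 and the names gaussian_kde_at and age are illustrative, whereas plotting libraries normally choose the bandwidth automatically.

```python
import math

def gaussian_kde_at(x, samples, bandwidth):
    """Kernel density estimate at x: average of Gaussian bumps centred on each sample."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)

# 'age' values of the first ten rows shown earlier in the article.
age = [50, 31, 32, 21, 33, 30, 26, 29, 53, 54]

# Evaluate the density on a grid wide enough to cover the tails.
grid = [i * 0.5 for i in range(0, 201)]          # 0.0, 0.5, ..., 100.0
density = [gaussian_kde_at(x, age, bandwidth=5.0) for x in grid]

# A density should integrate to about 1 (simple Riemann sum with step 0.5).
area = sum(density) * 0.5
print(round(area, 3))
print(grid[density.index(max(density))])         # location of the highest peak
```

A smaller bandwidth gives a bumpier curve that follows individual points, and a larger one gives a smoother curve that hides detail.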

3.1.3 Box plots

A box plot, also known as a box-and-whisker plot, is a statistical chart that shows the dispersion of a set of data.

# Box plots: one subplot per attribute in a 3x3 grid
data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False, sharey=False)
plt.show()
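The elements a box plot draws can be computed directly: the box spans the lower and upper quartiles, the line inside marks the median, and points lying more than 1.5 times the interquartile range (IQR) beyond the box are commonly drawn as outliers. The sketch below assumes that common 1.5 times IQR convention and uses made-up sample values.

```python
import statistics

# Made-up sample with one obvious extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# method='inclusive' interpolates between data points, treating them as the full population.
q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1

# Common convention: whiskers reach at most 1.5 * IQR beyond the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print(q1, median, q3)
print(lower_limit, upper_limit)
print(outliers)
```

In the plotted chart, values flagged this way appear as individual markers beyond the whiskers.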

3.2 Multiple Charts

3.2.1 Correlation matrix diagram

from pandas import read_csv
import matplotlib.pyplot as plt
import numpy as np

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Correlation matrix plot
correlations = data.corr()
fig = plt.figure()
ax = fig.add_subplot(111)
# Draw the matrix as a color grid scaled to the range [-1, 1]
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
# One tick per attribute, labeled with the attribute name
ticks = np.arange(0, 9, 1)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
plt.show()

3.2.2 Scatter Matrix Plots

from pandas import read_csv
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)

# Scatter plot for every pair of attributes
scatter_matrix(data)
plt.show()

Summary

This article covers the preparation needed before starting a machine learning project: importing data, understanding data, and visualizing data. Data can be imported in three ways: the Python standard library, NumPy, and pandas; pandas is the recommended way to import CSV files. Data understanding includes viewing basic attributes of the data as well as its correlation matrix and the skew of its distributions. The data visualization section introduces some common Matplotlib plots.
