
Statistics for Data Science with Python [Beginner’s Guide]

Published: 14th Jul, 2023
Read time: 18 Mins

Statistics is a well-known field primarily concerned with data collection, organization, analysis, interpretation, and visualization. In the past, statisticians, economists, and business leaders used statistics to calculate and present relevant information in their respective fields. Today, statistics plays a crucial role in several domains, including data science, machine learning, business intelligence, and computer science. One of the first steps in learning data science is to become familiar with statistics and maths; the next step is to learn how to code. In this blog, we will discuss statistics for data science using Python. Let's start!

Why Python for Statistics?

Python's ease of use and straightforward syntax are two of the key reasons it is so popular in the scientific and research fields. Python is an important tool in the data analyst's toolkit because it is designed for repetitive activities and data processing, and anyone who has worked with big volumes of data understands how often repetition occurs. Probability and statistics for data science are easy to implement on our datasets using Python. Because a tool performs the menial labor, data analysts can focus on the more intriguing and rewarding aspects of their jobs. Statistics for data science with Python and applied statistics with Python play a vital role in paving the path of a data scientist.

Some of the primary reasons for using Python for statistical analysis are as follows:

1. Open-source Python statistics libraries

There are numerous open-source Python libraries and statistics packages for data manipulation, data visualization, statistics, mathematics, machine learning, and natural language processing. Pandas, Matplotlib, scikit-learn, and SciPy are examples of Python libraries commonly used for statistics.

2. Fewer lines of code

Python gives programmers the advantage of needing fewer lines of code to complete tasks than older languages require. With Python statistics, you can accomplish outstanding data analysis in relatively few lines of code.

3. Great support

Python, fortunately, has a significant following and is widely used in academic and industry circles, so there are many excellent analytics libraries available. Python users in need of assistance can always turn to Stack Overflow, mailing lists, and user-contributed code and documentation. And as Python grows in popularity, more users will share their experiences, resulting in more free assistance material. It's no surprise that Python's popularity is growing! The Data Science Professional Certificate can help you learn about the fundamental data types with descriptive analysis methods, Series, and DataFrames.

Understanding Descriptive Statistics

Descriptive statistics, in general, refers to describing data using representative methods such as charts, tables, Excel files, etc. The data is described in such a way that it communicates relevant information, which can also be utilized to predict future trends. Univariate analysis describes and summarises a single variable, bivariate analysis describes the statistical relationship between two variables, and multivariate analysis describes the statistical relationship between many variables. In Python, descriptive statistics can be computed with a few library calls, as shown later in this article.

A) Types of Measures

Descriptive statistics are classified into two types:

1. Measure of central tendency

The central tendency measure is a single value that seeks to describe the entire set of data. The three main measures of central tendency are as follows:

a. Mean

It is calculated by dividing the sum of the observations by the total number of observations. In other words, it is the sum divided by the count.
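
In formula form, for n observations x1, x2, …, xn:

Mean = (x1 + x2 + … + xn) / n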

b. Median

For an ordered data set of n values, the median sits at position (n + 1) / 2.

It is the data set's middle value. It divides the data into two halves. If the number of items in the data set is odd, the center element is the median; otherwise, the median is the average of the two center elements.

c. Mode

It is the most frequently occurring value in the given data collection. If all data points occur with the same frequency, the data set may not have a mode, and if two or more data points share the highest frequency, the data set has several modes.

2. Measure of variability

The spread of data, or how well our data is dispersed, is a measure of variability. The most common measures of variability are:

a. Standard deviation 

It is calculated by taking the square root of the variance: first find the mean, subtract each value from the mean and square the result, add these squared differences, divide by the number of data points (or by one less than that for a sample), and finally take the square root.
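
In short, once the variance is known:

Standard deviation = √(Variance)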

b. Range 

The range represents the difference between the largest and smallest data points in our data set. The range is proportional to the spread of data, so the wider the range, the wider the spread of data, and vice versa.

Range = Largest data value – smallest data value

c. Variance

It is defined as the average squared deviation from the mean. It is determined by squaring the difference between each data point and the mean, adding all of these squared differences, and then dividing by the number of data points in our data collection (or by one less than that when working with a sample).
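
In formula form, for a sample of n values x1, …, xn with mean x̄ (this sample form, dividing by n − 1, is what Python's statistics.variance() computes; divide by n for a full population):

Variance = [(x1 − x̄)² + (x2 − x̄)² + … + (xn − x̄)²] / (n − 1)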

[Figure: Measures of variability]

B) Population and Samples

In statistics, the population is the collection of all the elements or items you are interested in. Populations are frequently large, making them unsuitable for data collection and analysis. That is why statisticians typically attempt to draw conclusions about a population by selecting and analyzing a representative subset of that group.

This subset of a population is referred to as a sample. Ideally, the sample should preserve the population's key statistical traits to a reasonable degree, so that you can draw conclusions about the population based on the sample.
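
As a quick illustration of sampling (a minimal sketch; the population values and sample size below are made up for the example), NumPy's random.choice() can draw a random subset from a larger array:

import numpy as np

population = np.arange(100_000)   # hypothetical population of 100,000 values
sample = np.random.choice(population, size=500, replace=False)   # random sample of 500 values
# the sample mean should be close to the population mean
print(sample.mean(), population.mean())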

C) Outliers

A data point that deviates significantly from the rest of the data in a sample or population is referred to as an outlier.

Outliers can have a variety of causes; here are a handful to get you started:

  • Natural data variation
  • Changes in the observed system's behavior
  • Data gathering errors

Outliers are frequently caused by data-gathering problems in particular.

Note: Outliers do not have a precise mathematical definition. To decide if a data point is an outlier and how to treat it, you must rely on your expertise, understanding of the area of interest, and common sense.
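
Since there is no single rule, one common heuristic (an assumption used here for illustration, not the only approach) is to flag values that fall more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch with a made-up sample:

import numpy as np

data = np.array([2, 3, 3, 4, 5, 5, 6, 7, 40])   # hypothetical sample with one extreme value
q1, q3 = np.percentile(data, [25, 75])           # first and third quartiles
iqr = q3 - q1                                    # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # common cut-off bounds
outliers = data[(data < lower) | (data > upper)]
print(outliers)   # [40]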

Skills covered, like descriptive statistics and inferential statistics, make our Data Science Bootcamp worth it when you are looking to take your data science career to the next level.

Choosing Python Statistics Libraries 

There are numerous Python statistics libraries available for use, but in this tutorial, you'll learn about some of the more popular and extensively used ones:

1. Python's statistics module

It is a built-in Python module for descriptive statistics. You can utilize it if your datasets are not too large or if you cannot rely on importing other libraries.

2. NumPy

It is a third-party numerical computing package that is optimized for working with single- and multi-dimensional arrays. Its primary type is an array known as ndarray. This package offers a large number of statistical analysis routines.

3. SciPy

It is a NumPy-based third-party library for scientific computing. It provides more capabilities than NumPy, such as scipy.stats for statistical analysis.

4. Pandas

It is a NumPy-based third-party library for numerical computing. It excels at labelled one-dimensional (1D) data handling with Series objects and two-dimensional (2D) data handling with DataFrame objects.

5. Matplotlib

It is a third-party data visualization package. It is useful in conjunction with NumPy, SciPy, and Pandas.

Getting Started with Python Statistics Libraries

The built-in Python statistics library includes only a subset of the most important statistics routines. If you can rely only on the standard library and cannot install third-party packages, it might be the best option.

If you want to learn Pandas, the official Getting Started page is an excellent place to begin. Matplotlib has a comprehensive official User’s Guide that you can use to dive into the details of using the library.

Let’s start using these Python statistics libraries!

Calculating Descriptive Statistics in Python

Python statistical modules provide simple and effective techniques for interacting with data.

Let's get our hands dirty by implementing these libraries and techniques in Python.

1. Measures of Central Tendency

 a. Mean

import statistics
# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]
# using mean() to calculate average of list
# elements
print ("The average of list values is : ",end="")
print (statistics.mean(li))

Output:

The average of list values is : 2

b. Median

from statistics import median
from fractions import Fraction as fr
data1 = (2, 3, 4, 5, 7, 9, 11)
# tuple of floating point values
data2 = (2.4, 5.1, 6.7, 8.9)
# tuple of fractional numbers
data3 = (fr(1, 2), fr(44, 12), fr(10, 3), fr(2, 3))
data4 = (-5, -1, -12, -19, -3)
data5 = (-1, -2, -3, -4, 4, 3, 2, 1)
# Printing the median of above datasets
print("Median of data-set 1 is % s" % (median(data1)))
print("Median of data-set 2 is % s" % (median(data2)))
print("Median of data-set 3 is % s" % (median(data3)))
print("Median of data-set 4 is % s" % (median(data4)))
print("Median of data-set 5 is % s" % (median(data5)))

Output:

Median of data-set 1 is 5
Median of data-set 2 is 5.9
Median of data-set 3 is 2
Median of data-set 4 is -5
Median of data-set 5 is 0.0

c. Mode

from statistics import mode
from fractions import Fraction as fr
# tuple of positive integer numbers
data1 = (2, 3, 3, 4, 5, 5, 5, 5, 6, 6, 6, 7)
# tuple of a set of floating point values
data2 = (2.4, 1.3, 1.3, 1.3, 2.4, 4.6)
# tuple of a set of fractional numbers
data3 = (fr(1, 2), fr(1, 2), fr(10, 3), fr(2, 3))
# tuple of a set of negative integers
data4 = (-1, -2, -2, -2, -7, -7, -9)
# tuple of strings
data5 = ("red", "blue", "black", "blue", "black", "black", "brown")
# Printing out the mode of the above data-sets
print("Mode of data set 1 is % s" % (mode(data1)))
print("Mode of data set 2 is % s" % (mode(data2)))
print("Mode of data set 3 is % s" % (mode(data3)))
print("Mode of data set 4 is % s" % (mode(data4)))
print("Mode of data set 5 is % s" % (mode(data5)))

Output:

Mode of data set 1 is 5
Mode of data set 2 is 1.3
Mode of data set 3 is 1/2
Mode of data set 4 is -2
Mode of data set 5 is black

2. Measure of variability

a. Range

# Sample Data
arr = [1, 2, 3, 4, 5]
#Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)
# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
 Maximum, Minimum, Range))

Output:

Maximum = 5, Minimum = 1 and Range = 4

b. Variance

# Python code to demonstrate variance()
# function on varying range of data-types
# importing statistics module
from statistics import variance
# importing fractions as parameter values
from fractions import Fraction as fr
# tuple of a set of positive integers
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)
# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)
# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
# tuple of a set of fractional numbers
sample4 = (fr(1, 2), fr(2, 3), fr(3, 4),
fr(5, 6), fr(7, 8))
# tuple of a set of floating point values
sample5 = (1.23, 1.45, 2.1, 2.2, 1.9)
# Print the variance of each samples
print("Variance of Sample1 is % s " % (variance(sample1)))
print("Variance of Sample2 is % s " % (variance(sample2)))
print("Variance of Sample3 is % s " % (variance(sample3)))
print("Variance of Sample4 is % s " % (variance(sample4)))
print("Variance of Sample5 is % s " % (variance(sample5)))

Output:

Variance of Sample1 is 15.80952380952381
Variance of Sample2 is 3.5
Variance of Sample3 is 61.125
Variance of Sample4 is 1/45
Variance of Sample5 is 0.17613000000000006

c. Standard Deviation

from statistics import stdev
# importing fractions as parameter values
from fractions import Fraction as fr
# creating a varying range of sample sets
# numbers are spread apart but not very much
sample1 = (1, 2, 5, 4, 8, 9, 12)
# tuple of a set of negative integers
sample2 = (-2, -4, -3, -1, -5, -6)
# tuple of a set of positive and negative numbers
# data-points are spread apart considerably
sample3 = (-9, -1, -0, 2, 1, 3, 4, 19)
# tuple of a set of floating point values
sample4 = (1.23, 1.45, 2.1, 2.2, 1.9)
# Print the standard deviation of
# following sample sets of observations
print("The Standard Deviation of Sample1 is % s"
 % (stdev(sample1)))
print("The Standard Deviation of Sample2 is % s"
 % (stdev(sample2)))
print("The Standard Deviation of Sample3 is % s"
 % (stdev(sample3)))
print("The Standard Deviation of Sample4 is % s"
 % (stdev(sample4)))

Output:

The Standard Deviation of Sample1 is 3.9761191895520196
The Standard Deviation of Sample2 is 1.8708286933869707
The Standard Deviation of Sample3 is 7.8182478855559445
The Standard Deviation of Sample4 is 0.4196784483387

3. Summary of Descriptive Statistics

SciPy and Pandas provide useful techniques for obtaining descriptive statistics rapidly with a single function or method call. You can use scipy.stats.describe() in the following way:
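
Note that y here is assumed to be a one-dimensional numeric dataset created earlier and that scipy.stats has been imported; the original values are not shown, but a setup consistent with the output below might look like this:

>>> import numpy as np
>>> import scipy.stats
>>> y = np.array([-5.0, -1.1, 0.1, 2.0, 8.0, 12.8, 21.0, 25.8, 41.0])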

>>> result = scipy.stats.describe(y, ddof=1, bias=False)
>>> result
DescribeResult(nobs=9, minmax=(-5.0, 41.0), mean=11.622222222222222, variance=228.75194444444446, skewness=0.9249043136685094, kurtosis=0.14770623629658886)

The dataset must be provided as the first input. A NumPy array, list, tuple, or equivalent data structure can be used as the parameter. You can omit ddof=1 because it is the default and solely affects the variance calculation. Pass bias=False to force statistical bias correction of skewness and kurtosis.

describe() returns an object that holds the following descriptive statistics:

  1. nobs: the number of observations or elements in your dataset
  2. minmax: the tuple with the minimum and maximum values of your dataset
  3. mean: the mean of your dataset
  4. variance: the variance of your dataset
  5. skewness: the skewness of your dataset
  6. kurtosis: the kurtosis of your dataset

You can access particular values with dot notation:

>>> result.nobs
9
>>> result.minmax[0] # Min
-5.0
>>> result.minmax[1] # Max
41.0
>>> result.mean
11.622222222222222
>>> result.variance
228.75194444444446
>>> result.skewness
0.9249043136685094
>>> result.kurtosis
0.14770623629658886

A descriptive statistics summary for your dataset is simply one function call away with SciPy.

Pandas has similar, if not better, functionality. Series objects have the method .describe():
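
Here, z is assumed to be a Pandas Series wrapping the same data as y above, for instance:

>>> import pandas as pd
>>> z = pd.Series(y)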

>>> result = z.describe()
>>> result
count 9.000000
mean 11.622222
std 15.124548
min -5.000000
25% 0.100000
50% 8.000000
75% 21.000000
max 41.000000
dtype: float64

It returns a new Series that holds the following:

  1. count: the number of elements in your dataset
  2. mean: the mean of your dataset
  3. std: the standard deviation of your dataset
  4. min and max: the minimum and maximum values of your dataset
  5. 25%, 50%, and 75%: the quartiles of your dataset

If you want the resulting Series object to contain other percentiles, then you should specify the value of the optional parameter percentiles. You can access each item of result with its label:

>>> result['mean']
11.622222222222222
>>> result['std']
15.12454774346805
>>> result['min']
-5.0
>>> result['max']
41.0
>>> result['25%']
0.1
>>> result['50%']
8.0
>>> result['75%']
21.0
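
If you need percentiles other than the default quartiles, .describe() also accepts an optional percentiles argument; a quick usage sketch (output omitted):

>>> z.describe(percentiles=[0.10, 0.90])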

That’s how you can get descriptive statistics of a Series object with a single method call using Pandas.

4. Measures of Correlation Between Pairs of Data

You'll frequently need to investigate the relationship between corresponding elements of two variables in a dataset. Assume you have two variables, x and y, each with an equal number of elements, n. Let x1 from x correspond to y1 from y, x2 from x correspond to y2 from y, and so on. Then you can say there are n pairs of corresponding elements: (x1, y1), (x2, y2), and so on.

You'll notice the following correlation measures between pairs of data:

A positive correlation exists when higher x values correspond to higher y values and vice versa.

When bigger values of x correspond to smaller values of y and vice versa, there is a negative correlation.

If there is no obvious association, there is a weak or no correlation.

[Figure: Scatter plots illustrating negative, weak, and positive correlation]

The plot with red dots on the left demonstrates a negative correlation. The plot with the green dots in the centre demonstrates a weak association. Finally, the figure with blue dots on the right demonstrates a positive association.

Covariance and the correlation coefficient are two statistics that assess the correlation between datasets. Let's create some data to go along with these metrics. You'll build two Python lists and use them to get NumPy arrays and Pandas Series:

>>> import numpy as np
>>> import pandas as pd
>>> x = list(range(-10, 11))
>>> y = [0, 2, 2, 2, 2, 3, 3, 6, 7, 4, 7, 6, 6, 9, 4, 5, 5, 10, 11, 12, 14]
>>> x_, y_ = np.array(x), np.array(y)
>>> x__, y__ = pd.Series(x_), pd.Series(y_)

a. Covariance

The sample covariance is a quantitative assessment of the intensity and direction of a relationship between two variables:

If the correlation is positive, the covariance is also positive. A higher covariance value indicates a stronger association.

If the correlation is negative, the covariance is also negative. A lower covariance value (that is, one larger in absolute terms) represents a stronger association.

The covariance is close to zero when the correlation is weak.

This is how you can calculate the covariance in pure Python:

>>> n = len(x)
>>> mean_x, mean_y = sum(x) / n, sum(y) / n
>>> cov_xy = (sum((x[k] - mean_x) * (y[k] - mean_y) for k in range(n))
... / (n - 1))
>>> cov_xy
19.95

First, you have to find the mean of x and y. Then, you apply the mathematical formula for the covariance.

NumPy has the function cov() that returns the covariance matrix:

>>> cov_matrix = np.cov(x_, y_)
>>> cov_matrix
array([[38.5 , 19.95 ],
[19.95 , 13.91428571]])

b. Coefficient of Correlation

The symbol r represents the correlation coefficient, often known as the Pearson product-moment correlation coefficient. The coefficient is yet another measure of data correlation. Consider it to be a standardised covariance. Here are some key facts regarding it:

The value of r is always between −1 and 1. Values close to 1 indicate a strong positive correlation, values close to −1 indicate a strong negative correlation, and values near 0 indicate a weak or no linear relationship.
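
With the x_ and y_ arrays created earlier, one quick way to compute r is scipy.stats.pearsonr() (np.corrcoef() is another option); a minimal sketch:

>>> import scipy.stats
>>> r, p = scipy.stats.pearsonr(x_, y_)   # r is about 0.86 for the x and y above: a strong positive correlation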

Working With 2D Data in Python

2D Data Processing

Statisticians frequently work with two-dimensional data. Here are some 2D data format examples:

  • Tables in a database
  • CSV documents

In addition to Excel, Calc, and Google spreadsheets, NumPy and SciPy offer a comprehensive way to interact with 2D data. Pandas provides a class called DataFrame specifically designed to handle 2D labelled data.

1. Axes

Begin by making a 2D NumPy array:

>>> a = np.array([[1, 1, 1],
... [2, 3, 1],
... [4, 9, 2],
... [8, 27, 4],
... [16, 1, 1]])
>>> a
array([[ 1, 1, 1], 
[ 2, 3, 1], 
[ 4, 9, 2], 
[ 8, 27, 4], 
[16, 1, 1]])

You now have a 2D dataset to work with in this part. You can use Python statistics functions and techniques on it in the same way that you would on 1D data:

>>> np.mean(a)
5.4
>>> a.mean()
5.4
>>> np.median(a)
2.0
>>> a.var(ddof=1)
53.40000000000001

The functions and methods you've used so far include one optional argument called axis, which is essential when working with 2D data. Axis can have any of the following values:

  • axis=None instructs the programme to compute statistics across all data in the array. This is how the examples above operate. In NumPy, this is frequently the default behaviour.
  • axis=0 instructs the programme to compute statistics across all rows, that is, for each column of the array. This is frequently the default behaviour of SciPy statistical functions.
  • axis=1 says to calculate the statistics across all columns, that is, for each row of the array.

Let’s see axis=0 in action with np.mean():

>>> np.mean(a, axis=0)
array([6.2, 8.2, 1.8])
>>> a.mean(axis=0)
array([6.2, 8.2, 1.8])

The two statements above return new NumPy arrays with the mean for each column of a. In this example, the mean of the first column is 6.2. The second column has the mean 8.2, while the third has 1.8.

If you provide axis=1 to mean(), then you’ll get the results for each row:

>>> np.mean(a, axis=1)
array([ 1., 2., 5., 13., 6.])
>>> a.mean(axis=1)
array([ 1., 2., 5., 13., 6.])

As you can see, the first row of a has the mean 1.0, the second 2.0, and so on.

2. DataFrames

One of the fundamental Pandas data types is the DataFrame class. It's incredibly easy to use because it provides labels for rows and columns. Create a DataFrame with the array a:

>>> row_names = ['first', 'second', 'third', 'fourth', 'fifth']
>>> col_names = ['A', 'B', 'C']
>>> df = pd.DataFrame(a, index=row_names, columns=col_names)
>>> df
 A B C
first 1 1 1
second 2 3 1
third 4 9 2
fourth 8 27 4
fifth 16 1 1

Though the functionality differs, DataFrame methods are fairly similar to Series methods. When you invoke Python statistics methods without any arguments, the DataFrame will return the following results for each column:

>>> df.mean()
A 6.2
B 8.2
C 1.8
dtype: float64
>>> df.var()
A 37.2
B 121.2
C 1.7
dtype: float64

DataFrame objects, like Series, have a .describe() method that provides another DataFrame with a statistical summary of all columns:

>>> df.describe()
 A B C
count 5.00000 5.000000 5.00000
mean 6.20000 8.200000 1.80000
std 6.09918 11.009087 1.30384
min 1.00000 1.000000 1.00000
25% 2.00000 1.000000 1.00000
50% 4.00000 3.000000 1.00000
75% 8.00000 9.000000 2.00000
max 16.00000 27.000000 4.00000

The summary contains the following results:

  1. count: the number of items in each column
  2. mean: the mean of each column
  3. std: the standard deviation
  4. min and max: the minimum and maximum values
  5. 25%, 50%, and 75%: the percentiles

To learn more about these methods in data science, have a look at the Data Science Professional Certificate.

Visualizing Data in Python

In addition to calculating numerical summaries such as the mean, median, and variance, you can use visual methods to present, describe, and summarise data. In this section, you will learn how to present your data visually using the following graphs:

  1. Box plots
  2. Histograms
  3. Pie charts
  4. Bar charts
  5. X-Y plots
  6. Heatmaps

Although matplotlib.pyplot is a very useful and commonly used library, it is not the only Python library available for this purpose. You can import it as follows:

>>> import matplotlib.pyplot as plt
>>> plt.style.use('ggplot')

Pseudo-random numbers will be used to produce the data. This section does not require a prior understanding of random numbers; you simply need some arbitrary numbers, and pseudo-random number generators are a convenient way to get them. The np.random package creates arrays of pseudo-random numbers:

np.random.randn() generates normally distributed numbers.
np.random.randint() generates uniformly distributed integers.

1. Box Plots

 The box plot is an effective tool for visually showing descriptive statistics in a given dataset. You may see the range, interquartile range, median, mean, outliers, and all quartiles. First, gather some data to depict using a box plot:

>>> np.random.seed(seed=0)
>>> x = np.random.randn(1000)
>>> y = np.random.randn(100)
>>> z = np.random.randn(10)

The first statement uses seed() to set the seed of the NumPy random number generator, guaranteeing that the results are consistent each time the code is executed. You do not need to set the seed, but if you do not, the outcomes will vary each time.

The remaining statements generate three NumPy arrays of normally distributed pseudo-random numbers: x is a 1000-item array, y is a 100-item array, and z is a 10-item array. Now that you have the data, you can call .boxplot() to get a box plot:

fig, ax = plt.subplots()
ax.boxplot((x, y, z), vert=False, showmeans=True, meanline=True,
 labels=('x', 'y', 'z'), patch_artist=True,
 medianprops={'linewidth': 2, 'color': 'purple'},
 meanprops={'linewidth': 2, 'color': 'red'})
plt.show()

The parameters of .boxplot() define the following:

  • x represents your data.
  • vert sets the plot orientation to horizontal when False. The default orientation is vertical.
  • showmeans displays the mean of your data when True.
  • meanline represents the mean as a line when True. The default representation is a point.
  • labels: the labels of your data.
  • patch_artist determines how the boxes are drawn.
  • medianprops denotes the properties of the line that represents the median.
  • meanprops denotes the properties of the line or dot that represents the mean.

The code above generates the following image:

[Figure: Box plots of x, y, and z]

There are three box plots visible. Each one corresponds to a single dataset (x, y, or z) and demonstrates the following:

  • The mean is shown by the red dashed line.
  • The purple line represents the median.
  • The left border of the blue rectangle represents the first quartile.
  • The right border of the blue rectangle represents the third quartile.
  • The length of the blue rectangle is the interquartile range.
  • Everything from left to right is included in the range.
  • The dots on the left and right are outliers.

2. Histograms

Histograms are very useful when a dataset has many unique values. The histogram separates a sorted dataset's values into intervals known as bins. All bins are frequently of similar width, but this is not always the case. The bin edges are the values of a bin's bottom and upper boundaries.

Each bin is allocated a single frequency value. It is the number of elements in the dataset that have values between the edges of the bin. All save the rightmost bin are, as is customary, half-open. They include values equal to the lower borders but omit values equal to the upper boundaries. The rightmost bin is closed since it encompasses both borders. If you split a dataset with the bin edges 0, 5, 10, and 15, you get three bins.

  • The values larger than or equal to 0 and less than 5 are found in the first and leftmost bin.
  • The numbers more than or equal to 5 and less than 10 are in the second bin.
  • The values larger than or equal to 10 and less than or equal to 15 are found in the third and rightmost bin.

The method np.histogram() provides an easy way to obtain data for histograms:

>>> hist, bin_edges = np.histogram(x, bins=10)
>>> hist
array([ 9, 20, 70, 146, 217, 239, 160, 86, 38, 15])
>>> bin_edges
array([-3.04614305, -2.46559324, -1.88504342, -1.3044936 , -0.72394379,
-0.14339397, 0.43715585, 1.01770566, 1.59825548, 2.1788053 ,
2.75935511])

It accepts your data array and the number of bins (or edges) and returns two NumPy arrays:

  • hist contains the frequency or count of items corresponding to each bin.
  • bin_edges holds the edges or boundaries of the bins.

.hist() can graphically display what np.histogram() computes:

fig, ax = plt.subplots()
ax.hist(x, bin_edges, cumulative=False)
ax.set_xlabel('x')
ax.set_ylabel('Frequency') 
plt.show()

[Figure: Histogram of x]

3. Pie Charts

Pie charts show data with a small number of labels and relative frequencies. They work effectively with labels that cannot be sorted (like nominal data). A pie chart is a circle divided into multiple sections. Each slice corresponds to a single label in the dataset and has an area proportional to the label's relative frequency.

Let us define data that is associated with three labels:

>>> x, y, z = 128, 256, 1024

Now, create a pie chart with .pie():

fig, ax = plt.subplots()
ax.pie((x, y, z), labels=('x', 'y', 'z'), autopct='%1.1f%%')
plt.show()

[Figure: Pie chart of x, y, and z]


The first input to .pie() is your data, and the second is the sequence of labels. The format of the relative frequencies depicted in the figure is defined by autopct. The result should look like the chart above.

4. Bar Charts

Bar charts can also show data that corresponds to labels or discrete numeric values. They can display data pairs from two datasets: items in one set represent the labels, while items in the other represent the frequencies. Optionally, they can also display the errors associated with the frequencies.

The bar chart displays parallel rectangles known as bars. Each bar represents a single label and has a height proportionate to its frequency or relative frequency. Let's make three datasets of 21 items each:

>>> x = np.arange(21)
>>> y = np.random.randint(21, size=21)
>>> err = np.random.randn(21)

np.arange() is used to obtain x, an array of consecutive integers from 0 to 20, which will represent the labels. y is an array of uniformly distributed random integers between 0 and 20, which will represent the frequencies. The errors are represented by the normally distributed floating-point numbers in err; this last array is optional.

You may make a bar chart using .bar() for vertical bars or .barh() for horizontal bars:

fig, ax = plt.subplots()
ax.bar(x, y, yerr=err)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()

This code should yield the following result:

[Figure: Bar chart of y versus x with error bars]

5. X-Y Plots

An x-y plot, or scatter plot, represents pairs of data from two datasets, often together with a regression line that summarises their relationship. Let's create some data and obtain the regression line with scipy.stats.linregress():
>>> x = np.arange(21)
>>> y = 5 + 2 * x + 2 * np.random.randn(21)
>>> slope, intercept, r, *__ = scipy.stats.linregress(x, y)
>>> line = f'Regression line: y={intercept:.2f}+{slope:.2f}x, r={r:.2f}'

The dataset x is once again an array of integers ranging from 0 to 20. y is determined as a linear function of x that has been corrupted with random noise.

linregress() delivers several results. You'll need the regression line's slope and intercept, as well as the correlation coefficient r. You can then apply .plot() to obtain an x-y plot:

fig, ax = plt.subplots()
ax.plot(x, y, linewidth=0, marker='s', label='Data points')
ax.plot(x, intercept + slope * x, label=line)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(facecolor='white')
plt.show()

[Figure: X-Y plot with data points and regression line]

6. Heatmaps

A heatmap can be used to display a matrix visually. The colors represent the matrix's numbers or elements. Heatmaps are especially useful for displaying covariance and correlation matrices. .imshow() can be used to generate a heatmap for a covariance matrix:

matrix = np.cov(x, y).round(decimals=2)
fig, ax = plt.subplots()
ax.imshow(matrix)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, matrix[i, j], ha='center', va='center', color='w')
plt.show()

Here, the heatmap contains the labels 'x' and 'y' as well as the numbers from the covariance matrix. You’ll get a figure like this:

[Figure: Heatmap of the covariance matrix]

The yellow field corresponds to the matrix's greatest element, 130.34, while the purple field corresponds to the matrix's lowest element, 38.5. The blue squares in between represent the value 69.9.

The heatmap for the correlation coefficient matrix may be obtained using the same logic:

matrix = np.corrcoef(x, y).round(decimals=2)
fig, ax = plt.subplots()
ax.imshow(matrix)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('x', 'y'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, matrix[i, j], ha='center', va='center', color='w')
plt.show()

The result is the figure below:

[Figure: Heatmap of the correlation coefficient matrix]


The yellow color symbolizes the number 1.0, whereas the purple color indicates the value 0.99.

Conclusion

You now understand the quantities that describe and summarise datasets, as well as how to compute them in Python. It is feasible to obtain descriptive statistics using only Python code. However, this is rarely required. Typically, you'll use one of the libraries designed specifically for this purpose:

  1. For the most significant Python statistics functions, use statistics.
  2. To effectively handle arrays, use NumPy.
  3. For further Python statistics functions for NumPy arrays, use SciPy.
  4. To work with labeled datasets, use Pandas.
  5. Matplotlib can be used to visualize data via plots, charts, and histograms.

You must know how to calculate descriptive statistics measures in the age of big data and artificial intelligence. You're now prepared to delve even further into the world of data science and machine learning. If you have any questions or comments, please leave them in the space below.

Statistics for Data Science with Python FAQs

1. Can you use Python for statistics?

Yes, absolutely. Python prioritizes simplicity and readability while offering a wealth of relevant options for data analysts and scientists. As a result, even inexperienced programmers can readily use its relatively simple syntax to design effective solutions for complex situations with just a few lines of code.

Python's built-in analytics tools make it ideal for processing large amounts of data. In addition to other essential metrics for measuring performance, Python's built-in analytics tools can easily explore patterns, correlate information in large quantities, and deliver greater insights.

2. What statistics do you need for data science?

At the very least, data analysis necessitates descriptive statistics and probability theory. These ideas will assist you in making better business decisions based on data. Probability distributions, statistical significance, hypothesis testing, and regression are all important concepts.

Furthermore, knowing Bayesian thinking is required for machine learning. Bayesian reasoning is the act of updating beliefs as new data is gathered, and it is at the heart of many machine learning algorithms. Conditional probability, priors and posteriors, and maximum likelihood are all important topics.    

3. Is Python as good as R for statistics?

Both can handle almost any data analysis task and are regarded as reasonably simple languages to learn, particularly for novices. When it comes to learning Python or R, there is no wrong decision. Both are in-demand skills that will enable you to complete almost any data analytics work you come across. Which one is best for you will ultimately depend on your background, interests, and professional objectives.

Python is good for dealing with large amounts of data, graphic design and data visualization, constructing deep learning models, developing statistical models, and non-statistical operations such as web scraping, database storage, and process execution. R, meanwhile, stands out for its large ecosystem of statistical packages.

4. What percentage of data scientists use Python?

In 2018, 66% of data scientists reported using Python every day, making Python the most popular data science language!


Rohit Verma

Author

I am currently pursuing an engineering degree in data science and AI. I have worked on projects involving data science and full-stack web development (MERN), and I write articles on web and data science technologies with passion.

NameDateFeeKnow more