Exploratory Data Analysis (EDA) using Python

Exploratory Data Analysis (EDA) is a way to better understand your data, derive business insights from it, and generate descriptive reports and summaries.

In this tutorial we will discuss various ways to explore data. I will show different methods that should help you build your own exploration process for the problem statements you work on later.

EDA is key to generating and testing hypotheses, checking business assumptions, and understanding the factors that drive change in your data, all of which helps with further analysis and feature engineering. There are various tools and techniques for understanding data; the basic requirement is a working knowledge of NumPy and pandas, which are used for statistical operations and data manipulation.

We will use the well-known Titanic dataset to study data exploration methods, and for some examples we will also use the built-in tips dataset.

Some knowledge of data visualization with Matplotlib and seaborn will make each method easier to follow.

Basic Description About Data

Let's get started with exploring the data. First, we need to import the libraries and load the data. You can download the Titanic dataset from here.
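As a minimal sketch of the loading step, the snippet below reads CSV text with pandas. The rows are made up and only borrow Titanic-style column names, and the file name mentioned in the comment is an assumption, since it depends on where you save the download.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded CSV: a few made-up rows that use
# Titanic-style column names. With the real file you would pass its path,
# for example pd.read_csv("train.csv") (the file name is an assumption).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,26,7.92
4,0,3,male,,8.05
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The empty Age field in the last row is read as a missing value, which is useful for the checks that follow.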

There are seven basic questions you should ask before exploring the data, so let's go through them one by one.

Que-1) How large is the data?

This is a simple question: how many rows and columns do you have? The answer tells you how large a dataset you are working with, so that you can choose techniques suited to large or small datasets. You can check it with the shape attribute, which returns the number of rows and columns.
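A small sketch of this check, using a hand-made DataFrame with Titanic-style column names (the rows are invented for illustration and are not the real dataset):

```python
import pandas as pd

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

n_rows, n_cols = df.shape  # shape is an attribute, so no parentheses
print(f"{n_rows} rows, {n_cols} columns")
```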

Que-2) What does your data look like?

Take a quick look at a sample of the data, as we all do. You can simply call head or tail, which return the first five and last five rows by default.

Another way is to use the sample function, which returns randomly chosen rows.
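The three calls can be sketched as follows, again on invented rows with Titanic-style column names. The `random_state` value is an arbitrary choice that only makes the random sample reproducible.

```python
import pandas as pd

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

first_rows = df.head()                        # first five rows by default
last_rows = df.tail(2)                        # last two rows
random_rows = df.sample(n=3, random_state=42)  # three random rows

print(first_rows)
print(last_rows)
print(random_rows)
```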

Que-3) What are the Data Types of columns?

Understand which data types your features have and how many features there are of each type. This helps you analyze numerical and categorical variables separately.
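One possible way to do this, sketched on invented Titanic-style rows, is to print `dtypes` and then split the columns with `select_dtypes`:

```python
import pandas as pd

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

print(df.dtypes)  # one data type per column

# Split the column names into numerical and non-numerical groups.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print("numerical:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that columns such as Survived and Pclass are stored as integers even though they behave like categories, so the split by data type is only a starting point.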

Que-4) Are there any missing values in the data?

You can partly answer this question from the previous step, but with a large dataset that becomes hard to read, so you can use the isnull function instead.
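A minimal sketch of the missing-value check on invented Titanic-style rows. Summing the boolean result of `isnull` gives a count per column, and taking the mean gives the share of missing values.

```python
import pandas as pd

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

missing_count = df.isnull().sum()         # missing values per column
missing_percent = df.isnull().mean() * 100  # same, as a percentage of rows
print(missing_count)
print(missing_percent.round(1))
```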

Que-5) How does the data look mathematically?

Basic statistics such as the mean, minimum, and maximum give you better insight for analyzing any feature further. This step (typically the describe method in pandas) covers only numerical columns by default, but it is still worth running.
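A short sketch of the summary statistics, assuming the pandas `describe` method is the one meant here. The rows are invented and only use Titanic-style column names.

```python
import pandas as pd

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

# count, mean, std, min, quartiles, and max for each numerical column.
# The text column "Sex" is left out by default.
summary = df.describe()
print(summary)
```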

Que-6) Are there any duplicated values?

It is important to know whether the data you are working with contains duplicate rows. If it does, you can drop them when they are of no use to you, or analyze them further.
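The check and the optional cleanup can be sketched like this. The rows are invented, and the third row deliberately repeats the first.

```python
import pandas as pd

# Hand-made rows (illustration only); row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.25, 7.92],
})

# duplicated() marks every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())
deduplicated = df.drop_duplicates()

print("duplicate rows:", n_duplicates)
print(deduplicated)
```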

Que-7) How are the columns correlated?

It is useful to compute correlation values before EDA. A correlation value tells you how strongly, positively or negatively, one feature is related to another, which helps you decide which relationships to visualize.

Most people look at correlation only during feature selection, but it is good to check it several times to see whether the relationships change after some analysis.
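A minimal sketch of a correlation matrix on invented Titanic-style rows. Selecting the numerical columns first is a choice made here so that the text column does not get in the way.

```python
import pandas as pd

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 13.00],
})

# Pearson correlation between every pair of numerical columns.
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

In these toy rows the higher classes (lower Pclass number) paid more, so the Pclass and Fare entry comes out negative.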

We now have a basic understanding of the data. From here the actual analysis starts, which is what EDA really refers to.

EDA covers univariate, bivariate, and multivariate analysis, and we will look at several methods for each.

The article is a little long, but it presents most of the techniques used for EDA.

UNIVARIATE ANALYSIS

We will use seaborn and Matplotlib for univariate analysis.

1) Categorical Data

First, we will do univariate analysis on categorical data using various plots.

a) Countplot

It computes the same counts as value_counts in pandas and plots them as a bar chart. You can use it on any column, but it is meant for categorical ones. It answers questions such as which categories are more frequent than others, so that you can then look for the reason.
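A sketch of a countplot with seaborn next to the equivalent `value_counts` call. The rows are invented and only use Titanic-style column names.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "male"],
    "Pclass": [3, 1, 3, 3, 1, 2, 3, 1],
})

counts = df["Sex"].value_counts()  # the numbers behind the bars
print(counts)

ax = sns.countplot(data=df, x="Sex")  # one bar per category
ax.set_title("Passengers by sex (toy data)")
# In a notebook the figure renders on its own; in a script call plt.show().
```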

b) Pie Chart

Like the countplot, it is used for categorical data, and it additionally shows the percentage share of each category in the data. Let's check the sex column to see how many male and female passengers were traveling.
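One way to draw such a pie chart is with Matplotlib directly, sketched here on invented values. The `autopct` format string controls how the percentages are printed on the slices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hand-made values for a Titanic-style "Sex" column (illustration only).
sex = pd.Series(
    ["male", "female", "female", "male", "male", "male", "female", "male"],
    name="Sex",
)

share = sex.value_counts(normalize=True) * 100  # percentage per category
print(share)

fig, ax = plt.subplots()
ax.pie(share, labels=share.index, autopct="%.1f%%")
ax.set_title("Share of passengers by sex (toy data)")
# In a notebook the figure renders on its own; in a script call plt.show().
```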

2) Numerical Data

Studying numerical data is an important part of univariate analysis, so let's see different ways to understand it.

a) Histogram

A histogram splits the range of a numerical column into bins and plots how many values fall into each bin. In effect, it treats the numerical column as categorical. It shows the distribution of the data and where most values lie: whether the distribution is positively skewed, negatively skewed, or roughly normal.
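A small histogram sketch with Matplotlib on invented ages. The number of bins (5) is an arbitrary choice for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hand-made ages (illustration only).
ages = pd.Series([2, 19, 22, 24, 26, 29, 31, 35, 38, 54, 71], name="Age")

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges along with the drawn bars.
counts, edges, _ = ax.hist(ages, bins=5)
ax.set_xlabel("Age")
ax.set_ylabel("Number of passengers")

print("bin edges:", edges)
print("count per bin:", counts)
```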

b) Distplot

It is an extension of the histogram. The difference is that it overlays a KDE (kernel density estimation) curve on top of the histogram. The curve estimates the PDF (probability density function), which indicates how likely values in a given range are. Note that seaborn has deprecated distplot since version 0.11 in favor of histplot and displot.

c) BoxPlot

It is a very useful plot that gives us a five-number summary. To understand a boxplot you need to understand some terms.

Median - the middle value of a sorted series
Percentile - a value below which a given percentage of the data falls.
Minimum and Maximum - here these are not the smallest and largest values in the data but limits calculated from the IQR (interquartile range). Points outside these limits are treated as outliers.

Minimum = Q1 - 1.5 * IQR

Maximum = Q3 + 1.5 * IQR

Q1 and Q3 are the 25th and 75th percentiles, also referred to as quartiles, and IQR = Q3 - Q1.
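The formulas above can be worked through by hand with pandas. The ages below are invented so that one value falls outside the limits.

```python
import pandas as pd

# Hand-made ages (illustration only).
ages = pd.Series([2, 22, 26, 35, 38, 54, 80])

q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr  # "Minimum" in the boxplot sense
upper_limit = q3 + 1.5 * iqr  # "Maximum" in the boxplot sense

# Values beyond the limits are the points a boxplot draws as outliers.
outliers = ages[(ages < lower_limit) | (ages > upper_limit)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}")
print("outliers:", outliers.tolist())
```

Here Q1 is 24 and Q3 is 46, so the IQR is 22 and the limits are -9 and 79. Only the age 80 lies outside them.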

There are more plots you can draw, such as violin plots and rug plots, but they are not needed for every dataset.

Now we move on to a very important part: bivariate and multivariate analysis.

BIVARIATE/MULTIVARIATE ANALYSIS

Analyzing two variables together is called bivariate analysis, and analyzing more than two is called multivariate analysis. Here we will also use the tips dataset, which can be loaded from the seaborn library.

1) Scatter Plot (Numerical - Numerical)

When both features are numerical, we use a scatter plot to see the relationship between them. Here we want to compare the total bill with the tip, so we can use a scatter plot; this is bivariate analysis.

We can also do multivariate analysis with a scatter plot. Suppose we want to see the relationship between total bill and tip separately for male and female customers. This is multivariate analysis with 3 variables.

Multivariate analysis with 4 variables is also possible with a scatter plot: along with gender, I also want to see whether the person was a smoker, together with the tip and total bill.
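A sketch of the four-variable scatter plot with seaborn. The rows are invented and only reuse the column names of the tips dataset; with seaborn and an internet connection, the real data can be loaded with `sns.load_dataset("tips")` instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made rows with tips-style column names (illustration only).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# x and y give the bivariate plot; hue adds a third variable as color,
# and style adds a fourth as marker shape.
ax = sns.scatterplot(data=tips, x="total_bill", y="tip", hue="sex", style="smoker")
ax.set_title("Tip vs total bill (toy data)")
# In a notebook the figure renders on its own; in a script call plt.show().
```

Dropping `style` gives the three-variable version, and dropping `hue` as well gives the plain bivariate scatter plot.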

2) Bar Plot (Numerical - Categorical)

It is used to compare a numerical variable across the categories of a categorical one, so we can see the typical value (the mean, by default) in each category. The black line at the top of each bar shows the confidence interval.

Let's see the relationship of each P-class with age.

Here also we can do multivariate analysis. Suppose we want to find the average fare in each P-class by gender. You can see that the fare of females is higher in every class.
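A sketch of the grouped bar plot together with the group means it displays. The fares are invented for illustration and do not reproduce the figures of the real Titanic dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "Sex": ["female", "female", "male", "male"] * 3,
    "Fare": [80, 100, 50, 60, 20, 24, 14, 12, 10, 12, 8, 7],
})

# The bar heights are the mean fare per class and sex.
mean_fare = df.groupby(["Pclass", "Sex"])["Fare"].mean()
print(mean_fare)

# hue splits each class into one bar per sex.
ax = sns.barplot(data=df, x="Pclass", y="Fare", hue="Sex")
ax.set_title("Average fare by class and sex (toy data)")
# In a notebook the figure renders on its own; in a script call plt.show().
```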

3) Boxplot (Numerical - Categorical)

We can plot a separate boxplot of age for each gender.

Here also we can do multivariate analysis with the hue parameter, so let's look at survival by gender across the age range.
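A sketch of the grouped boxplot with seaborn on invented rows. The group medians printed first are the center lines of the boxes.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Survived": [0, 0, 1, 1, 0, 0, 1, 1],
    "Age": [30, 40, 5, 25, 20, 50, 28, 36],
})

medians = df.groupby(["Sex", "Survived"])["Age"].median()
print(medians)

# One box per sex, split further by survival through hue.
ax = sns.boxplot(data=df, x="Sex", y="Age", hue="Survived")
ax.set_title("Age by sex and survival (toy data)")
# In a notebook the figure renders on its own; in a script call plt.show().
```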

4) Distplot (Numerical - Categorical)

Distplot does not have a hue parameter, but we can get the same effect by drawing one curve per category on the same axes. Suppose we want to compare ages: in which age ranges is the probability of survival higher than the probability of death?

It is a very interesting graph. The blue curve indicates the probability of dying and the orange one the probability of surviving. If you look carefully, the probability of surviving is higher than the probability of dying for children (age below 15), which also makes sense logically.

So, if you perform a deep analysis, you can find stories like this; these are the hidden patterns that you include in your data story.

5) Heatmap (Categorical - Categorical)

Now we will work with two categorical columns. In this case, we can use a heatmap to see how often each category of one column occurs together with each category of the other. It shows the same information as the crosstab function in pandas.

Suppose I want to find how many people survived and died in each P-class.
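A sketch that builds the table with `pd.crosstab` and passes it to seaborn's `heatmap`. The rows are invented and only use Titanic-style column names.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
})

# Rows are classes, columns are survival outcomes, cells are counts.
table = pd.crosstab(df["Pclass"], df["Survived"])
print(table)

# annot=True writes the count into each cell; fmt="d" formats it as an integer.
ax = sns.heatmap(table, annot=True, fmt="d")
ax.set_title("Survival by class (toy data)")
# In a notebook the figure renders on its own; in a script call plt.show().
```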

6) Clustermap (Categorical - Categorical)

We can also draw a cluster map to examine the relationship between categorical columns. It builds a dendrogram, a hierarchical tree that shows which categories behave similarly.
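A rough sketch of a cluster map on the same kind of crosstab table, using invented rows. It assumes SciPy is installed, since seaborn's `clustermap` relies on it for the hierarchical clustering.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-made rows with Titanic-style column names (illustration only).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 3, 1, 2],
})

table = pd.crosstab(df["Pclass"], df["Survived"])

# clustermap reorders rows and columns so that similar ones sit together
# and draws the dendrograms along the edges of the heatmap.
grid = sns.clustermap(table)

# Row positions in the order chosen by the clustering.
row_order = grid.dendrogram_row.reordered_ind
print("row order after clustering:", row_order)
# In a notebook the figure renders on its own; in a script call plt.show().
```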

There are some more plots, such as the pair plot, joint plot, and line plot, which can also be used for framing an analysis.

SUMMARY

EDA is a key tool for performing better in any machine learning competition and for solving real-world problem statements with a well-generalized model. It also lets you back up your decisions with analysis and summary reports that show why you made them by looking at your data.

I hope it was easy to follow up to this point. If you have any queries, please post them in the comment section, and if you know a technique that should be listed here, please mention it in the comments. I know the post was a little long, but these concepts are necessary to understand data in a better way.

Thank You!

😊 Keep learning, Happy Learning 😊