Pandas Inbuilt Profile Report Generation

Anchit Bhagat
4 min read · Jan 14, 2021

A new and effective way of exploring and analyzing data

As a data scientist, it is important not only to work towards the desired result but also to understand it. That analysis then has to be communicated effectively to your stakeholders. Exploratory Data Analysis (EDA) plays an important role in building a basic understanding of the data.

In EDA, we perform the tasks needed to extract the relevant information from our data, including:

  1. Finding the Missing / Null / NaN values
  2. Performing Statistical Analysis (through the describe() function)
  3. Observing the Correlation of one feature with other features / target feature
  4. Plotting & visualizing the variables
  5. Differentiating between classes of features (continuous, categorical or numerical)
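For comparison, here is a minimal sketch of what these steps look like when done by hand in plain pandas. The dataset and its column names are made up for illustration, and plotting (step 4) is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "genre": ["Drama", "Comedy", "Drama", None, "Action"],
    "year": [2001, 2005, 2010, 2015, 2020],
    "rating": [7.1, 6.4, np.nan, 8.0, 7.5],
})

# 1. Missing / null / NaN values per column
missing_counts = df.isna().sum()

# 2. Descriptive statistics for the numerical columns
summary = df.describe()

# 3. Pairwise correlation between the numerical features
correlations = df.select_dtypes(include="number").corr()

# 5. Separate numerical features from the rest
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing_counts)
print(summary)
print(correlations)
print(numeric_cols, categorical_cols)
```

Each step is a separate call whose output has to be read and combined manually, which is where the effort adds up on a wide dataset.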

EDA can become a really tedious task, especially when the dataset changes at the last moment because of additions or modifications. Pandas Profiling (the pandas_profiling package) is a separate library that works on top of pandas rather than a part of pandas itself, and its ProfileReport generates an automated report covering all of the analyses above. It presents the descriptive and statistical analysis in one well-organized report, which would take a considerable amount of time to produce step by step.

  • Installing & loading the pandas profiling library

To carry out the analysis, first install the library and import ProfileReport with the following commands:

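# Install the library (the leading "!" runs a shell command from a notebook cell)
!pip install pandas_profiling
# Import the report class
from pandas_profiling import ProfileReport

  • Generating the report in a notebook iframe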

Once the library is loaded, run the following commands to generate the report for a DataFrame named ‘data’:

# Build the profile for the DataFrame `data`
out = ProfileReport(data)
# Render the report inline in the notebook
out.to_notebook_iframe()

The Pandas Profiling report consists of several sections that provide a descriptive analysis of the numerical and categorical variables. Each section is described below.

1. Overview

This is the first section of the report. It gives dataset-level statistics such as the number of variables and the total number of missing values with their percentage, along with the variable types (categorical or numerical).

2. Variables

This section gives a large amount of information about each variable. For a variable of character (text) type, it shows a description of its values. Clicking the ‘toggle’ button in the right corner opens further details, where we can even see the categories of the different words with their total counts.

For numerical data types, it provides descriptive statistics such as the mean, median and skewness. We can also view the full set of quantile statistics, such as the 25%, 50% and 75% quantiles, the maximum, the range and even the IQR, without doing any calculation by hand.
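To make these numbers concrete, here is a small sketch of the same statistics computed manually in pandas. The ratings series is invented for illustration, and pandas' default (linear) quantile interpolation is assumed.

```python
import pandas as pd

# Hypothetical ratings column; the values are illustrative only.
ratings = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0], name="rating")

# 25%, 50% and 75% quantiles (default linear interpolation)
q1, median, q3 = ratings.quantile([0.25, 0.50, 0.75])

stats = {
    "mean": ratings.mean(),
    "median": median,
    "skewness": ratings.skew(),
    "q1": q1,
    "q3": q3,
    "max": ratings.max(),
    "range": ratings.max() - ratings.min(),
    "iqr": q3 - q1,  # interquartile range
}

for name, value in stats.items():
    print(f"{name}: {value}")
```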

3. Interactions

The Interactions tab shows how any two features are related to each other. The interactions are shown through scatter plots; here, they are shown between the ‘Year’ and ‘IMDb’ features.

4. Correlation

Clicking on Correlation gives a correlation matrix of the features. One can view different correlation measures such as Pearson, Spearman, Phik and Kendall, along with a brief description of each.
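As a rough illustration of why more than one measure is useful, the sketch below compares Pearson and Spearman on made-up data using plain pandas. DataFrame.corr also accepts method="kendall"; Phik is not available through DataFrame.corr and is not covered here.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not along a straight line.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

# Pearson measures linear association between the raw values
pearson = df.corr(method="pearson")

# Spearman measures association between the ranks of the values
spearman = df.corr(method="spearman")

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Because y always increases with x, the rank-based Spearman value is a perfect 1, while Pearson comes out lower since the relationship is not linear.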

5. Missing Values

This section shows the count of missing values for each variable/feature. Additionally, it provides heatmaps and dendrograms of the missing values in the dataset.
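The counts in this section can be reproduced by hand, as in the sketch below on invented data. The last line computes a nullity correlation, i.e. how strongly the absence of one column goes together with the absence of another; the assumption here is that the report's heatmap conveys this kind of relationship.

```python
import numpy as np
import pandas as pd

# Hypothetical data: budget and revenue are always missing together.
df = pd.DataFrame({
    "budget":  [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "revenue": [2.0, np.nan, 4.0, np.nan, 6.0, 7.0],
    "rating":  [np.nan, 6.5, 7.0, 8.0, np.nan, 7.5],
})

# Missing values per column, as a count and as a percentage
missing_count = df.isna().sum()
missing_percent = df.isna().mean() * 100

# Correlation between the missing/not-missing indicators of each column
nullity_corr = df.isna().astype(int).corr()

print(missing_count)
print(missing_percent.round(1))
print(nullity_corr)
```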

6. Sample

The last section is Sample, which works like head() and tail(): it lets us observe the first and last few rows of the dataset.

  • Saving the report file

The profiling report can be saved for later use with the following command:

out.to_file("file name.html")

The report will be saved in HTML format.
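Putting the pieces together, here is a minimal sketch of the whole flow wrapped in a helper. The function name save_profile, the toy data, the title and the file name are all made up for illustration; only ProfileReport and its to_file method come from the library.

```python
import pandas as pd


def save_profile(df, path="report.html", title="Profiling Report"):
    """Build a profile report for df and write it to an HTML file."""
    # Imported inside the function so the rest of the script still runs
    # when pandas_profiling is not installed.
    from pandas_profiling import ProfileReport

    report = ProfileReport(df, title=title)
    report.to_file(path)  # to_file is a method of the report object
    return path


# Hypothetical toy data; replace with your own DataFrame.
data = pd.DataFrame({"Year": [2001, 2010, 2020], "IMDb": [7.1, 6.4, 8.0]})

# Uncomment once pandas_profiling is installed:
# save_profile(data, "movies_report.html")
```

The saved HTML file can be opened in any browser and shared without the notebook.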

Pandas Profiling is one of the easiest ways to perform EDA while consuming very little time.
