Exploratory Data Analysis — Retail using Python

Anchit Bhagat
4 min read · Jan 6, 2021

We will visualize a Superstore dataset covering different categories of Office Supplies, Furniture & Technology products, conducting EDA on the dataset through scatter plots & heatmaps. After that, we will try to figure out which variables are directly or indirectly related to the Profit variable.

Read, Load & Understand the Data

# Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
# Read the file & display the first 5 rows
super_store = pd.read_csv("D:/ANCHIT/Python/SampleSuperstore.csv")
super_store.head()
# Rows containing duplicate data
duplicate_rows_super_store = super_store[super_store.duplicated()]
print("number of duplicate rows: ", duplicate_rows_super_store.shape)
# Dropping the duplicates
super_store = super_store.drop_duplicates()
# Check for any null values in the dataset
super_store.isnull().sum()
# Rounding off Sales & Profit to 2 decimal places
# Copying into new variables Sales_r & Profit_r respectively
super_store['Sales_r'] = round(super_store['Sales'], 2)
super_store['Profit_r'] = round(super_store['Profit'], 2)
cols = ['Sales', 'Profit']
super_store.drop(cols, axis=1, inplace=True)
# The column below doesn't yield much value, hence dropping it
super_store.drop(['Postal Code'], axis=1, inplace=True)
#Renaming the column
super_store.rename(columns = {'Ship Mode':'Ship_Mode'}, inplace = True)
super_store.head()

We will also compute summary statistics on the data (the describe command), check the variable types (the info command), and inspect the shape & column list (detailed code in my GitHub account).
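For reference, those inspection steps look like this (a minimal sketch; the full output is in the notebook linked at the end):

# Summary statistics for the numeric columns
super_store.describe()
# Column dtypes & non-null counts
super_store.info()
# Shape & column list
print(super_store.shape)
print(super_store.columns.tolist())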

Outlier Detection

#Checking for outliers
plt.figure(figsize=[10,5])
sns.set(style="whitegrid")
ax = sns.boxplot(data=super_store)
plt.show()

There are a number of outliers in Sales & Profit, so we need to remove them.

def remove_outlier(col):
    # Cap values outside the 1.5 * IQR fences
    Q1, Q3 = col.quantile([.25, .75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

lowsales, uppsales = remove_outlier(super_store['Sales_r'])
super_store['Sales_r'] = np.where(super_store['Sales_r'] > uppsales, uppsales, super_store['Sales_r'])
super_store['Sales_r'] = np.where(super_store['Sales_r'] < lowsales, lowsales, super_store['Sales_r'])
lowprofit, upprofit = remove_outlier(super_store['Profit_r'])
super_store['Profit_r'] = np.where(super_store['Profit_r'] > upprofit, upprofit, super_store['Profit_r'])
super_store['Profit_r'] = np.where(super_store['Profit_r'] < lowprofit, lowprofit, super_store['Profit_r'])
#Let's see if the outliers are removed or not
plt.figure(figsize=[10,5])
sns.set(style="whitegrid")
ax = sns.boxplot(data=super_store)
plt.show()

Exploratory Data Analysis

Analysing the categorical variables

super_store.Category.value_counts(ascending=False)
sns.countplot(x="Category", data=super_store) plt.show()
super_store.Segment.value_counts(ascending=False)
sns.catplot(x="Segment", y="Profit_r", kind='bar', hue='Category', data=super_store)

Though Office Supplies had the highest order count, the Technology category brought the highest profit in all 3 Segments.
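We can cross-check this with a quick aggregation (a sketch; note it sums profit per group, whereas the bar plot above shows seaborn's default estimator, the mean):

# Total profit for each Segment/Category combination
profit_by_seg_cat = (super_store
                     .groupby(['Segment', 'Category'])['Profit_r']
                     .sum()
                     .unstack())
print(profit_by_seg_cat)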

sns.countplot(x="Segment", data=super_store)
plt.show()
# CatPlot for State vs Profit
plt.figure(figsize=(15,8))
sns.catplot(x="State", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=10)
plt.show()

We can see Wyoming & Vermont are among the highest-profit states, along with others like Montana, Rhode Island, Indiana & Minnesota. However, states like Ohio, Texas, Pennsylvania & Illinois are making losses.
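To read exact figures rather than eyeballing the bars, we can sort total profit by state (a quick sketch):

# Total profit per state, sorted ascending
state_profit = super_store.groupby('State')['Profit_r'].sum().sort_values()
print(state_profit.head())   # biggest loss-makers
print(state_profit.tail())   # most profitable states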

super_store.Region.value_counts(ascending=False)
sns.countplot(x="Region", data=super_store)
plt.show()
# Lineplot of Discount vs Profit by Category
sns.relplot(x="Discount", y="Profit_r", hue='Category', kind="line", data=super_store)
plt.show()

We can observe that a discount of 10% yields the best profit for all 3 categories. Giving a discount of more than 10% shrinks the profit margin, and it keeps decreasing as the discount increases.

Hence a discount of around 10% is the ideal, feasible level.
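To back this up numerically, we can average profit at each discount level (a minimal sketch, assuming Discount is stored as a fraction, e.g. 0.1 for 10%):

# Mean profit at each discount level, per category
discount_profit = (super_store
                   .groupby(['Category', 'Discount'])['Profit_r']
                   .mean()
                   .unstack(level=0))
print(discount_profit)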

# Plotting all the variables
sns.set(style="whitegrid")
sns.pairplot(super_store)

As we can see in the pairplot above, there is not much of a relationship among the variables.

# Plot the correlation matrix
plt.figure(figsize=[20,10])
corr_mat = super_store.corr(numeric_only=True)  # numeric_only needs pandas >= 1.5; older versions drop non-numeric columns automatically
sns.set(style="whitegrid")
sns.heatmap(corr_mat, cmap='coolwarm', annot=True)

As you can see, there is a positive correlation between Sales & Profit: as Sales increase, Profit tends to increase. There is a negative correlation between Discount & Profit.

However, the correlations are not strong enough to draw any firm conclusions.
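For the exact coefficients, we can pull the Profit column out of the correlation matrix computed above (a quick sketch):

# Correlation of each numeric variable with Profit, strongest first
print(corr_mat['Profit_r'].drop('Profit_r').sort_values(ascending=False))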

# CatPlot for Sub-Category vs Profit
plt.figure(figsize=(15,8))
sns.catplot(x="Sub-Category", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=10)
plt.show()

In the above plot, we can see that Copiers made the highest profit among all the Sub-Category products. Phones, Accessories, Appliances, Machines & Envelopes also made a considerable amount of profit.

However, Tables made a negligible contribution in terms of profit, and products like Fasteners & Supplies didn't contribute much either.
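The same ranking can be computed directly (a minimal sketch):

# Total profit per Sub-Category, highest first
subcat_profit = (super_store
                 .groupby('Sub-Category')['Profit_r']
                 .sum()
                 .sort_values(ascending=False))
print(subcat_profit)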

View the entire code at my GitHub account: https://github.com/Anchit13/AnchitBhagat/blob/main/Super%20Store.ipynb

