Exploratory Data Analysis — Retail using Python

4 min readJan 6, 2021

We will Visualize a Superstore Dataset related to different categories of Office Supplies, Furniture & Technological products, by conducting EDA on the dataset through the use of scatter plots & heatmaps. After that, we will try to figure out the variables which are directly / indirectly related to Profit variable.

Read, Load & Understand the Data

#Load packages
# Read the file & display first 5 rowsimport pandas as pd
import numpy as np
import matplotlib.pyplot as plt6
import seaborn as sns
import warnings
warnings.filterwarnings(“ignore”, category=DeprecationWarning)
warnings.filterwarnings(“ignore”, category=FutureWarning)super_store = pd.read_csv(“D:/ANCHIT/Python/SampleSuperstore.csv”)
super_store.head()

# Rows containing duplicate data
duplicate_rows_super_store = super_store[super_store.duplicated()]
print("number of duplicate rows: ", duplicate_rows_super_store.shape)# Dropping the duplicates 
super_store = super_store.drop_duplicates()# Check for any null values in the dataset
super_store.isnull().sum()# Rounding off the decimal values upto 2 decimal places for Sales & Profit 
# Copying on new variable of Sales_r & Profit_r respectivelysuper_store['Sales_r'] = round(super_store['Sales'], 2)
super_store['Profit_r'] = round(super_store['Profit'], 2)
cols = ['Sales', 'Profit']
super_store.drop(cols, axis=1, inplace=True)# Below variable doesn't yield much value, hence dropping them
super_store.drop(['Postal Code'], axis=1, inplace=True)#Renaming the column
super_store.rename(columns = {'Ship Mode':'Ship_Mode'}, inplace = True) 
super_store.head()

We will also perform certain statistical inferences on the data (describe command), check the type of variables (info command), observe the shape & columns list (detailed code in my github account).

Outlier Detection

#Checking for outliers
plt.figure(figsize=[10,5])
sns.set(style=”whitegrid”)
ax = sns.boxplot(data=super_store)
plt.show()

There are number of Outliers in Sales & Profit. Hence we need to remove them

def remove_outlier(col):
 sorted(col)
 Q1, Q3 = col.quantile([.25, .75])
 IQR = Q3-Q1
 lower_range=Q1-(1.5+IQR)
 upper_range=Q3+(1.5+IQR)
 return lower_range, upper_rangelowsales, uppsales=remove_outlier(super_store[‘Sales_r’])
super_store[‘Sales_r’] = np.where(super_store[‘Sales_r’] > uppsales, uppsales, super_store[‘Sales_r’])
super_store[‘Sales_r’] = np.where(super_store[‘Sales_r’] < lowsales, lowsales, super_store[‘Sales_r’])lowsales, uppsales=remove_outlier(super_store[‘Profit_r’])
super_store[‘Profit_r’] = np.where(super_store[‘Profit_r’] > uppsales, uppsales, super_store[‘Profit_r’])
super_store[‘Profit_r’] = np.where(super_store[‘Profit_r’] < lowsales, lowsales, super_store[‘Profit_r’])#Let's see if the outliers are removed or not
plt.figure(figsize=[10,5])
sns.set(style="whitegrid")
ax = sns.boxplot(data=super_store)
plt.show()

Exploratory Data Analysis

Analysing the categorical variables

super_store.Category.value_counts(ascending=False)
sns.countplot(x="Category", data=super_store) plt.show()

super_store.Segment.value_counts(ascending=False)
sns.catplot(x="Segment", y="Profit_r", kind='bar', hue='Category', data=super_store)

Though the Category of Office Supplies was more in number, Technology category brought the highest profit in all the 3 Segments

sns.countplot(x=”Segment”, data=super_store)
plt.show()

# CatPlot for State vs Profit
plt.figure(figsize=(15,8))
sns.catplot(x="State", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=10)
plt.show()

We can see Wyoming & Vermont are among the highest Profit States alongwith others like Montana, Rhode Island, Indiana & Minnesota. However, states like Ohio, Texas, Pennsylvania & Illinois are making losses.

super_store.Region.value_counts(ascending=False)
sns.countplot(x="Region", data=super_store)
plt.show()

# Using Lineplot 
df = pd.DataFrame(dict(time=np.arange(500),
                       value=np.random.randn(500).cumsum()))
g = sns.relplot(x="Discount", y="Profit_r", hue = 'Category', kind="line", data=super_store)
g.fig.autofmt_xdate()

We can observe that by giving Discount of 10 %, we can increase our Profit for all the 3 categories. However, giving discount of more than 10% will make our Profit margin decrease and it keeps on decreasing as we increase the discount.
Hence an ideal percant of 10% discount is feasible.

# Plotting all the variables 
sns.pairplot(super_store)
sns.set(style="whitegrid")

As we can see above in the pairplot, there is not much a relation among the variables

#plot correlation matrix
plt.figure(figsize=[20,10])
corr_mat = super_store.corr()
sns.set(style=”whitegrid”)
sns.heatmap(corr_mat, cmap=’coolwarm’, annot=True)

As you can see there seems to be a positive correlation between Sales & Profit. Since it is positive, it means with increase in Sales, Profit increases. There is a negative correlation between Discount & Profit.
However, the correlations are not strong enough for any conclusion to make.

# CatPlot for Sub-Category vs Profit
plt.figure(figsize=(15,8))
sns.catplot(x="Sub-Category", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=10)
plt.show()

In the above plot, we can see that Copiers made the highest profit among all the Sub-Category products. Other categories are Phones, Accessories, Appliances, Machines & Envelopes which had made considerable amount of profit.
However, Tables made a neglible contribution in terms of profit. Other products like Fasteners & Supplies didn’t made much of either.

View the entire code at my github account: https://github.com/Anchit13/AnchitBhagat/blob/main/Super%20Store.ipynb

Exploratory Data Analysis — Retail using Python

Read, Load & Understand the Data

Outlier Detection

Exploratory Data Analysis

Analysing the categorical variables

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Anchit Bhagat

No responses yet