Exploratory Data Analysis — Retail using Python

We will visualize a Superstore dataset covering three product categories (Office Supplies, Furniture & Technology) by conducting EDA on it with scatter plots & heatmaps. After that, we will try to identify the variables that are directly or indirectly related to the Profit variable.

Read, Load & Understand the Data

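# Load packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read the file into super_store & display first 5 rows (see the notebook linked at the end for the exact call)
# Rows containing duplicate data
duplicate_rows_super_store = super_store[super_store.duplicated()]
# shape[0] is the row count, i.e. the number of duplicates
print("number of duplicate rows: ", duplicate_rows_super_store.shape[0])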

We also compute summary statistics for the data (describe), check the variable types (info), and look at the shape & list of columns (detailed code in my GitHub account).
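As a rough sketch of those checks, here is what they look like on a tiny made-up frame. The column names and values below are hypothetical stand-ins, not rows from the Superstore file.

```python
import pandas as pd

# Hypothetical stand-in for super_store; values are made up for illustration
toy = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Office Supplies", "Technology"],
    "Sales": [250.0, 900.0, 15.5, 900.0],
    "Profit": [40.0, 210.0, 4.2, 210.0],
})

summary = toy.describe()      # count, mean, std, min, quartiles, max of numeric columns
toy.info()                    # column dtypes and non-null counts (prints to stdout)
n_rows, n_cols = toy.shape    # (rows, columns)
columns = list(toy.columns)
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(summary)
print(n_rows, n_cols, columns, n_duplicates)
```

Note that describe only summarizes the numeric columns by default, so Category does not appear in its output.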

Outlier Detection

#Checking for outliers
ax = sns.boxplot(data=super_store)

There are a number of outliers in Sales & Profit, so we need to remove them.

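def remove_outlier(col):
    # Quartiles and interquartile range of the column
    Q1, Q3 = col.quantile([.25, .75])
    IQR = Q3 - Q1
    # Fences at 1.5 * IQR (the conventional multiplier, assumed here)
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    return lower_range, upper_range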
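The plots later in this post use a column named Profit_r, which is presumably the outlier-treated Profit. Below is a minimal sketch of one way such a column could be built. It assumes the usual 1.5 × IQR fences and assumes that values outside them are capped rather than dropped; the notebook may do it differently. The helper name iqr_bounds and the toy values are hypothetical.

```python
import pandas as pd

def iqr_bounds(col, k=1.5):
    """Return (lower, upper) fences; k=1.5 is the conventional multiplier (assumed)."""
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up profit values with one extreme loss and one extreme gain
toy = pd.DataFrame({"Profit": [-500.0, 5.0, 8.0, 10.0, 12.0, 15.0, 900.0]})

low, high = iqr_bounds(toy["Profit"])
# Cap values outside the fences instead of dropping the rows
toy["Profit_r"] = toy["Profit"].clip(lower=low, upper=high)

print(low, high)
print(toy)
```

Capping keeps every row, so counts per Category or Segment stay unchanged; dropping the rows instead would be a one-line filter on the same fences.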

Exploratory Data Analysis

Analysing the categorical variables

sns.countplot(x="Category", data=super_store)
plt.show()
sns.catplot(x="Segment", y="Profit_r", kind='bar', hue='Category', data=super_store)

Although Office Supplies has the most records, the Technology category brought the highest profit in all 3 Segments.

sns.countplot(x="Segment", data=super_store)
# CatPlot for State vs Profit
sns.catplot(x="State", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)

We can see that Wyoming & Vermont are among the most profitable states, along with others like Montana, Rhode Island, Indiana & Minnesota. However, states like Ohio, Texas, Pennsylvania & Illinois are making losses.

sns.countplot(x="Region", data=super_store)
# Using Lineplot
g = sns.relplot(x="Discount", y="Profit_r", hue='Category', kind="line", data=super_store)

We can observe that a Discount of 10% increases Profit for all 3 categories. However, a discount of more than 10% reduces our Profit margin, and it keeps decreasing as the discount increases.

Hence a discount of about 10% looks ideal.
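By default the line plot draws the mean of Profit_r at each discount level, so the same reading can be cross-checked numerically with a groupby. A small sketch with made-up numbers (the frame below is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical values chosen only to illustrate the calculation
toy = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.1, 0.1, 0.2, 0.2, 0.5, 0.5],
    "Profit_r": [20.0, 30.0, 40.0, 50.0, 10.0, 20.0, -30.0, -10.0],
})

# Mean profit at each discount level, the quantity the line plot traces
mean_profit = toy.groupby("Discount")["Profit_r"].mean()
# Discount level with the highest mean profit
best_discount = mean_profit.idxmax()

print(mean_profit)
print("discount with the highest mean profit:", best_discount)
```

On the real data, adding "Category" to the groupby keys would give one such series per category, matching the hue split in the plot.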

# Plotting all the variables
sns.pairplot(super_store)

As we can see in the pairplot above, there is not much of a relationship among the variables.

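# Plot correlation matrix (numeric columns only; text columns such as Category would raise an error in recent pandas)
corr_mat = super_store.corr(numeric_only=True)
sns.heatmap(corr_mat, cmap='coolwarm', annot=True)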

As you can see, there seems to be a positive correlation between Sales & Profit: as Sales increase, Profit tends to increase too. There is a negative correlation between Discount & Profit.

However, the correlations are not strong enough to draw any firm conclusion.
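Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of DataFrame.corr. As a reminder of what that number measures, here is a plain-Python sketch on made-up values. The helper pearson_r and the three lists are hypothetical and only illustrate the formula.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values: profit loosely follows sales, discount moves the other way
sales = [10.0, 20.0, 30.0, 40.0, 50.0]
profit = [1.0, 3.0, 2.0, 5.0, 4.0]
discount = [0.5, 0.4, 0.3, 0.2, 0.1]

r_sales_profit = pearson_r(sales, profit)
r_discount_profit = pearson_r(discount, profit)

print(round(r_sales_profit, 3), round(r_discount_profit, 3))
```

The sign gives the direction of the linear relationship and the magnitude (between 0 and 1) its strength, which is why small values in the heatmap do not support strong conclusions.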

# CatPlot for Sub-Category vs Profit
sns.catplot(x="Sub-Category", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)

In the above plot, we can see that Copiers made the highest profit among all the Sub-Category products. Other sub-categories such as Phones, Accessories, Appliances, Machines & Envelopes also made a considerable amount of profit.

However, Tables made a negligible contribution in terms of profit. Other products like Fasteners & Supplies didn't make much profit either.

View the entire code on my GitHub account: https://github.com/Anchit13/AnchitBhagat/blob/main/Super%20Store.ipynb

M.Sc Data Analytics (QUB 2021-22 batch) | Experienced HR Analyst | Travel-freak | French beginner
