# Exploratory Data Analysis — Retail using Python

## We will Visualize a Superstore Dataset related to different categories of Office Supplies, Furniture & Technological products, by conducting EDA on the dataset through the use of scatter plots & heatmaps. After that, we will try to figure out the variables which are directly / indirectly related to Profit variable.

# Read the file & display first 5 rows
# Rows containing duplicate data
duplicate_rows_super_store = super_store[super_store.duplicated()]
print("number of duplicate rows: ", duplicate_rows_super_store.shape)

We will also perform certain statistical inferences on the data (describe command), check the type of variables (info command), observe the shape & columns list (detailed code in my github account).

# Outlier Detection

#Checking for outliers
plt.figure(figsize=[10,5])
sns.set(style=”whitegrid”)
ax = sns.boxplot(data=super_store)
plt.show()

There are number of Outliers in Sales & Profit. Hence we need to remove them

def remove_outlier(col):
sorted(col)
Q1, Q3 = col.quantile([.25, .75])
IQR = Q3-Q1
lower_range=Q1-(1.5+IQR)
upper_range=Q3+(1.5+IQR)
return lower_range, upper_range

# Exploratory Data Analysis

## Analysing the categorical variables

super_store.Category.value_counts(ascending=False)
sns.countplot(x="Category", data=super_store) plt.show()
super_store.Segment.value_counts(ascending=False)
sns.catplot(x="Segment", y="Profit_r", kind='bar', hue='Category', data=super_store)

Though the Category of Office Supplies was more in number, Technology category brought the highest profit in all the 3 Segments

sns.countplot(x=”Segment”, data=super_store)
plt.show()
# CatPlot for State vs Profit
plt.figure(figsize=(15,8))
sns.catplot(x="State", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=10)
plt.show()

We can see Wyoming & Vermont are among the highest Profit States alongwith others like Montana, Rhode Island, Indiana & Minnesota. However, states like Ohio, Texas, Pennsylvania & Illinois are making losses.

super_store.Region.value_counts(ascending=False)
sns.countplot(x="Region", data=super_store)
plt.show()
# Using Lineplot
df = pd.DataFrame(dict(time=np.arange(500),
value=np.random.randn(500).cumsum()))
g = sns.relplot(x="Discount", y="Profit_r", hue = 'Category', kind="line", data=super_store)
g.fig.autofmt_xdate()

We can observe that by giving Discount of 10 %, we can increase our Profit for all the 3 categories. However, giving discount of more than 10% will make our Profit margin decrease and it keeps on decreasing as we increase the discount.

Hence an ideal percant of 10% discount is feasible.

# Plotting all the variables
sns.pairplot(super_store)
sns.set(style="whitegrid")

As we can see above in the pairplot, there is not much a relation among the variables

#plot correlation matrix
plt.figure(figsize=[20,10])
corr_mat = super_store.corr()
sns.set(style=”whitegrid”)
sns.heatmap(corr_mat, cmap=’coolwarm’, annot=True)

As you can see there seems to be a positive correlation between Sales & Profit. Since it is positive, it means with increase in Sales, Profit increases. There is a negative correlation between Discount & Profit.

However, the correlations are not strong enough for any conclusion to make.

# CatPlot for Sub-Category vs Profit
plt.figure(figsize=(15,8))
sns.catplot(x="Sub-Category", y="Profit_r", data=super_store, kind='bar', height=5, aspect=2, palette="Set1")
plt.xticks(rotation=90, fontsize=13)
plt.yticks(fontsize=10)
plt.show()

In the above plot, we can see that Copiers made the highest profit among all the Sub-Category products. Other categories are Phones, Accessories, Appliances, Machines & Envelopes which had made considerable amount of profit.

However, Tables made a neglible contribution in terms of profit. Other products like Fasteners & Supplies didn’t made much of either.

View the entire code at my github account: https://github.com/Anchit13/AnchitBhagat/blob/main/Super%20Store.ipynb

M.Sc Data Analytics (QUB 2021-22 batch) l Experienced HR Analyst l Travel-freak l French beginner l

## More from Anchit Bhagat

M.Sc Data Analytics (QUB 2021-22 batch) l Experienced HR Analyst l Travel-freak l French beginner l

## Skip the Data Preprocessing! Accessing 12 Ready-to-Go Datasets

Get the Medium app