This report analyzes the Brazilian E-commerce Public Dataset by Olist to identify the key drivers of negative customer reviews. Internal reports at Olist suggest a rise in poor customer satisfaction, which poses a threat to long-term growth.
The central business question this analysis seeks to answer is: “What are the key drivers of negative customer reviews, and how can we use these insights to improve the overall customer experience?”
The data for this analysis was prepared and cleaned using BigQuery
(SQL). The raw dataset, consisting of 9 separate .csv
files, was joined, cleaned, and enriched with several new features. The
final, cleaned datasets were then exported as CSV files for analysis in
R.
The key datasets loaded below are: *
simplified_seller_performance_review.csv
: Contains
order-level data with engineered features like delivery delays and
review categories. * Category_negative_review.csv
: Contains
aggregated data on review scores per product category. *
negative_review_distribution.csv
: Contains aggregated data
on review scores per seller geographic location.
# Load the datasets from the data/ folder
simplified_seller_performance_review <- read.csv("data/simplified_seller_performance_review.csv")
Category_negative_review <- read.csv("data/Category_negative_review.csv")
negative_review_distribution <- read.csv("data/negative_review_distribution.csv")
The analysis is structured around three key findings that directly address the guiding questions of the case study.
The first and most significant finding is the strong correlation between delivery delays and negative reviews. While not all negative reviews are caused by delays, a late delivery dramatically increases the likelihood of customer dissatisfaction.
The density plot below shows the distribution of reviews based on delivery time. There is a clear peak of negative reviews for orders delivered late (right of the dashed line). However, a smaller peak for negative reviews also exists for on-time or early deliveries, suggesting other factors are also at play.
# Density Plot
ggplot(simplified_seller_performance_review,
aes(x = delivery_diff_days, fill = review_category, color = review_category)) +
geom_density(alpha = 0.5) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
labs(title = "Delivery Delay Profiles for Positive vs. Negative Reviews",
x = "Delivery Delay (Days)",
y = "Density") +
theme_minimal(base_size = 14)
The scatter plot provides a more granular view, showing each individual order. It confirms that when an order is late, there is a very high concentration of negative reviews (red dots).
# Scatter Plot
ggplot(simplified_seller_performance_review,
aes(y = seller_processing_days,
x = delivery_diff_days,
color = review_category)) +
geom_jitter(alpha = 0.6, width = 0.3, height = 0.3) +
geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
annotate("text", x = -150, y = 80, label = "Delivered Early", hjust = 0, size = 4, color = "darkgreen") +
annotate("text", x = 100, y = 80, label = "Delivered Late", hjust = 0, size = 4, color = "darkred") +
scale_y_continuous(expand = c(0, 0), limits = c(-5, NA)) +
labs(title = 'Review vs Delivery Date',
x = 'Delivery Delays (Days)',
y = 'Seller Processing Days') +
theme_minimal(base_size = 14)
The analysis also identified specific product categories that have a disproportionately high rate of negative reviews. This suggests that for some products, issues with quality, damage, or inaccurate descriptions are also significant drivers of customer dissatisfaction.
The bar chart below shows the top 20 product categories with the highest percentage of negative reviews.
# Prepare data for the bar chart
category_data <- Category_negative_review
top_categories <- category_data %>%
arrange(desc(percentage_negative_reviews)) %>%
head(20)
# Bar Chart
ggplot(top_categories,
aes(x = percentage_negative_reviews,
y = reorder(category_name_display, percentage_negative_reviews),
fill = percentage_negative_reviews)) +
geom_col() +
scale_fill_gradient(low = "green", high = "red", name = "% Negative") +
labs(
title = "Review Score as per Product Category",
subtitle = "Top 20 categories with the highest percentage of negative reviews",
x = "Percentage Negative Reviews",
y = "Category Name"
) +
theme_minimal(base_size = 14) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold", size = 18),
axis.text.y = element_text(size = 12)
)
Finally, the analysis revealed geographic “hotspots” where sellers have a higher rate of negative reviews. While no single state is exclusively negative, there is a clear concentration of issues in the southeastern region of Brazil.
# Prepare map data
brazil_map <- ne_countries(scale = "medium", country = "Brazil", returnclass = "sf")
geographic_data <- negative_review_distribution
# Map Visualization
ggplot() +
geom_sf(data = brazil_map, fill = "gray90", color = "white") +
geom_point(data = geographic_data,
aes(x = geolocation_lng,
y = geolocation_lat,
size = total_reviews,
color = percentage_negative_reviews),
alpha = 0.6) +
scale_color_gradient(low = "green", high = "red", name = "% Negative Reviews") +
scale_size_continuous(range = c(2, 12), name = "Total Reviews") +
labs(
title = "Geographic Hotspots of Negative Reviews",
subtitle = "Seller locations in southeastern Brazil show a higher concentration of negative reviews",
caption = "Data Source: Olist Public Dataset"
) +
theme_void(base_size = 14) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
plot.background = element_rect(fill = "white", color = NA)
)
Based on the analysis, the following three actionable recommendations are proposed to the Olist Operations & Strategy Team:
processing_time
, Olist can directly
address the root cause of many negative reviews.