1. Introduction: The Business Problem

This report analyzes the Brazilian E-commerce Public Dataset by Olist to identify the key drivers of negative customer reviews. Internal reports at Olist suggest a rise in poor customer satisfaction, which poses a threat to long-term growth.

The central business question this analysis seeks to answer is: “What are the key drivers of negative customer reviews, and how can we use these insights to improve the overall customer experience?”


2. Data Preparation

The data for this analysis was prepared and cleaned using BigQuery (SQL). The raw dataset, consisting of 9 separate .csv files, was joined, cleaned, and enriched with several new features. The final, cleaned datasets were then exported as CSV files for analysis in R.

The key datasets loaded below are: * simplified_seller_performance_review.csv: Contains order-level data with engineered features like delivery delays and review categories. * Category_negative_review.csv: Contains aggregated data on review scores per product category. * negative_review_distribution.csv: Contains aggregated data on review scores per seller geographic location.

# Load the datasets from the data/ folder
simplified_seller_performance_review <- read.csv("data/simplified_seller_performance_review.csv")
Category_negative_review <- read.csv("data/Category_negative_review.csv")
negative_review_distribution <- read.csv("data/negative_review_distribution.csv")

3. Analysis & Visualizations

The analysis is structured around three key findings that directly address the guiding questions of the case study.

Finding 1: Late Delivery is a Major Driver of Negative Reviews

The first and most significant finding is the strong correlation between delivery delays and negative reviews. While not all negative reviews are caused by delays, a late delivery dramatically increases the likelihood of customer dissatisfaction.

The density plot below shows the distribution of reviews based on delivery time. There is a clear peak of negative reviews for orders delivered late (right of the dashed line). However, a smaller peak for negative reviews also exists for on-time or early deliveries, suggesting other factors are also at play.

# Density Plot
ggplot(simplified_seller_performance_review, 
       aes(x = delivery_diff_days, fill = review_category, color = review_category)) +
  geom_density(alpha = 0.5) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Delivery Delay Profiles for Positive vs. Negative Reviews",
       x = "Delivery Delay (Days)",
       y = "Density") +
  theme_minimal(base_size = 14)

The scatter plot provides a more granular view, showing each individual order. It confirms that when an order is late, there is a very high concentration of negative reviews (red dots).

# Scatter Plot
ggplot(simplified_seller_performance_review,
       aes(y = seller_processing_days,
           x = delivery_diff_days,
           color = review_category)) +
  geom_jitter(alpha = 0.6, width = 0.3, height = 0.3) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black", linewidth = 0.8) +
  annotate("text", x = -150, y = 80, label = "Delivered Early", hjust = 0, size = 4, color = "darkgreen") +
  annotate("text", x = 100, y = 80, label = "Delivered Late", hjust = 0, size = 4, color = "darkred") +
  scale_y_continuous(expand = c(0, 0), limits = c(-5, NA)) +
  labs(title = 'Review vs Delivery Date',
       x = 'Delivery Delays (Days)',
       y = 'Seller Processing Days') +
  theme_minimal(base_size = 14)

Finding 2: Certain Product Categories Consistently Underperform

The analysis also identified specific product categories that have a disproportionately high rate of negative reviews. This suggests that for some products, issues with quality, damage, or inaccurate descriptions are also significant drivers of customer dissatisfaction.

The bar chart below shows the top 20 product categories with the highest percentage of negative reviews.

# Prepare data for the bar chart
category_data <- Category_negative_review
top_categories <- category_data %>%
  arrange(desc(percentage_negative_reviews)) %>%
  head(20)

# Bar Chart
ggplot(top_categories, 
       aes(x = percentage_negative_reviews, 
           y = reorder(category_name_display, percentage_negative_reviews),
           fill = percentage_negative_reviews)) +
  geom_col() +
  scale_fill_gradient(low = "green", high = "red", name = "% Negative") +
  labs(
    title = "Review Score as per Product Category",
    subtitle = "Top 20 categories with the highest percentage of negative reviews",
    x = "Percentage Negative Reviews",
    y = "Category Name"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 18),
    axis.text.y = element_text(size = 12)
  )

Finding 3: Seller Performance Varies by Region

Finally, the analysis revealed geographic “hotspots” where sellers have a higher rate of negative reviews. While no single state is exclusively negative, there is a clear concentration of issues in the southeastern region of Brazil.

# Prepare map data
brazil_map <- ne_countries(scale = "medium", country = "Brazil", returnclass = "sf")
geographic_data <- negative_review_distribution

# Map Visualization
ggplot() +
  geom_sf(data = brazil_map, fill = "gray90", color = "white") +
  geom_point(data = geographic_data, 
             aes(x = geolocation_lng, 
                 y = geolocation_lat, 
                 size = total_reviews, 
                 color = percentage_negative_reviews),
             alpha = 0.6) +
  scale_color_gradient(low = "green", high = "red", name = "% Negative Reviews") +
  scale_size_continuous(range = c(2, 12), name = "Total Reviews") +
  labs(
    title = "Geographic Hotspots of Negative Reviews",
    subtitle = "Seller locations in southeastern Brazil show a higher concentration of negative reviews",
    caption = "Data Source: Olist Public Dataset"
  ) +
  theme_void(base_size = 14) +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5, color = "gray40"),
    plot.background = element_rect(fill = "white", color = NA)
  )


4. Recommendations

Based on the analysis, the following three actionable recommendations are proposed to the Olist Operations & Strategy Team:

  1. Launch a Tiered Seller Performance Program Focused on Shipping Speed.
    • Why: The data proves that shipping delays are the single biggest driver of customer dissatisfaction. By incentivizing sellers to reduce their processing_time, Olist can directly address the root cause of many negative reviews.
    • How: Establish clear KPIs for seller processing time; offer incentives like “fast shipper” badges or commission discounts; and implement a warning system for consistently slow sellers.
  2. Conduct a Quality Audit on the Lowest-Performing Product Categories.
    • Why: These categories underperform even when delivery times are accounted for, suggesting deeper issues with product quality or damage. This directly impacts customer trust and Olist’s brand reputation.
    • How: The Operations team should partner with the Seller Relationship team to investigate sellers in these categories, review product listings for accuracy, and consider stricter packaging requirements.
  3. Optimize Logistics Partnerships in Underperforming Geographic Regions.
    • Why: The geographic analysis indicates that systemic, regional issues are at play, likely related to the performance of specific logistics carriers.
    • How: Use the performance data to hold current logistics partners accountable in these regions, explore partnerships with new local carriers, and adjust delivery estimates to be more realistic in the short term.