Spicing Up Data Science: A Delicious Blend of Analytics and Gourmet Delights

Exploratory Data Analysis of Indian Food 101 Dataset


Zahid Asghar


June 24, 2024


In this analysis, we use the Indian Food 101 dataset from Kaggle (Prabhavalkar 2023), which I prefer to call the subcontinent food dataset. We start from the very basics of data handling, analysis, and visualization with R and the Tidyverse suite. The dataset contains information about various Indian dishes: their ingredients, diet type, preparation and cooking times, flavor profile, course, and state and region of origin. We explore the distribution of dishes by course, flavor profile, and diet type, analyze preparation and cooking times, and identify the most common ingredients used in Indian cuisine.

Load Data and Libraries

# Load necessary libraries and read the data
library(tidyverse)  # readr, dplyr, tidyr, purrr, ggplot2
library(gt)         # table rendering
food <- read_csv(here::here("data", "indian_food.csv"))

Inspection of Data

After loading the data, we can inspect the structure, summary statistics, and the first few rows of the dataset. This helps us understand the variables and their types. There are 255 rows and 9 columns in the dataset.

First 6 rows of the dataset:

# Display the first few rows and summary statistics
food |> head()
# A tibble: 6 × 9
  name  ingredients diet  prep_time cook_time flavor_profile course state region
  <chr> <chr>       <chr>     <dbl>     <dbl> <chr>          <chr>  <chr> <chr> 
1 Balu… Maida flou… vege…        45        25 sweet          desse… West… East  
2 Boon… Gram flour… vege…        80        30 sweet          desse… Raja… West  
3 Gaja… Carrots, m… vege…        15        60 sweet          desse… Punj… North 
4 Ghev… Flour, ghe… vege…        15        30 sweet          desse… Raja… West  
5 Gula… Milk powde… vege…        15        40 sweet          desse… West… East  
6 Imar… Sugar syru… vege…        10        50 sweet          desse… West… East  

Column names of the dataset:

[1] "name"           "ingredients"    "diet"           "prep_time"     
[5] "cook_time"      "flavor_profile" "course"         "state"         
[9] "region"        

Taxonomy of variables

It is important to understand the data types of each variable in the dataset. We can use the map_chr function from the purrr package to get the class of each variable.

food |> map_chr(class)
          name    ingredients           diet      prep_time      cook_time 
   "character"    "character"    "character"      "numeric"      "numeric" 
flavor_profile         course          state         region 
   "character"    "character"    "character"    "character" 

From the output, we can see that the dataset contains only character and numeric variables; none of the columns is stored as a factor. Each type calls for appropriate tools: for example, bar plots to visualize the distribution of categorical (character) variables and histograms for numeric variables. Once we have a good understanding of the data, we can proceed with exploratory data analysis (EDA) to uncover patterns and insights.
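As a minimal sketch of treating each type differently, the code below summarises character columns by their number of distinct values and numeric columns by their mean. It uses four rows copied from the Uttar Pradesh output shown later in this post; the object names (dishes, char_summary, num_summary) are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tibble)

# Four rows copied from the post's own output (Uttar Pradesh dishes)
dishes <- tribble(
  ~name,     ~diet,        ~prep_time, ~cook_time, ~course,
  "Jalebi",  "vegetarian", 10,         50,         "dessert",
  "Petha",   "vegetarian", 10,         30,         "dessert",
  "Kachori", "vegetarian", 30,         60,         "snack",
  "Kofta",   "vegetarian", 20,         40,         "main course"
)

# Character columns: how many distinct values does each one hold?
char_summary <- dishes |>
  summarise(across(where(is.character), n_distinct))

# Numeric columns: what is the average of each one?
num_summary <- dishes |>
  summarise(across(where(is.numeric), mean))

char_summary
num_summary
```

The same where(is.character) and where(is.numeric) selectors can be applied to the full food tibble.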

Exploratory Data Analysis (EDA)

Missing Value Analysis

It is important to check for missing values in the dataset before performing any analysis. Missing values can affect the results of statistical tests and machine learning models. We can use the colSums function to count the number of missing values in each column.

# Check for missing values in each column
colSums(is.na(food))
          name    ingredients           diet      prep_time      cook_time 
             1              0              0              0              0 
flavor_profile         course          state         region 
             0              0              0              1 

The output shows one missing value each in the name and region columns. We need to handle these missing values before proceeding with the analysis. Let's look at the region column first and decide whether its missing value can be filled.

# Display unique values in the region column
food %>%
  select(region) %>%
  distinct() |> gt()
North East

There is one NA value. Let's look at that row to see whether we can fill in the missing value.

# Display the row with missing region value
food %>%
  filter(is.na(region)) |> gt()
| name | ingredients | diet | prep_time | cook_time | flavor_profile | course | state | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Panjeeri | Whole wheat flour, musk melon seeds, poppy seeds, edible gum, semolina | vegetarian | 10 | 25 | sweet | dessert | Uttar Pradesh | NA |

The region is NA and the state is Uttar Pradesh. Since Uttar Pradesh is in North India, we can fill the missing value with North. This brings in another important data wrangling verb, mutate, which creates new columns or modifies existing ones. We use the if_else function inside mutate to replace the missing region value with ‘North’.

# Fill missing region value with 'North'
food <- food %>%
  mutate(region = if_else(is.na(region), "North", region))

food |> filter(state=="Uttar Pradesh") |> gt()
| name | ingredients | diet | prep_time | cook_time | flavor_profile | course | state | region |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jalebi | Maida, corn flour, baking soda, vinegar, curd, water, turmeric, saffron, cardamom | vegetarian | 10 | 50 | sweet | dessert | Uttar Pradesh | North |
| Petha | Firm white pumpkin, sugar, kitchen lime, alum powder | vegetarian | 10 | 30 | sweet | dessert | Uttar Pradesh | North |
| Rabri | Condensed milk, sugar, spices, nuts | vegetarian | 10 | 45 | sweet | dessert | Uttar Pradesh | North |
| Sohan halwa | Corn flour, ghee, dry fruits | vegetarian | 10 | 60 | sweet | dessert | Uttar Pradesh | North |
| Kachori | Moong dal, rava, garam masala, dough, fennel seeds | vegetarian | 30 | 60 | spicy | snack | Uttar Pradesh | North |
| Kofta | Paneer, potato, cream, corn flour, garam masala | vegetarian | 20 | 40 | spicy | main course | Uttar Pradesh | North |
| Lauki ke kofte | Bottle gourd, garam masala powder, gram flour, ginger, chillies | vegetarian | 20 | 40 | spicy | main course | Uttar Pradesh | North |
| Navrattan korma | Green beans, potatoes, khus khus, low fat, garam masala powder | vegetarian | 25 | 40 | spicy | main course | Uttar Pradesh | North |
| Panjeeri | Whole wheat flour, musk melon seeds, poppy seeds, edible gum, semolina | vegetarian | 10 | 25 | sweet | dessert | Uttar Pradesh | North |

The missing region value has been replaced with North, so every Uttar Pradesh dish now has a region. Now we can proceed with the analysis.
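The if_else call above fills every missing region with North, which is safe here only because a single row is affected. When several states have gaps, one possible approach is to look up each state's region from the rows where it is known. The sketch below illustrates that idea on three rows taken from the output above; state_region, known_region, and filled are hypothetical names, and the approach assumes each state maps to exactly one region.

```r
library(dplyr)
library(tibble)

# Three rows from the post's output; Panjeeri has the missing region
dishes <- tribble(
  ~name,      ~state,          ~region,
  "Petha",    "Uttar Pradesh", "North",
  "Kachori",  "Uttar Pradesh", "North",
  "Panjeeri", "Uttar Pradesh", NA
)

# Build a state -> region lookup from rows where the region is known
state_region <- dishes |>
  filter(!is.na(region)) |>
  distinct(state, region) |>
  rename(known_region = region)

# Fill a missing region with the region recorded for the same state
filled <- dishes |>
  left_join(state_region, by = "state") |>
  mutate(region = coalesce(region, known_region)) |>
  select(-known_region)

filled
```

If a state appeared with more than one region, the join would duplicate rows, so the lookup should be checked before it is used on the full data.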

Detailed Category Counts

We can use the count function from the dplyr package to count the number of entries in each category and arrange them in descending order. This helps us understand the distribution of dishes by course, flavor profile, and diet type.

food %>%
  count(course) %>%
  arrange(desc(n)) |> gt()
Table 1: Distribution of Dishes by Course
course n
main course 129
dessert 85
snack 39
starter 2

Table 1 indicates that the majority of the dishes are classified as main course (129) or dessert (85), followed by snack (39) and starter (2). We can also count the number of dishes by flavor profile and diet type.
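A short sketch of the same counting step for flavor profile is given below. It relies on count(..., sort = TRUE), which sorts by descending count without a separate arrange call, and adds a share column. The seven rows are copied from the Uttar Pradesh output earlier in this post, so the numbers describe that small subset only; flavor_counts is an illustrative name.

```r
library(dplyr)
library(tibble)

# Seven Uttar Pradesh dishes copied from the post's output
dishes <- tribble(
  ~name,             ~flavor_profile, ~course,
  "Jalebi",          "sweet",         "dessert",
  "Petha",           "sweet",         "dessert",
  "Rabri",           "sweet",         "dessert",
  "Sohan halwa",     "sweet",         "dessert",
  "Kachori",         "spicy",         "snack",
  "Kofta",           "spicy",         "main course",
  "Navrattan korma", "spicy",         "main course"
)

# Count dishes per flavor profile, largest group first, with proportions
flavor_counts <- dishes |>
  count(flavor_profile, sort = TRUE) |>
  mutate(share = n / sum(n))

flavor_counts
```

Replacing flavor_profile with diet gives the corresponding table for diet type.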

Visualization of Data

Let's look at the distribution of dishes by course and flavor profile using bar plots built with the ggplot2 package. The geom_col function draws bars whose lengths are the counts we computed, and the facet_wrap function splits a plot into multiple panels based on a categorical variable.

food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = flavor_profile)) +
  geom_col() +
  labs(title = 'Distribution of Courses by Flavor Profile') +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")
Figure 1: Distribution of Courses by Flavor Profile

We often use stacked bar plots to visualize how a count breaks down across a second variable. One drawback of stacked bar plots is that the segments above the bottom one do not share a common baseline, which makes it difficult to compare counts across categories.

Small multiples

Small multiples are a powerful visualization technique that allows us to compare multiple plots side by side. We can use the facet_wrap function in ggplot2 to create small multiples based on a categorical variable. In this case, we will create small multiples for the distribution of dishes by course, with one panel per flavor profile.

food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = flavor_profile)) +
  geom_col() +
  labs(title = 'Distribution of Courses by Flavor Profile', x = "", y = "") +
  facet_wrap(~flavor_profile, scales = "free") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")
Figure 2: Distribution of Dishes by Course and Flavor Profile

Comparing Figure 1 and Figure 2, we can see that small multiples are more effective for comparing the distribution of dishes by course and flavor profile. We can also create small multiples for other categorical variables to gain more insights from the data.
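One design choice is worth a hedged note: the code for Figure 2 sets scales = "free", so each panel gets its own axis and bar lengths cannot be compared directly across panels. If cross-panel comparison matters, the default fixed scales of facet_wrap may be the better option. The sketch below illustrates this with the course by flavor profile counts that appear in a table later in this post; course_flavor and p are illustrative names.

```r
library(ggplot2)
library(tibble)

# Counts copied from the course x flavor_profile table in this post
course_flavor <- tribble(
  ~course,       ~flavor_profile, ~n,
  "dessert",     "sweet",         85,
  "main course", "-1",            26,
  "main course", "bitter",         3,
  "main course", "sour",           1,
  "main course", "spicy",         96,
  "main course", "sweet",          3,
  "snack",       "-1",             3,
  "snack",       "bitter",         1,
  "snack",       "spicy",         35,
  "starter",     "spicy",          2
)

# Default (fixed) scales: all panels share the same axes,
# so a bar of a given length means the same count in every panel
p <- ggplot(course_flavor, aes(x = n, y = course)) +
  geom_col() +
  facet_wrap(~flavor_profile) +
  labs(title = "Dishes by Course, One Panel per Flavor Profile",
       x = NULL, y = NULL) +
  theme_minimal()

p
```

The trade-off is that small groups such as sour become hard to read, which is presumably why free scales were used in Figure 2.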

Analysis of Preparation and Cooking Times

When there are two continuous variables in the dataset, we can use scatter plots to visualize the relationship between them. In this case, we can create a scatter plot of preparation time vs cooking time to see if there is any relationship between the two variables.

ggplot(food, aes(x = prep_time, y = cook_time, color = diet)) + 
  geom_point() +
  labs(title = "Preparation Time vs Cooking Time") +
  theme_minimal() + theme(legend.position = "top")
Figure 3: Preparation Time vs Cooking Time

Figure 3 shows a scatter plot of preparation time vs cooking time for different diet types. There is no clear relationship between the two variables, and the points are scattered across the plot. One reason may be that the dataset contains some unusually large values, which squeeze most of the points into one corner of the plot. We need to check whether these values are valid.
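One common convention for flagging unusually large values, not necessarily the right one for this dataset, is the 1.5 × IQR rule: anything above the third quartile plus 1.5 times the interquartile range is marked for inspection. The sketch below applies it to total time for eleven dishes copied from tables in this post (nine Uttar Pradesh dishes plus Shrikhand and Pindi chana). Because this is a hand-picked subset, the fence it computes will differ from the one obtained on the full data; times and flagged are illustrative names.

```r
library(dplyr)
library(tibble)

# Rows copied from the post's tables (times in minutes)
times <- tribble(
  ~name,             ~prep_time, ~cook_time,
  "Jalebi",          10,         50,
  "Petha",           10,         30,
  "Rabri",           10,         45,
  "Sohan halwa",     10,         60,
  "Kachori",         30,         60,
  "Kofta",           20,         40,
  "Lauki ke kofte",  20,         40,
  "Navrattan korma", 25,         40,
  "Panjeeri",        10,         25,
  "Shrikhand",       10,         720,
  "Pindi chana",     500,        120
)

# Flag dishes whose total time exceeds Q3 + 1.5 * IQR
flagged <- times |>
  mutate(
    tot_time    = prep_time + cook_time,
    upper_fence = quantile(tot_time, 0.75, names = FALSE) + 1.5 * IQR(tot_time),
    is_outlier  = tot_time > upper_fence
  )

flagged |> filter(is_outlier)
```

A flagged dish is a candidate for checking, not automatically an error: long times can be genuine when a recipe includes soaking or fermenting.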

Foods with maximum preparation and cooking time

We can use the slice_max function from the dplyr package to find the foods with the maximum preparation and cooking time. This helps us identify the dishes that take the longest time to prepare and cook.

food %>% mutate(tot_time = prep_time + cook_time) |> 
  slice_max(order_by = tot_time, n = 5) %>%
  select(name, prep_time, cook_time, tot_time) |> gt()
Table 2: Foods with Maximum Preparation and Cooking Time
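| name | prep_time | cook_time | tot_time |
| --- | --- | --- | --- |
| Shrikhand | 10 | 720 | 730 |
| Pindi chana | 500 | 120 | 620 |
| Puttu | 495 | 40 | 535 |
| Misti doi | 480 | 30 | 510 |
| Dosa | 360 | 90 | 450 |
| Idli | 360 | 90 | 450 |
| Masala Dosa | 360 | 90 | 450 |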

Table 2 shows the dishes with the largest total time, tot_time, computed as the sum of preparation and cooking time. Among them, Pindi chana has the longest preparation time (500 minutes) and Shrikhand the longest cooking time (720 minutes). Although we asked for n = 5, the table has seven rows because slice_max keeps ties by default and three dishes share a total of 450 minutes.
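The tie behaviour of slice_max can be checked on the rows of Table 2 themselves. In this small sketch the object names are illustrative; with_ties is a documented argument of slice_max.

```r
library(dplyr)
library(tibble)

# Rows copied from Table 2
longest <- tribble(
  ~name,         ~prep_time, ~cook_time,
  "Shrikhand",   10,         720,
  "Pindi chana", 500,        120,
  "Puttu",       495,        40,
  "Misti doi",   480,        30,
  "Dosa",        360,        90,
  "Idli",        360,        90,
  "Masala Dosa", 360,        90
) |>
  mutate(tot_time = prep_time + cook_time)

# Default: rows tied with the 5th largest value are all kept
top_with_ties <- longest |> slice_max(tot_time, n = 5)

# with_ties = FALSE: exactly n rows are returned
top_no_ties <- longest |> slice_max(tot_time, n = 5, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_no_ties)
```

Keeping ties is usually the safer default for exploration, since dropping them silently hides dishes that are just as slow as the ones shown.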

Let's keep only the dishes with tot_time below 200 minutes and redraw the scatter plot.

food %>% mutate(tot_time = prep_time + cook_time) |> 
  filter(tot_time < 200) |> 
  ggplot(aes(x = prep_time, y = cook_time, color = diet)) + 
  geom_point() +
  labs(title = "Preparation Time vs Cooking Time (Total Time < 200 minutes)") +
  theme_minimal() + theme(legend.position = "top")
Figure 4: Preparation Time vs Cooking Time (Total Time < 200 minutes)

Group by State and Region

We can use the group_by and summarize functions from the dplyr package to group the data by state and region and calculate the total number of dishes in each group. This helps us understand the distribution of dishes by state and region.

food %>%
  group_by(state) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n=10) |> gt()
Table 3: Number of Dishes by State
state n
Gujarat 35
Punjab 32
Maharashtra 30
-1 24
West Bengal 24
Assam 21
Tamil Nadu 20
Andhra Pradesh 10
Uttar Pradesh 9
Kerala 8

Table 3 shows the ten states with the most dishes. Gujarat has the highest number of dishes, followed by Punjab and Maharashtra. The -1 row (24 dishes) is not a state; it appears to be a placeholder for dishes whose state is not recorded. We can also group the data by region and calculate the total number of dishes in each region.

food %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n=10) |> gt()
Table 4: Number of Dishes by Region
region n
West 74
South 59
North 50
East 31
North East 25
-1 13
Central 3

Table 4 shows the number of dishes by region. The West region has the highest number of dishes, followed by the South and North regions. We can also visualize the distribution of dishes by state and region using bar plots.
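Tables 3 and 4 both contain a -1 category. Assuming -1 is a placeholder for an unknown value (the post does not confirm this), one option is to convert it to NA before counting, so that it is handled as missing rather than as a real region. The sketch below rebuilds one row per dish from the counts in Table 4 and then uses na_if; it also uses count(..., sort = TRUE) as a shorter equivalent of group_by, summarize, and arrange. The names region_counts, dishes, and cleaned are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Counts copied from Table 4
region_counts <- tribble(
  ~region,      ~n,
  "West",       74,
  "South",      59,
  "North",      50,
  "East",       31,
  "North East", 25,
  "-1",         13,
  "Central",     3
)

# Expand the counts back to one row per dish
dishes <- region_counts |> uncount(n)

# Treat "-1" as missing (an assumption), then count by region
cleaned <- dishes |>
  mutate(region = na_if(region, "-1")) |>
  count(region, sort = TRUE)

cleaned
```

The same na_if step would apply to state and flavor_profile, where -1 also appears in the tables of this post.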

Visualize state and region data

We can visualize the distribution of dishes by state and region with bar plots, using the geom_col function in ggplot2 on the counts computed above.

food %>%
  group_by(state) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n=10) |> 
  ggplot(aes(x = reorder(state, n), y = n, fill = state)) +  
  geom_col() + # geom_text(aes(label = n), vjust = -0.5, size = 3) +
  labs(title = 'Number of Dishes by State', x = "", y = "") +
  coord_flip() + 
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0)  # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
Figure 5: Number of Dishes by State
food %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n=10) |> 
  ggplot(aes(x = reorder(region, n), y = n, fill = region)) +  
  geom_col() + # geom_text(aes(label = n), vjust = -0.5, size = 3) +
  labs(title = 'Number of Dishes by Region', x = "", y = "") +
  coord_flip() + 
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0)  # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
Figure 6: Number of Dishes by Region
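The bar plots in this post repeat the same pipeline with a different grouping column each time. One way to avoid the repetition is a small helper that takes the column as an argument using the embrace operator. The function name plot_top_counts and its arguments are hypothetical, and the demo data is rebuilt from the region counts in Table 4; this is a sketch, not the code used to produce the figures.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

# Hypothetical helper: bar plot of the n_top most frequent values of `var`
plot_top_counts <- function(data, var, n_top = 10, title = NULL) {
  data |>
    count({{ var }}, name = "n") |>
    slice_max(n, n = n_top, with_ties = FALSE) |>
    ggplot(aes(x = reorder({{ var }}, n), y = n, fill = {{ var }})) +
    geom_col() +
    labs(title = title, x = NULL, y = NULL) +
    coord_flip() +
    theme_minimal() +
    theme(legend.position = "none") +
    scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
}

# Demo data: one row per dish, rebuilt from the counts in Table 4
dishes <- tribble(
  ~region,      ~n,
  "West",       74,
  "South",      59,
  "North",      50,
  "East",       31,
  "North East", 25,
  "-1",         13,
  "Central",     3
) |>
  uncount(n)

p <- plot_top_counts(dishes, region, title = "Number of Dishes by Region")
p
```

With the full data, plot_top_counts(food, state) would draw the corresponding plot for states.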

Ingredients analysis

We can analyze the ingredients used in Indian dishes to identify the most common ingredients. We can split the ingredients column into individual ingredients, count the frequency of each ingredient, and visualize the most common ingredients using a bar plot.

food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ", ")) %>%
  unnest(Ingredients) %>%
  count(Ingredients) %>%
  arrange(desc(n)) |> slice_head(n=10) |> gt()
Table 5: Most Common Ingredients in Indian Dishes
Ingredients n
sugar 44
ginger 29
garam masala 27
curry leaves 25
ghee 25
jaggery 19
urad dal 17
Rice flour 16
milk 15
tomato 15
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ", ")) %>%
  unnest(Ingredients) %>%
  count(Ingredients) %>%
  arrange(desc(n)) |> slice_head(n=10) |> 
  ggplot(aes(x = reorder(Ingredients, n), y = n, fill = Ingredients)) +  
  geom_col() +  
  labs(title = 'Most Common Ingredients in Indian Dishes', x = "", y = "") +
  coord_flip() + 
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0)  # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
Figure 7: Most Common Ingredients in Indian Dishes
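Table 5 lists "Rice flour" with a capital letter next to lowercase entries, which suggests that the same ingredient may be counted under different spellings. A hedged sketch of one possible clean-up is to trim whitespace and lowercase each ingredient before counting. It uses four dishes copied from the Uttar Pradesh output above and separate_longer_delim, which assumes tidyr 1.3.0 or later; ingredient_counts is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

# Four dishes copied from the post's output
dishes <- tribble(
  ~name,         ~ingredients,
  "Jalebi",      "Maida, corn flour, baking soda, vinegar, curd, water, turmeric, saffron, cardamom",
  "Sohan halwa", "Corn flour, ghee, dry fruits",
  "Kachori",     "Moong dal, rava, garam masala, dough, fennel seeds",
  "Kofta",       "Paneer, potato, cream, corn flour, garam masala"
)

# One row per ingredient, normalised to lowercase without stray spaces
ingredient_counts <- dishes |>
  separate_longer_delim(ingredients, delim = ",") |>
  mutate(ingredient = str_to_lower(str_squish(ingredients))) |>
  count(ingredient, sort = TRUE)

head(ingredient_counts, 3)
```

Without the lowercase step, "Corn flour" and "corn flour" would be counted as two different ingredients in this sample.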

Additional Visualizations and Summary Tables

Pie Chart of Dietary Preferences

# Vegetarian vs Non-Vegetarian distribution
food %>%
  count(diet) %>%
  ggplot(aes(x = "", y = n, fill = diet)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Diet Composition of Indian Foods")

Advanced Table Visualizations

# Using `gt` to create advanced table visualizations for summary
food %>%
  count(course, flavor_profile) %>%
  gt()
| course | flavor_profile | n |
| --- | --- | --- |
| dessert | sweet | 85 |
| main course | -1 | 26 |
| main course | bitter | 3 |
| main course | sour | 1 |
| main course | spicy | 96 |
| main course | sweet | 3 |
| snack | -1 | 3 |
| snack | bitter | 1 |
| snack | spicy | 35 |
| starter | spicy | 2 |

Summary Table of Food Categories

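janitor::tabyl(food, course, flavor_profile) |> gt()
| course | -1 | bitter | sour | spicy | sweet |
| --- | --- | --- | --- | --- | --- |
| dessert | 0 | 0 | 0 | 0 | 85 |
| main course | 26 | 3 | 1 | 96 | 3 |
| snack | 3 | 1 | 0 | 35 | 0 |
| starter | 0 | 0 | 0 | 2 | 0 |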
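The same cross-tabulation can also be sketched with tidyr alone, by reshaping the long course and flavor profile counts into a wide table with pivot_wider. The counts below are copied from the table in the previous section; course_flavor and crosstab are illustrative names.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Long-format counts copied from the course x flavor_profile table
course_flavor <- tribble(
  ~course,       ~flavor_profile, ~n,
  "dessert",     "sweet",         85,
  "main course", "-1",            26,
  "main course", "bitter",         3,
  "main course", "sour",           1,
  "main course", "spicy",         96,
  "main course", "sweet",          3,
  "snack",       "-1",             3,
  "snack",       "bitter",         1,
  "snack",       "spicy",         35,
  "starter",     "spicy",          2
)

# One row per course, one column per flavor profile;
# combinations that never occur are filled with 0
crosstab <- course_flavor |>
  pivot_wider(
    names_from  = flavor_profile,
    values_from = n,
    values_fill = 0,
    names_sort  = TRUE
  )

crosstab
```

On the full data, food |> count(course, flavor_profile) would supply the long-format input.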

Text data analysis

Word Cloud of Ingredients

# Word cloud of ingredients

food %>%
  # Assuming 'ingredients' is a column that contains a string of ingredients separated by commas
  mutate(Ingredients = strsplit(as.character(ingredients), ",\\s*")) %>%
  unnest(Ingredients) %>%
  count(Ingredients, name = "Frequency") %>%
  arrange(desc(Frequency)) %>%
  # Optionally slice the top 100 most frequent ingredients
  slice_head(n = 100) %>%
  wordcloud2::wordcloud2(size = 0.5)  # Adjust the size parameter as needed
Figure 8: Word cloud of ingredients


