In this analysis, we use the Indian Food 101 dataset from Kaggle (Prabhavalkar 2023), which I will refer to as the subcontinent food dataset, working through the basics of data handling, analysis, and visualization with R and the Tidyverse suite. The dataset contains information about various Indian dishes, their ingredients, dietary preferences, and preparation times. We will explore the dataset to understand the distribution of dishes by course, flavor profile, and diet type. We will also analyze the preparation and cooking times of the dishes and identify the most common ingredients used in Indian cuisine.
# Load necessary libraries and read the data
library(tidyverse)
library(gt)
library(gtsummary)
library(janitor)

food <- read_csv(here::here("data", "indian_food.csv"))
After loading the data, we can inspect the structure, summary statistics, and the first few rows of the dataset. This helps us understand the variables and their types. There are 255 rows and 9 columns in the dataset.
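One quick way to do this is sketched below (glimpse comes with dplyr, which the tidyverse loads):

# Dimensions, column types, and summary statistics
dim(food)      # should report 255 rows and 9 columns
glimpse(food)  # each column's type and first few values
summary(food)  # basic summary statistics per column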
First 6 rows of the dataset:
# Display the first few rows
food |> head()
# A tibble: 6 × 9
name ingredients diet prep_time cook_time flavor_profile course state region
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 Balu… Maida flou… vege… 45 25 sweet desse… West… East
2 Boon… Gram flour… vege… 80 30 sweet desse… Raja… West
3 Gaja… Carrots, m… vege… 15 60 sweet desse… Punj… North
4 Ghev… Flour, ghe… vege… 15 30 sweet desse… Raja… West
5 Gula… Milk powde… vege… 15 40 sweet desse… West… East
6 Imar… Sugar syru… vege… 10 50 sweet desse… West… East
colnames(food)
[1] "name" "ingredients" "diet" "prep_time"
[5] "cook_time" "flavor_profile" "course" "state"
[9] "region"
It is important to understand the data types of each variable in the dataset. We can use the map_chr function from the purrr package to get the class of each variable.
food |> map_chr(class)
name ingredients diet prep_time cook_time
"character" "character" "character" "numeric" "numeric"
flavor_profile course state region
"character" "character" "character" "character"
From the output, we can see that the dataset contains a mix of character and numeric variables. One needs to use the appropriate data manipulation functions to handle each type of variable. For example, we can use bar plots to visualize the distribution of categorical variables and histograms for numeric variables. Once we have a good understanding of the data, we can proceed with exploratory data analysis (EDA) to uncover patterns and insights.
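For instance, a histogram of the numeric cook_time column might look like this (a minimal sketch; the 10-minute binwidth is an arbitrary choice):

# Histogram of cooking times; binwidth is illustrative
ggplot(food, aes(x = cook_time)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Distribution of Cooking Times",
       x = "Cooking time (minutes)", y = "Count") +
  theme_minimal()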
It is important to check for missing values in the dataset before performing any analysis. Missing values can affect the results of statistical tests and machine learning models. We can use the colSums function to count the number of missing values in each column.
# Check for missing values in each column
colSums(is.na(food))
name ingredients diet prep_time cook_time
1 0 0 0 0
flavor_profile course state region
0 0 0 1
The output shows that there is one missing value in the name column and one in the region column. We need to handle these missing values before proceeding with the analysis. Let's look at the region column and consider whether we can fill in the missing value.
# Display unique values in the region column
food %>%
  select(region) %>%
  distinct() |> gt()
| region     |
|------------|
| East       |
| West       |
| North      |
| -1         |
| North East |
| South      |
| Central    |
| NA         |
There is one NA value; let's inspect that row to see whether we can fill in the missing value.
# Display the row with missing region value
food %>%
  filter(is.na(region)) |> gt()
| name     | ingredients                                                            | diet       | prep_time | cook_time | flavor_profile | course  | state         | region |
|----------|------------------------------------------------------------------------|------------|-----------|-----------|----------------|---------|---------------|--------|
| Panjeeri | Whole wheat flour, musk melon seeds, poppy seeds, edible gum, semolina | vegetarian | 10        | 25        | sweet          | dessert | Uttar Pradesh | NA     |
One can see that region is NA and state is Uttar Pradesh. We can fill in the missing value with North, as Uttar Pradesh is in North India. Here we meet another important data wrangling verb, mutate, which creates new columns or modifies existing ones. We can use the if_else function to fill the missing region value with 'North'.
# Fill missing region value with 'North'
food <- food %>%
  mutate(region = if_else(is.na(region), "North", region))

food |> filter(state == "Uttar Pradesh") |> gt()
| name            | ingredients                                                                       | diet       | prep_time | cook_time | flavor_profile | course      | state         | region |
|-----------------|-----------------------------------------------------------------------------------|------------|-----------|-----------|----------------|-------------|---------------|--------|
| Jalebi          | Maida, corn flour, baking soda, vinegar, curd, water, turmeric, saffron, cardamom | vegetarian | 10        | 50        | sweet          | dessert     | Uttar Pradesh | North  |
| Petha           | Firm white pumpkin, sugar, kitchen lime, alum powder                              | vegetarian | 10        | 30        | sweet          | dessert     | Uttar Pradesh | North  |
| Rabri           | Condensed milk, sugar, spices, nuts                                               | vegetarian | 10        | 45        | sweet          | dessert     | Uttar Pradesh | North  |
| Sohan halwa     | Corn flour, ghee, dry fruits                                                      | vegetarian | 10        | 60        | sweet          | dessert     | Uttar Pradesh | North  |
| Kachori         | Moong dal, rava, garam masala, dough, fennel seeds                                | vegetarian | 30        | 60        | spicy          | snack       | Uttar Pradesh | North  |
| Kofta           | Paneer, potato, cream, corn flour, garam masala                                   | vegetarian | 20        | 40        | spicy          | main course | Uttar Pradesh | North  |
| Lauki ke kofte  | Bottle gourd, garam masala powder, gram flour, ginger, chillies                   | vegetarian | 20        | 40        | spicy          | main course | Uttar Pradesh | North  |
| Navrattan korma | Green beans, potatoes, khus khus, low fat, garam masala powder                    | vegetarian | 25        | 40        | spicy          | main course | Uttar Pradesh | North  |
| Panjeeri        | Whole wheat flour, musk melon seeds, poppy seeds, edible gum, semolina            | vegetarian | 10        | 25        | sweet          | dessert     | Uttar Pradesh | North  |
Perfect: we have replaced the missing region value with North, as Uttar Pradesh is in North India. Now we can proceed with the analysis.
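As a quick sanity check (an extra step, not strictly necessary), we can re-run the missing-value count to confirm the fix; note that the single NA in the name column is still present:

# Re-check missing values: region should now be 0, name still has 1
colSums(is.na(food))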
We can use the count function from the dplyr package to count the number of entries in each category and arrange them in descending order. This helps us understand the distribution of dishes by course, flavor profile, and diet type.
food %>%
  count(course) %>%
  arrange(desc(n)) |> gt()
| course      | n   |
|-------------|-----|
| main course | 129 |
| dessert     | 85  |
| snack       | 39  |
| starter     | 2   |
Table 1 indicates that the majority of the dishes are classified as main course and dessert, followed by snack and starter. We can also count the number of dishes by flavor profile and diet type, as sketched below.
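The same count pattern applies to the other categorical columns (a minimal sketch):

# Count dishes by flavor profile and by diet, most frequent first
food %>% count(flavor_profile, sort = TRUE)
food %>% count(diet, sort = TRUE)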
Let's see the distribution of dishes by course and flavor profile using bar plots. We can use the ggplot2 package to create the plots. The geom_col function creates the bars, and the facet_wrap function splits the plot into panels based on a categorical variable.
food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = flavor_profile)) +
  geom_col() +
  labs(title = 'Distribution of Courses by Flavor Profile') + # facet_wrap(~flavor_profile, scales = "free") +
  coord_flip() + theme_minimal() + theme(legend.position = "none")
We often use stacked bar plots to visualize distributions across two categorical variables. One drawback of stacked bar plots is that it is difficult to compare counts across categories, because the stacked segments lack a common baseline; a dodged layout, sketched below, is one workaround.
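A dodged bar chart (shown here as a sketch, not one of the original figures) places every flavor profile side by side so each bar starts at zero:

# Dodged bars give every category a common baseline
food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = course, y = n, fill = flavor_profile)) +
  geom_col(position = "dodge") +
  coord_flip() +
  theme_minimal()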
Small multiples are a powerful visualization technique that lets us compare multiple plots side by side. We can use the facet_wrap function in ggplot2 to create small multiples based on a categorical variable. Here we create small multiples for the distribution of dishes by course and flavor profile.
food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = flavor_profile)) +
  geom_col() +
  labs(title = 'Distribution of Courses by Flavor Profile', x = "", y = "") +
  facet_wrap(~flavor_profile, scales = "free") +
  coord_flip() + theme_minimal() + theme(legend.position = "none")
Now if we compare Figure 1 and Figure 2, we can see that small multiples are more effective for comparing the distribution of dishes by course and flavor profile. We can also create small multiples for other categorical variables to gain more insights from the data; an example follows.
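For instance, the same small-multiples pattern faceted by diet (a variation not shown in the original figures):

# Small multiples of course counts, one panel per diet type
food %>%
  count(course, diet) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = diet)) +
  geom_col() +
  facet_wrap(~diet, scales = "free") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")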
When there are two continuous variables in the dataset, we can use scatter plots to visualize the relationship between them. In this case, we can create a scatter plot of preparation time vs cooking time to see if there is any relationship between the two variables.
ggplot(food, aes(x = prep_time, y = cook_time, color = diet)) +
geom_point() +
labs(title = "Preparation Time vs Cooking Time") +
theme_minimal() + theme(legend.position = "top")
Figure 3 shows a scatter plot of preparation time vs cooking time for different diet types. We can see that there is no clear relationship between the two variables, and the points are scattered across the plot. One reason may be that there are some unusually large values in the dataset; one has to explore whether these values are valid or not.
We can use the slice_max function from the dplyr package to find the foods with the maximum preparation and cooking times. This helps us identify the dishes that take the longest to prepare and cook.
food %>% mutate(tot_time = prep_time + cook_time) |>
  slice_max(order_by = tot_time, n = 5) %>%
  select(name, prep_time, cook_time, tot_time) |> gt()
| name        | prep_time | cook_time | tot_time |
|-------------|-----------|-----------|----------|
| Shrikhand   | 10        | 720       | 730      |
| Pindi chana | 500       | 120       | 620      |
| Puttu       | 495       | 40        | 535      |
| Misti doi   | 480       | 30        | 510      |
| Dosa        | 360       | 90        | 450      |
| Idli        | 360       | 90        | 450      |
| Masala Dosa | 360       | 90        | 450      |
Table 2 shows the foods with the longest total times. We can see that Pindi chana has the maximum preparation time of 500 minutes, while Shrikhand has the maximum cooking time of 720 minutes. We calculated the total time for each dish by adding the preparation and cooking times; although we requested the top 5, slice_max keeps ties, so seven dishes appear.
Let's filter the data to dishes with tot_time less than 200 minutes and visualize them with a scatter plot.
food %>% mutate(tot_time = prep_time + cook_time) |>
  filter(tot_time < 200) |>
  ggplot(aes(x = prep_time, y = cook_time, color = diet)) +
  geom_point() +
  labs(title = "Preparation Time vs Cooking Time (Total Time < 200 minutes)") +
  theme_minimal() + theme(legend.position = "top")
We can use the group_by and summarize functions from the dplyr package to group the data by state and region and calculate the total number of dishes in each group. This helps us understand the distribution of dishes by state and region.
food %>%
  group_by(state) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |> gt()
| state          | n  |
|----------------|----|
| Gujarat        | 35 |
| Punjab         | 32 |
| Maharashtra    | 30 |
| -1             | 24 |
| West Bengal    | 24 |
| Assam          | 21 |
| Tamil Nadu     | 20 |
| Andhra Pradesh | 10 |
| Uttar Pradesh  | 9  |
| Kerala         | 8  |
Table 3 shows the number of dishes by state. We can see that Gujarat has the highest number of dishes, followed by Punjab and Maharashtra. The -1 entry appears to be a placeholder for dishes with no recorded state. We can also group the data by region and calculate the total number of dishes in each region.
food %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |> gt()
| region     | n  |
|------------|----|
| West       | 74 |
| South      | 59 |
| North      | 50 |
| East       | 31 |
| North East | 25 |
| -1         | 13 |
| Central    | 3  |
Table 4 shows the number of dishes by region. We can see that the West region has the highest number of dishes, followed by the South and North regions.
We can visualize the distribution of dishes by state and region using bar plots, created with the geom_col function in ggplot2.
food %>%
  group_by(state) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |>
  ggplot(aes(x = reorder(state, n), y = n, fill = state)) +
  geom_col() + # geom_text(aes(label = n), vjust = -0.5, size = 3) +
  labs(title = 'Number of Dishes by State', x = "", y = "") +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0) # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
food %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |>
  ggplot(aes(x = reorder(region, n), y = n, fill = region)) +
  geom_col() + # geom_text(aes(label = n), vjust = -0.5, size = 3) +
  labs(title = 'Number of Dishes by Region', x = "", y = "") +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0) # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
We can analyze the ingredients used in Indian dishes to identify the most common ingredients. We can split the ingredients column into individual ingredients, count the frequency of each ingredient, and visualize the most common ingredients using a bar plot.
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ", ")) %>%
  unnest(Ingredients) %>%
  count(Ingredients) %>%
  arrange(desc(n)) |> slice_head(n = 10) |> gt()
| Ingredients  | n  |
|--------------|----|
| sugar        | 44 |
| ginger       | 29 |
| garam masala | 27 |
| curry leaves | 25 |
| ghee         | 25 |
| jaggery      | 19 |
| urad dal     | 17 |
| Rice flour   | 16 |
| milk         | 15 |
| tomato       | 15 |
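Note that Rice flour appears capitalized while the other ingredients are lowercase, so counts for the same ingredient could be split across case variants. One way to guard against this (an extra cleaning step, not part of the original pipeline) is to normalize case and whitespace before counting; the plots below keep the original counts:

# Lowercase and trim ingredient names so case variants are merged
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ",\\s*")) %>%
  unnest(Ingredients) %>%
  mutate(Ingredients = str_to_lower(str_trim(Ingredients))) %>%
  count(Ingredients, sort = TRUE) %>%
  slice_head(n = 10)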
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ", ")) %>%
  unnest(Ingredients) %>%
  count(Ingredients) %>%
  arrange(desc(n)) |> slice_head(n = 10) |>
  ggplot(aes(x = reorder(Ingredients, n), y = n, fill = Ingredients)) +
  geom_col() +
  labs(title = 'Most Common Ingredients in Indian Dishes', x = "", y = "") +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0) # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
# Vegetarian vs non-vegetarian distribution
food %>%
  count(diet) %>%
  ggplot(aes(x = "", y = n, fill = diet)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Diet Composition of Indian Foods") +
  theme_void()
# Using `gt` to create advanced table visualizations for summary
food %>%
  count(course, flavor_profile) %>%
  gt()
| course      | flavor_profile | n  |
|-------------|----------------|----|
| dessert     | sweet          | 85 |
| main course | -1             | 26 |
| main course | bitter         | 3  |
| main course | sour           | 1  |
| main course | spicy          | 96 |
| main course | sweet          | 3  |
| snack       | -1             | 3  |
| snack       | bitter         | 1  |
| snack       | spicy          | 35 |
| starter     | spicy          | 2  |
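The tabyl function from the janitor package (loaded at the top) presents the same counts as a wide contingency table: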
tabyl(food, course, flavor_profile) |> gt()
| course      | -1 | bitter | sour | spicy | sweet |
|-------------|----|--------|------|-------|-------|
| dessert     | 0  | 0      | 0    | 0     | 85    |
| main course | 26 | 3      | 1    | 96    | 3     |
| snack       | 3  | 1      | 0    | 35    | 0     |
| starter     | 0  | 0      | 0    | 2     | 0     |
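If proportions are more informative than raw counts, janitor's adorn helpers can convert the cross-tabulation to row percentages (a minimal sketch, not part of the original tables):

# Convert counts to row percentages with janitor's adorn helpers
tabyl(food, course, flavor_profile) |>
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting(digits = 1) |>
  gt()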
# Word cloud of ingredients
library(tidyr)
library(wordcloud2)

food %>%
  # 'ingredients' is a string of ingredients separated by commas
  mutate(Ingredients = strsplit(as.character(ingredients), ",\\s*")) %>%
  unnest(Ingredients) %>%
  count(Ingredients, name = "Frequency") %>%
  arrange(desc(Frequency)) %>%
  # Optionally keep only the top 100 most frequent ingredients
  slice_head(n = 100) %>%
  wordcloud2(size = 0.5) # Adjust the size parameter as needed