In this analysis, we use the Indian Food 101 dataset from Kaggle (Prabhavalkar 2023), which I will refer to as the subcontinent food dataset, working through the basics of data handling, analysis, and visualization with R and the Tidyverse suite. The dataset contains information about various Indian dishes, their ingredients, dietary preferences, and preparation times. We will explore the dataset to understand the distribution of dishes by course, flavor profile, and diet type. We will also analyze the preparation and cooking times of the dishes and identify the most common ingredients used in Indian cuisine.
# Load necessary libraries and read the data
library(tidyverse)
library(gt)
library(gtsummary)
library(janitor)

food <- read_csv(here::here("data", "indian_food.csv"))
After loading the data, we can inspect the structure, summary statistics, and the first few rows of the dataset. This helps us understand the variables and their types. There are 255 rows and 9 columns in the dataset.
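One quick way to do this is sketched below (glimpse comes with dplyr, which the tidyverse loads):

# Dimensions, column types, and summary statistics
dim(food)      # should report 255 rows and 9 columns
glimpse(food)  # each column's type and first few values
summary(food)  # basic summary statistics per column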
First 6 rows of the dataset:
# Display the first few rows
food |> head()
# A tibble: 6 × 9
name ingredients diet prep_time cook_time flavor_profile course state region
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 Balu… Maida flou… vege… 45 25 sweet desse… West… East
2 Boon… Gram flour… vege… 80 30 sweet desse… Raja… West
3 Gaja… Carrots, m… vege… 15 60 sweet desse… Punj… North
4 Ghev… Flour, ghe… vege… 15 30 sweet desse… Raja… West
5 Gula… Milk powde… vege… 15 40 sweet desse… West… East
6 Imar… Sugar syru… vege… 10 50 sweet desse… West… East
colnames(food)
[1] "name" "ingredients" "diet" "prep_time"
[5] "cook_time" "flavor_profile" "course" "state"
[9] "region"
It is important to understand the data types of each variable in the dataset. We can use the map_chr function from the purrr package to get the class of each variable.
food |> map_chr(class)
name ingredients diet prep_time cook_time
"character" "character" "character" "numeric" "numeric"
flavor_profile course state region
"character" "character" "character" "character"
From the output, we can see that the dataset contains a mix of character and numeric variables. One needs to use the appropriate data manipulation functions to handle each type of variable. For example, we can use bar plots to visualize the distribution of categorical variables and histograms for numeric variables. Once we have a good understanding of the data, we can proceed with exploratory data analysis (EDA) to uncover patterns and insights.
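For instance, a histogram of the numeric cook_time column might look like this (a minimal sketch; the 10-minute binwidth is an arbitrary choice):

# Histogram of cooking times; binwidth is illustrative
ggplot(food, aes(x = cook_time)) +
  geom_histogram(binwidth = 10) +
  labs(title = "Distribution of Cooking Times",
       x = "Cooking time (minutes)", y = "Count") +
  theme_minimal()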
It is important to check for missing values in the dataset before performing any analysis. Missing values can affect the results of statistical tests and machine learning models. We can use the colSums function to count the number of missing values in each column.
# Check for missing values in each column
colSums(is.na(food))
name ingredients diet prep_time cook_time
1 0 0 0 0
flavor_profile course state region
0 0 0 1
The output shows that there is one missing value in the name column and one in the region column. We need to handle these missing values before proceeding with the analysis. Let's look at the region column and consider whether we can fill in the missing value.
# Display unique values in the region column
food %>%
  select(region) %>%
  distinct() |> gt()
| region     |
|------------|
| East       |
| West       |
| North      |
| -1         |
| North East |
| South      |
| Central    |
| NA         |
There is one NA value; let's inspect that row to see whether we can fill in the missing value.
# Display the row with missing region value
food %>%
  filter(is.na(region)) |> gt()
| name     | ingredients                                                            | diet       | prep_time | cook_time | flavor_profile | course  | state         | region |
|----------|------------------------------------------------------------------------|------------|-----------|-----------|----------------|---------|---------------|--------|
| Panjeeri | Whole wheat flour, musk melon seeds, poppy seeds, edible gum, semolina | vegetarian | 10        | 25        | sweet          | dessert | Uttar Pradesh | NA     |
One can see that region is NA and state is Uttar Pradesh. We can fill in the missing value with North, as Uttar Pradesh is in North India. Here we meet another important data wrangling verb, mutate, which creates new columns or modifies existing ones. We can use the if_else function to fill the missing region value with 'North'.
# Fill missing region value with 'North'
food <- food %>%
  mutate(region = if_else(is.na(region), "North", region))

food |> filter(state == "Uttar Pradesh") |> gt()
| name            | ingredients                                                                       | diet       | prep_time | cook_time | flavor_profile | course      | state         | region |
|-----------------|-----------------------------------------------------------------------------------|------------|-----------|-----------|----------------|-------------|---------------|--------|
| Jalebi          | Maida, corn flour, baking soda, vinegar, curd, water, turmeric, saffron, cardamom | vegetarian | 10        | 50        | sweet          | dessert     | Uttar Pradesh | North  |
| Petha           | Firm white pumpkin, sugar, kitchen lime, alum powder                              | vegetarian | 10        | 30        | sweet          | dessert     | Uttar Pradesh | North  |
| Rabri           | Condensed milk, sugar, spices, nuts                                               | vegetarian | 10        | 45        | sweet          | dessert     | Uttar Pradesh | North  |
| Sohan halwa     | Corn flour, ghee, dry fruits                                                      | vegetarian | 10        | 60        | sweet          | dessert     | Uttar Pradesh | North  |
| Kachori         | Moong dal, rava, garam masala, dough, fennel seeds                                | vegetarian | 30        | 60        | spicy          | snack       | Uttar Pradesh | North  |
| Kofta           | Paneer, potato, cream, corn flour, garam masala                                   | vegetarian | 20        | 40        | spicy          | main course | Uttar Pradesh | North  |
| Lauki ke kofte  | Bottle gourd, garam masala powder, gram flour, ginger, chillies                   | vegetarian | 20        | 40        | spicy          | main course | Uttar Pradesh | North  |
| Navrattan korma | Green beans, potatoes, khus khus, low fat, garam masala powder                    | vegetarian | 25        | 40        | spicy          | main course | Uttar Pradesh | North  |
| Panjeeri        | Whole wheat flour, musk melon seeds, poppy seeds, edible gum, semolina            | vegetarian | 10        | 25        | sweet          | dessert     | Uttar Pradesh | North  |
Perfect: we have replaced the missing region value with North, as Uttar Pradesh is in North India. Now we can proceed with the analysis.
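As a quick sanity check (an extra step, not strictly necessary), we can re-run the missing-value count to confirm the fix; note that the single NA in the name column is still present:

# Re-check missing values: region should now be 0, name still has 1
colSums(is.na(food))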
We can use the count function from the dplyr package to count the number of entries in each category and arrange them in descending order. This helps us understand the distribution of dishes by course, flavor profile, and diet type.
food %>%
  count(course) %>%
  arrange(desc(n)) |> gt()
| course      | n   |
|-------------|-----|
| main course | 129 |
| dessert     | 85  |
| snack       | 39  |
| starter     | 2   |
Table 1 indicates that the majority of the dishes are classified as main course and dessert, followed by snack and starter. We can also count the number of dishes by flavor profile and diet type, as sketched below.
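The same count pattern applies to the other categorical columns (a minimal sketch):

# Count dishes by flavor profile and by diet, most frequent first
food %>% count(flavor_profile, sort = TRUE)
food %>% count(diet, sort = TRUE)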
Let's see the distribution of dishes by course and flavor profile using bar plots. We can use the ggplot2 package to create the plots. The geom_col function creates the bars, and the facet_wrap function splits the plot into panels based on a categorical variable.
food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = flavor_profile)) +
  geom_col() +
  labs(title = 'Distribution of Courses by Flavor Profile') + # facet_wrap(~flavor_profile, scales = "free") +
  coord_flip() + theme_minimal() + theme(legend.position = "none")
We often use stacked bar plots to visualize distributions across two categorical variables. One drawback of stacked bar plots is that it is difficult to compare counts across categories, because the stacked segments lack a common baseline; a dodged layout, sketched below, is one workaround.
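A dodged bar chart (shown here as a sketch, not one of the original figures) places every flavor profile side by side so each bar starts at zero:

# Dodged bars give every category a common baseline
food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = course, y = n, fill = flavor_profile)) +
  geom_col(position = "dodge") +
  coord_flip() +
  theme_minimal()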
Small multiples are a powerful visualization technique that lets us compare multiple plots side by side. We can use the facet_wrap function in ggplot2 to create small multiples based on a categorical variable. Here we create small multiples for the distribution of dishes by course and flavor profile.
food %>%
  count(course, flavor_profile) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = flavor_profile)) +
  geom_col() +
  labs(title = 'Distribution of Courses by Flavor Profile', x = "", y = "") +
  facet_wrap(~flavor_profile, scales = "free") +
  coord_flip() + theme_minimal() + theme(legend.position = "none")
Now if we compare Figure 1 and Figure 2, we can see that small multiples are more effective for comparing the distribution of dishes by course and flavor profile. We can also create small multiples for other categorical variables to gain more insights from the data; an example follows.
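For instance, the same small-multiples pattern faceted by diet (a variation not shown in the original figures):

# Small multiples of course counts, one panel per diet type
food %>%
  count(course, diet) %>%
  ggplot(aes(x = reorder(course, -n), y = n, fill = diet)) +
  geom_col() +
  facet_wrap(~diet, scales = "free") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "none")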
When there are two continuous variables in the dataset, we can use scatter plots to visualize the relationship between them. In this case, we can create a scatter plot of preparation time vs cooking time to see if there is any relationship between the two variables.
ggplot(food, aes(x = prep_time, y = cook_time, color = diet)) +
geom_point() +
labs(title = "Preparation Time vs Cooking Time") +
theme_minimal() + theme(legend.position = "top")
Figure 3 shows a scatter plot of preparation time vs cooking time for different diet types. We can see that there is no clear relationship between the two variables, and the points are scattered across the plot. One reason may be that there are some unusually large values in the dataset; one has to explore whether these values are valid or not.
We can use the slice_max function from the dplyr package to find the foods with the maximum preparation and cooking times. This helps us identify the dishes that take the longest to prepare and cook.
food %>% mutate(tot_time = prep_time + cook_time) |>
  slice_max(order_by = tot_time, n = 5) %>%
  select(name, prep_time, cook_time, tot_time) |> gt()
| name        | prep_time | cook_time | tot_time |
|-------------|-----------|-----------|----------|
| Shrikhand   | 10        | 720       | 730      |
| Pindi chana | 500       | 120       | 620      |
| Puttu       | 495       | 40        | 535      |
| Misti doi   | 480       | 30        | 510      |
| Dosa        | 360       | 90        | 450      |
| Idli        | 360       | 90        | 450      |
| Masala Dosa | 360       | 90        | 450      |
Table 2 shows the foods with the longest total times. We can see that Pindi chana has the maximum preparation time of 500 minutes, while Shrikhand has the maximum cooking time of 720 minutes. We calculated the total time for each dish by adding the preparation and cooking times; although we requested the top 5, slice_max keeps ties, so seven dishes appear.
Let's filter the data to dishes with tot_time less than 200 minutes and visualize them with a scatter plot.
food %>% mutate(tot_time = prep_time + cook_time) |>
  filter(tot_time < 200) |>
  ggplot(aes(x = prep_time, y = cook_time, color = diet)) +
  geom_point() +
  labs(title = "Preparation Time vs Cooking Time (Total Time < 200 minutes)") +
  theme_minimal() + theme(legend.position = "top")
We can use the group_by and summarize functions from the dplyr package to group the data by state and region and calculate the total number of dishes in each group. This helps us understand the distribution of dishes by state and region.
food %>%
  group_by(state) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |> gt()
| state          | n  |
|----------------|----|
| Gujarat        | 35 |
| Punjab         | 32 |
| Maharashtra    | 30 |
| -1             | 24 |
| West Bengal    | 24 |
| Assam          | 21 |
| Tamil Nadu     | 20 |
| Andhra Pradesh | 10 |
| Uttar Pradesh  | 9  |
| Kerala         | 8  |
Table 3 shows the number of dishes by state. We can see that Gujarat has the highest number of dishes, followed by Punjab and Maharashtra. The -1 entry appears to be a placeholder for dishes with no recorded state. We can also group the data by region and calculate the total number of dishes in each region.
food %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |> gt()
| region     | n  |
|------------|----|
| West       | 74 |
| South      | 59 |
| North      | 50 |
| East       | 31 |
| North East | 25 |
| -1         | 13 |
| Central    | 3  |
Table 4 shows the number of dishes by region. We can see that the West region has the highest number of dishes, followed by the South and North regions.
We can visualize the distribution of dishes by state and region using bar plots, created with the geom_col function in ggplot2.
food %>%
  group_by(state) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |>
  ggplot(aes(x = reorder(state, n), y = n, fill = state)) +
  geom_col() + # geom_text(aes(label = n), vjust = -0.5, size = 3) +
  labs(title = 'Number of Dishes by State', x = "", y = "") +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0) # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
food %>%
  group_by(region) %>%
  summarize(n = n()) %>%
  arrange(desc(n)) |> slice_head(n = 10) |>
  ggplot(aes(x = reorder(region, n), y = n, fill = region)) +
  geom_col() + # geom_text(aes(label = n), vjust = -0.5, size = 3) +
  labs(title = 'Number of Dishes by Region', x = "", y = "") +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0) # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
We can analyze the ingredients used in Indian dishes to identify the most common ingredients. We can split the ingredients column into individual ingredients, count the frequency of each ingredient, and visualize the most common ingredients using a bar plot.
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ", ")) %>%
  unnest(Ingredients) %>%
  count(Ingredients) %>%
  arrange(desc(n)) |> slice_head(n = 10) |> gt()
| Ingredients  | n  |
|--------------|----|
| sugar        | 44 |
| ginger       | 29 |
| garam masala | 27 |
| curry leaves | 25 |
| ghee         | 25 |
| jaggery      | 19 |
| urad dal     | 17 |
| Rice flour   | 16 |
| milk         | 15 |
| tomato       | 15 |
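Note that Rice flour appears capitalized while the other ingredients are lowercase, so counts for the same ingredient could be split across case variants. One way to guard against this (an extra cleaning step, not part of the original pipeline) is to normalize case and whitespace before counting; the plots below keep the original counts:

# Lowercase and trim ingredient names so case variants are merged
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ",\\s*")) %>%
  unnest(Ingredients) %>%
  mutate(Ingredients = str_to_lower(str_trim(Ingredients))) %>%
  count(Ingredients, sort = TRUE) %>%
  slice_head(n = 10)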
food %>%
  mutate(Ingredients = strsplit(as.character(ingredients), ", ")) %>%
  unnest(Ingredients) %>%
  count(Ingredients) %>%
  arrange(desc(n)) |> slice_head(n = 10) |>
  ggplot(aes(x = reorder(Ingredients, n), y = n, fill = Ingredients)) +
  geom_col() +
  labs(title = 'Most Common Ingredients in Indian Dishes', x = "", y = "") +
  coord_flip() +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    axis.text.x = element_text(hjust = 0) # Adjust text justification if necessary
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))
# Vegetarian vs non-vegetarian distribution
food %>%
  count(diet) %>%
  ggplot(aes(x = "", y = n, fill = diet)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  labs(title = "Diet Composition of Indian Foods") +
  theme_void()
# Using `gt` to create advanced table visualizations for summary
food %>%
  count(course, flavor_profile) %>%
  gt()
| course      | flavor_profile | n  |
|-------------|----------------|----|
| dessert     | sweet          | 85 |
| main course | -1             | 26 |
| main course | bitter         | 3  |
| main course | sour           | 1  |
| main course | spicy          | 96 |
| main course | sweet          | 3  |
| snack       | -1             | 3  |
| snack       | bitter         | 1  |
| snack       | spicy          | 35 |
| starter     | spicy          | 2  |
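The tabyl function from the janitor package (loaded at the top) presents the same counts as a wide contingency table: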
tabyl(food, course, flavor_profile) |> gt()
| course      | -1 | bitter | sour | spicy | sweet |
|-------------|----|--------|------|-------|-------|
| dessert     | 0  | 0      | 0    | 0     | 85    |
| main course | 26 | 3      | 1    | 96    | 3     |
| snack       | 3  | 1      | 0    | 35    | 0     |
| starter     | 0  | 0      | 0    | 2     | 0     |
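If proportions are more informative than raw counts, janitor's adorn helpers can convert the cross-tabulation to row percentages (a minimal sketch, not part of the original tables):

# Convert counts to row percentages with janitor's adorn helpers
tabyl(food, course, flavor_profile) |>
  adorn_percentages(denominator = "row") |>
  adorn_pct_formatting(digits = 1) |>
  gt()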
# Word cloud of ingredients
library(tidyr)
library(wordcloud2)

food %>%
  # 'ingredients' is a string of ingredients separated by commas
  mutate(Ingredients = strsplit(as.character(ingredients), ",\\s*")) %>%
  unnest(Ingredients) %>%
  count(Ingredients, name = "Frequency") %>%
  arrange(desc(Frequency)) %>%
  # Optionally keep only the top 100 most frequent ingredients
  slice_head(n = 100) %>%
  wordcloud2(size = 0.5) # Adjust the size parameter as needed