Show code
<- read_excel("D:/RepTemplates/data_analytics/data/tosha_khana_uzain_younas.xlsx") tk_raw
Learning Data Cleaning and Analysis with Messy Real-World Data
Educational Disclaimer: This analysis demonstrates data cleaning techniques using public records. Gift exchanges are normal diplomatic protocol. All retentions mentioned are per data available. This tutorial focuses on analytical methods, not policy evaluation.
This tutorial uses Pakistan’s Toshakhana (official gift repository) records to teach data analysis because it has every problem you’ll face with real-world data:
Learning Goals: By the end, you’ll know how to systematically clean messy data, handle temporal complexity, verify results, and communicate findings with appropriate context.
Let’s see what we’re working with:
<- read_excel("D:/RepTemplates/data_analytics/data/tosha_khana_uzain_younas.xlsx") tk_raw
S. No. | Detail of Gifts | Name of Recipient | Date | Assessed Value | Retention cost | Remarks |
---|---|---|---|---|---|---|
1 | Hand cut crystal glass | President of Pakistan, Dr. Arif Alvi | 2023-01-02 | NA | NA | In process |
2 | One Wool Carpet | Prime Minister of Pakistan, Mian Muhammad Shehbaz Sharif | 2023-01-04 | NA | NA | In process |
3 | One Ceramic set | Minister for Finance and Revenue | 2023-01-06 | NA | NA | In process |
4 | One Vase | Minister for Finance and Revenue | 2023-01-06 | NA | NA | In process |
5 | One Ear Ring One Locket One Ring | Lt. Gen. (R) Nigar Johar, Ex-Surgeon General Medical Services | 2023-01-06 | NA | NA | In process |
6 | One W/Watch (Rolex) | Agha Hassan Baloch, Minister for Science and Technology | 2023-01-10 | NA | NA | In process |
7 | One W/Watch Rolex | Minister for Defence, Khawaja Muhammad Asif | 2023-01-17 | NA | NA | In process (Recipient has desired to display the gift at PM House) |
8 | One W/Watch Rolex | Minister for Commerce, Syed Naveed Qamar | 2023-01-17 | NA | NA | In process |
9 | One vase | President of Pakistan, Dr. Arif Alvi | 2023-01-18 | NA | NA | In process |
10 | One W/Watch Rolex | Minister for Board of Investment, Ch. Salik Hussain | 2023-01-18 | NA | NA | In process |
Problems visible immediately:
"500,000/- 300,000/-"
(two gifts, one cell)Same person appears multiple ways:
Same Person, Six Different Spellings | |
---|---|
How It Appears | Problem |
Gen Pervaiz Musharraf, President of Pakistan | Short |
Mr. Pervaiz Musharraf President of Pakistan | Alt spelling |
President of Pakistan, General Pervaiz Musharraf | Another spelling |
First Lady of Pakistan, wife of President Pervaiz Musharraf | Full v1 |
Gen Parvez Musharaf, President of Pakistan | Full v2 |
First lady, wife of Gen Parvez Musharaf, President of Pakistan | With title |
Impact: Without fixing, we’d count 1 person as 6 different people.
Multiple Gifts Crammed in One Cell | ||
---|---|---|
Recipient | Values (Multiple!) | Date |
Dr. Arif Alvi, President of Pakistan | Rs.20,000/- | 2022-01-03 |
First lady of Pakistan | Rs.16,000/- | 2022-01-03 |
Dr. Arif Alvi, President of Pakistan | a. (i) Rs.16,000/- (ii) Rs.14,000/- b. Rs.10,000/- c. Rs.20,000/- | 2022-01-03 |
NA | Total: - Rs.60,000/- | NA |
Mr. Sohail Mehmood, Foreign Secretary | Rs.15,000/- | 2022-01-05 |
"500,000/- 300,000/-"
= TWO separate gifts worth Rs. 800K total, not one gift.
<- tk_raw |>
tk_clean clean_names() |> # Fix column names
select(-c(x8, x9, x10)) |> # Remove empty columns
drop_na(assessed_value) |> # Drop missing values
mutate(name_of_recipient = str_squish(name_of_recipient)) # Remove extra spaces
Result: 4387 rows with clean column names
<- tk_clean |>
tk_processed mutate(assessed_value = str_split(assessed_value, "/-")) |> # Split on delimiter
unnest(assessed_value) |> # One row per value
mutate(
assessed_value_numeric = str_extract(assessed_value, "\\d+,?\\d+,?\\d+"),
assessed_value_numeric = parse_number(assessed_value_numeric)
)
Result: 9518 rows (expanded from 4387)
<- tk_processed |>
tk_processed filter(
!is.na(assessed_value_numeric),
> 100000, # Focus on substantial gifts
assessed_value_numeric == "Amount deposited, gift retained" # Only retained items
remarks |>
) mutate(
retention_numeric = parse_number(str_remove_all(retention_cost, ",")),
year_gift = year(date)
)
Result: 393 gifts >Rs. 100K that were retained
<- c("Imran Khan", "Nawaz Sharif", "Mian Muhammad Nawaz Sharif",
pm_patterns "Shehbaz Sharif", "Shahbaz Sharif", "Shaukat Aziz",
"Shahid Khaqan Abbasi", "Khaqan Abbasi", "Yousaf Raza Gillani",
"Raja Pervez Ashraf", "Pervez Ashraf")
<- c("Arif Alvi", "Dr. Arif Alvi", "Asif Ali Zardari",
president_patterns "Mamnoon Hussain", "Pervez Musharraf", "Parvez Musharaf", "Musharaf")
<- c("Khawaja Muhammad Asif", "Ahsan Iqbal",
key_ministers "Shah Mehmood Qureshi", "Ishaq Dar", "Sheikh Rashid Ahmad")
<- tk_processed |>
tk_final mutate(
recipient_name = case_when(
str_detect(name_of_recipient, regex(paste0(pm_patterns, collapse = "|"), ignore_case = TRUE)) ~
str_extract(name_of_recipient, regex(paste0(pm_patterns, collapse = "|"), ignore_case = TRUE)),
str_detect(name_of_recipient, regex(paste0(president_patterns, collapse = "|"), ignore_case = TRUE)) ~
str_extract(name_of_recipient, regex(paste0(president_patterns, collapse = "|"), ignore_case = TRUE)),
TRUE ~ NA_character_
),recipient_name = case_when(
str_detect(recipient_name, regex("Shehbaz|Shahbaz", ignore_case = TRUE)) ~ "Shehbaz Sharif",
str_detect(recipient_name, regex("Khaqan", ignore_case = TRUE)) ~ "Shahid Khaqan Abbasi",
str_detect(recipient_name, regex("Musharraf|Musharaf", ignore_case = TRUE)) ~ "Pervez Musharraf",
TRUE ~ recipient_name
),# TEMPORAL LOGIC: Check role at time of gift
category = case_when(
str_detect(recipient_name, "Shahid Khaqan Abbasi") & year_gift >= 2017 ~ "Prime Minister",
str_detect(recipient_name, "Shahid Khaqan Abbasi") & year_gift < 2017 ~ "Minister",
str_detect(name_of_recipient, regex(paste0(pm_patterns, collapse = "|"), ignore_case = TRUE)) ~ "Prime Minister",
str_detect(name_of_recipient, regex(paste0(president_patterns, collapse = "|"), ignore_case = TRUE)) ~ "President",
# Check specific before general
str_detect(name_of_recipient, regex("Ambassador|High Commissioner", ignore_case = TRUE)) ~ "Ambassador",
str_detect(name_of_recipient, regex("\\bSecretary\\b", ignore_case = TRUE)) ~ "Secretary",
str_detect(name_of_recipient, regex("Consul General", ignore_case = TRUE)) ~ "Consul General",
str_detect(name_of_recipient, regex("Minister", ignore_case = TRUE)) ~ "Minister",
TRUE ~ "Other"
),full_name_extract = if_else(category %in% c("Ambassador", "Secretary", "Minister"),
str_extract(name_of_recipient, "^[^,]+"), recipient_name)
)
Key insight: The temporal split for one individual who was Minister (2015-16) then PM (2017-18) prevents miscategorization.
Same Person, Different Roles Over Time | ||
---|---|---|
category | assessed_value_numeric | date |
Minister | 850,000 | 4/6/2015 |
Minister | 582,000 | 4/6/2015 |
Minister | 750,000 | 2/15/2016 |
Prime Minister | 290,000 | 3/9/2017 |
Prime Minister | 325,000 | 4/5/2017 |
Prime Minister | 15,000,000 | 10/17/2017 |
Prime Minister | 1,200,000 | 10/17/2017 |
Prime Minister | 1,500,000 | 10/17/2017 |
|> filter(category != "Other") |>
category_summary ggplot(aes(x = fct_reorder(category, total_value), y = total_value, fill = category)) +
geom_col(show.legend = FALSE, alpha = 0.85) +
geom_text(aes(label = comma(total_value, prefix = "Rs. ")), hjust = -0.1, size = 3.5, fontface = "bold") +
coord_flip() +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Total Value of Retained Toshakhana Gifts",
subtitle = "Officials paid to retain these gifts per Toshakhana rules (2005-2023)",
x = NULL, y = "Total Value (PKR)") +
theme(plot.title = element_text(face = "bold", size = 14), panel.grid.major.y = element_blank())
Summary Statistics | |||
---|---|---|---|
Category | Gifts | Total (Rs.) | Average (Rs.) |
Prime Minister | 114 | 493,874,759 | 4,332,235 |
Minister | 92 | 123,307,975 | 1,340,304 |
President | 41 | 78,932,595 | 1,925,185 |
Secretary | 36 | 28,572,697 | 793,686 |
Ambassador | 13 | 16,610,000 | 1,277,692 |
Consul General | 2 | 3,010,000 | 1,505,000 |
Interpretation: Prime Ministers (Rs. 494M, 114 gifts) lead due to frequent diplomatic engagements. Presidents have longer tenures but fewer gifts (more ceremonial role).
|>
pm_summary mutate(label = paste0("Rs. ", comma(total_value), "\n", n_gifts, " gifts | ",
format(first_gift, "%Y"), "-", format(last_gift, "%Y"))) |>
ggplot(aes(x = fct_reorder(recipient_name, total_value), y = total_value)) +
geom_col(fill = "steelblue", alpha = 0.8) +
geom_text(aes(label = label), hjust = -0.1, size = 2.8, lineheight = 0.9) +
coord_flip() +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.2))) +
labs(title = "Prime Ministers: Gift Values by Tenure",
subtitle = "One individual appears with 2017-18 gifts (PM tenure); their 2015-16 gifts in Ministers category",
x = NULL, y = "Total Value (PKR)") +
theme(plot.title = element_text(face = "bold"), panel.grid.major.y = element_blank())
Context: Normalizing by Time | ||||
---|---|---|---|---|
Name | Gifts | Total (Rs.) | Per Month (Rs.) | Period |
Shahid Khaqan Abbasi | 26 | 230,370,000 | 198 | Mar 2017 - Apr 2018 |
Nawaz Sharif | 28 | 127,527,319 | 34 | Nov 2013 - May 2017 |
Imran Khan | 18 | 109,026,100 | 33 | Sep 2018 - Nov 2021 |
Shaukat Aziz | 28 | 17,951,720 | 5 | Jun 2004 - Sep 2007 |
Yousaf Raza Gillani | 7 | 6,018,620 | 3 | Dec 2009 - Oct 2011 |
Raja Pervez Ashraf | 4 | 2,325,000 | Inf | Nov 2012 - Nov 2012 |
Shehbaz Sharif | 3 | 656,000 | Inf | May 2022 - May 2022 |
Key finding: Highest total (Rs. 230M over 18 months) = ~Rs. 13M/month .
Top 5 in Each Bureaucratic Category | |||
---|---|---|---|
Role | Name | Gifts | Total (Rs.) |
Minister | Khawaja Muhammad Asif | 12 | 58,343,788 |
Minister | Brig. Waseem Iftikhar Cheema | 7 | 6,445,000 |
Minister | Senator Muhammad Ishaq Dar | 4 | 6,350,000 |
Minister | SP Rana Shoaib Mehmood | 3 | 6,150,000 |
Minister | Begum Prime Minister of Pakistan | 4 | 5,859,000 |
Secretary | Mr. Fawad Hasan Fawad | 6 | 10,985,000 |
Secretary | Sahebzada Ahmed Khan | 6 | 5,282,750 |
Secretary | Brig. Akmal Aziz | 2 | 3,350,000 |
Secretary | Brig Waseem Iftikhar Cheema | 2 | 1,925,000 |
Secretary | Sahebzada A. Khan | 1 | 1,300,000 |
Ambassador | Khan Hasham bin Saddique | 4 | 5,910,000 |
Ambassador | Mr. Khan Hasham bin Saddique | 1 | 3,550,000 |
Ambassador | Vice Admiral (R) Khan Hasham bin Saddique | 2 | 2,450,000 |
Ambassador | Mr. Manzoor ul Haq | 1 | 1,500,000 |
Ambassador | Vice Admiral (R) Khan Hasham Bin Saddique | 1 | 980,000 |
Impact of fixing categorization order: 49 bureaucrats (13 Ambassadors, 36 Secretaries) correctly identified instead of miscategorized as “Ministers.”
Problem: One person held two roles at different times. Total: Rs. 231M.
Solution: Split by date—2015-16 gifts (Rs. 1.3M) → Ministers; 2017-18 gifts (Rs. 230M) → Prime Ministers
Why: Different roles = different protocols. Mixing creates false comparisons.
Built comprehensive pattern lists checking Wikipedia, news archives, government records. Result: Same person appearing 6 ways now correctly counted as one.
Bug: Checked “Minister” before “Ambassador” → Ambassadors hidden
Fix: Check specific (Ambassador, Secretary) before general (Minister)
This analysis does NOT include:
This does NOT show:
Context matters: Longer tenure = more gifts; diplomatic calendars vary; gift-giving cultures differ by country.
Expect: inconsistent names, multiple values per cell, temporal complexity. Your job: systematically clean while documenting choices.
Built into this analysis: row counts at each stage, spot-checks, duplicate detection, temporal verification, category cross-checks.
Rs. 230M shocking? Context: 26 gifts / 18 months / high diplomatic activity / Rs. 102M paid to retain / all legal.
Check specific patterns before general ones. Lesson cost: 49 miscategorized bureaucrats initially.
Transformed: 500+ messy rows → 393 verified gifts → 7 categories → Rs. 807M total
Key techniques: Systematic cleaning • Temporal logic • Name standardization • Verification systems • Honest limitations
What this teaches: Real data analysis = making defensible decisions with imperfect information while being transparent.
Tutorial: 2025-10-01 | Tools: R, tidyverse, Quarto