Toshakhana Data Analysis

Learning Data Cleaning and Analysis with Messy Real-World Data

Author

Zahid Asghar

Published

October 1, 2025

Caution

Educational Disclaimer: This analysis demonstrates data cleaning techniques using public records. Gift exchanges are normal diplomatic protocol. All retentions mentioned are per data available. This tutorial focuses on analytical methods, not policy evaluation.

Toshakhana

1 Introduction: Why This Dataset?

This tutorial uses Pakistan’s Toshakhana (official gift repository) records to teach data analysis because it has every problem you’ll face with real-world data:

  • Inconsistent naming (same person, 6 spellings)
  • Multiple values in single cells
  • Unstructured text mixing names and titles
  • People changing roles over time
  • Missing context requiring research

Learning Goals: By the end, you’ll know how to systematically clean messy data, handle temporal complexity, verify results, and communicate findings with appropriate context.


2 The Raw Data: Not Machine-Ready

Let’s see what we’re working with:

Show code
tk_raw <- read_excel("D:/RepTemplates/data_analytics/data/tosha_khana_uzain_younas.xlsx")
First 10 rows—notice the problems
S. No. Detail of Gifts Name of Recipient Date Assessed Value Retention cost Remarks
1 Hand cut crystal glass President of Pakistan, Dr. Arif Alvi 2023-01-02 NA NA In process
2 One Wool Carpet Prime Minister of Pakistan, Mian Muhammad Shehbaz Sharif 2023-01-04 NA NA In process
3 One Ceramic set Minister for Finance and Revenue 2023-01-06 NA NA In process
4 One Vase Minister for Finance and Revenue 2023-01-06 NA NA In process
5 One Ear Ring One Locket One Ring Lt. Gen. (R) Nigar Johar, Ex-Surgeon General Medical Services 2023-01-06 NA NA In process
6 One W/Watch (Rolex) Agha Hassan Baloch, Minister for Science and Technology 2023-01-10 NA NA In process
7 One W/Watch Rolex Minister for Defence, Khawaja Muhammad Asif 2023-01-17 NA NA In process (Recipient has desired to display the gift at PM House)
8 One W/Watch Rolex Minister for Commerce, Syed Naveed Qamar 2023-01-17 NA NA In process
9 One vase President of Pakistan, Dr. Arif Alvi 2023-01-18 NA NA In process
10 One W/Watch Rolex Minister for Board of Investment, Ch. Salik Hussain 2023-01-18 NA NA In process

Problems visible immediately:

  1. Column names have spaces: “Name of Recipient” (not machine-friendly)
  2. Multiple values in cells: "500,000/- 300,000/-" (two gifts, one cell)
  3. Mixed text: titles + names + positions all jumbled
  4. Inconsistent formatting throughout

2.1 Problem #1: Name Chaos

Same person appears multiple ways:

Same Person, Six Different Spellings
How It Appears Problem
Gen Pervaiz Musharraf, President of Pakistan Short
Mr. Pervaiz Musharraf President of Pakistan Alt spelling
President of Pakistan, General Pervaiz Musharraf Another spelling
First Lady of Pakistan, wife of President Pervaiz Musharraf Full v1
Gen Parvez Musharaf, President of Pakistan Full v2
First lady, wife of Gen Parvez Musharaf, President of Pakistan With title

Impact: Without fixing, we’d count 1 person as 6 different people.

2.2 Problem #2: Multiple Values Per Cell

Multiple Gifts Crammed in One Cell
Recipient Values (Multiple!) Date
Dr. Arif Alvi, President of Pakistan Rs.20,000/- 2022-01-03
First lady of Pakistan Rs.16,000/- 2022-01-03
Dr. Arif Alvi, President of Pakistan a. (i) Rs.16,000/- (ii) Rs.14,000/- b. Rs.10,000/- c. Rs.20,000/- 2022-01-03
NA Total: - Rs.60,000/- NA
Mr. Sohail Mehmood, Foreign Secretary Rs.15,000/- 2022-01-05

"500,000/- 300,000/-" = TWO separate gifts worth Rs. 800K total, not one gift.


3 The Cleaning Pipeline

3.1 Step 1: Basic Hygiene

Show code
tk_clean <- tk_raw |> 
  clean_names() |>              # Fix column names
  select(-c(x8, x9, x10)) |>    # Remove empty columns
  drop_na(assessed_value) |>    # Drop missing values
  mutate(name_of_recipient = str_squish(name_of_recipient))  # Remove extra spaces

Result: 4387 rows with clean column names

3.2 Step 2: Unnest Multiple Values

Show code
tk_processed <- tk_clean |> 
  mutate(assessed_value = str_split(assessed_value, "/-")) |>  # Split on delimiter
  unnest(assessed_value) |>                                     # One row per value
  mutate(
    assessed_value_numeric = str_extract(assessed_value, "\\d+,?\\d+,?\\d+"),
    assessed_value_numeric = parse_number(assessed_value_numeric)
  )

Result: 9518 rows (expanded from 4387)

3.3 Step 3: Filter Analysis Scope

Show code
tk_processed <- tk_processed |> 
  filter(
    !is.na(assessed_value_numeric),
    assessed_value_numeric > 100000,                    # Focus on substantial gifts
    remarks == "Amount deposited, gift retained"        # Only retained items
  ) |> 
  mutate(
    retention_numeric = parse_number(str_remove_all(retention_cost, ",")),
    year_gift = year(date)
  )

Result: 393 gifts >Rs. 100K that were retained


4 The Categorization Challenge

4.1 Define Pattern Lists

Show code
pm_patterns <- c("Imran Khan", "Nawaz Sharif", "Mian Muhammad Nawaz Sharif",
  "Shehbaz Sharif", "Shahbaz Sharif", "Shaukat Aziz",
  "Shahid Khaqan Abbasi", "Khaqan Abbasi", "Yousaf Raza Gillani", 
  "Raja Pervez Ashraf", "Pervez Ashraf")

president_patterns <- c("Arif Alvi", "Dr. Arif Alvi", "Asif Ali Zardari",
  "Mamnoon Hussain", "Pervez Musharraf", "Parvez Musharaf", "Musharaf")

key_ministers <- c("Khawaja Muhammad Asif", "Ahsan Iqbal", 
  "Shah Mehmood Qureshi", "Ishaq Dar", "Sheikh Rashid Ahmad")

4.2 Apply Smart Categorization

Show code
tk_final <- tk_processed |> 
  mutate(
    recipient_name = case_when(
      str_detect(name_of_recipient, regex(paste0(pm_patterns, collapse = "|"), ignore_case = TRUE)) ~
        str_extract(name_of_recipient, regex(paste0(pm_patterns, collapse = "|"), ignore_case = TRUE)),
      str_detect(name_of_recipient, regex(paste0(president_patterns, collapse = "|"), ignore_case = TRUE)) ~
        str_extract(name_of_recipient, regex(paste0(president_patterns, collapse = "|"), ignore_case = TRUE)),
      TRUE ~ NA_character_
    ),
    recipient_name = case_when(
      str_detect(recipient_name, regex("Shehbaz|Shahbaz", ignore_case = TRUE)) ~ "Shehbaz Sharif",
      str_detect(recipient_name, regex("Khaqan", ignore_case = TRUE)) ~ "Shahid Khaqan Abbasi",
      str_detect(recipient_name, regex("Musharraf|Musharaf", ignore_case = TRUE)) ~ "Pervez Musharraf",
      TRUE ~ recipient_name
    ),
    # TEMPORAL LOGIC: Check role at time of gift
    category = case_when(
      str_detect(recipient_name, "Shahid Khaqan Abbasi") & year_gift >= 2017 ~ "Prime Minister",
      str_detect(recipient_name, "Shahid Khaqan Abbasi") & year_gift < 2017 ~ "Minister",
      str_detect(name_of_recipient, regex(paste0(pm_patterns, collapse = "|"), ignore_case = TRUE)) ~ "Prime Minister",
      str_detect(name_of_recipient, regex(paste0(president_patterns, collapse = "|"), ignore_case = TRUE)) ~ "President",
      # Check specific before general
      str_detect(name_of_recipient, regex("Ambassador|High Commissioner", ignore_case = TRUE)) ~ "Ambassador",
      str_detect(name_of_recipient, regex("\\bSecretary\\b", ignore_case = TRUE)) ~ "Secretary",
      str_detect(name_of_recipient, regex("Consul General", ignore_case = TRUE)) ~ "Consul General",
      str_detect(name_of_recipient, regex("Minister", ignore_case = TRUE)) ~ "Minister",
      TRUE ~ "Other"
    ),
    full_name_extract = if_else(category %in% c("Ambassador", "Secretary", "Minister"), 
                                 str_extract(name_of_recipient, "^[^,]+"), recipient_name)
  )

Key insight: The temporal split for one individual who was Minister (2015-16) then PM (2017-18) prevents miscategorization.

Same Person, Different Roles Over Time
category assessed_value_numeric date
Minister 850,000 4/6/2015
Minister 582,000 4/6/2015
Minister 750,000 2/15/2016
Prime Minister 290,000 3/9/2017
Prime Minister 325,000 4/5/2017
Prime Minister 15,000,000 10/17/2017
Prime Minister 1,200,000 10/17/2017
Prime Minister 1,500,000 10/17/2017

5 Results: What the Data Shows

5.1 Overview: Rs. 807 Million Across Categories

Show code
category_summary |> filter(category != "Other") |> 
  ggplot(aes(x = fct_reorder(category, total_value), y = total_value, fill = category)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = comma(total_value, prefix = "Rs. ")), hjust = -0.1, size = 3.5, fontface = "bold") +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.15))) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Total Value of Retained Toshakhana Gifts",
       subtitle = "Officials paid to retain these gifts per Toshakhana rules (2005-2023)",
       x = NULL, y = "Total Value (PKR)") +
  theme(plot.title = element_text(face = "bold", size = 14), panel.grid.major.y = element_blank())
Figure 1: Diplomatic roles drive gift frequency
Table 1
Summary Statistics
Category Gifts Total (Rs.) Average (Rs.)
Prime Minister 114 493,874,759 4,332,235
Minister 92 123,307,975 1,340,304
President 41 78,932,595 1,925,185
Secretary 36 28,572,697 793,686
Ambassador 13 16,610,000 1,277,692
Consul General 2 3,010,000 1,505,000

Interpretation: Prime Ministers (Rs. 494M, 114 gifts) lead due to frequent diplomatic engagements. Presidents have longer tenures but fewer gifts (more ceremonial role).

5.2 Prime Ministers: Temporal Logic in Action

Show code
pm_summary |> 
  mutate(label = paste0("Rs. ", comma(total_value), "\n", n_gifts, " gifts | ",
                        format(first_gift, "%Y"), "-", format(last_gift, "%Y"))) |> 
  ggplot(aes(x = fct_reorder(recipient_name, total_value), y = total_value)) +
  geom_col(fill = "steelblue", alpha = 0.8) +
  geom_text(aes(label = label), hjust = -0.1, size = 2.8, lineheight = 0.9) +
  coord_flip() +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.2))) +
  labs(title = "Prime Ministers: Gift Values by Tenure",
       subtitle = "One individual appears with 2017-18 gifts (PM tenure); their 2015-16 gifts in Ministers category",
       x = NULL, y = "Total Value (PKR)") +
  theme(plot.title = element_text(face = "bold"), panel.grid.major.y = element_blank())
Figure 2: Tenure length and diplomatic calendar create variation
Table 2
Context: Normalizing by Time
Name Gifts Total (Rs.) Per Month (Rs.) Period
Shahid Khaqan Abbasi 26 230,370,000 198 Mar 2017 - Apr 2018
Nawaz Sharif 28 127,527,319 34 Nov 2013 - May 2017
Imran Khan 18 109,026,100 33 Sep 2018 - Nov 2021
Shaukat Aziz 28 17,951,720 5 Jun 2004 - Sep 2007
Yousaf Raza Gillani 7 6,018,620 3 Dec 2009 - Oct 2011
Raja Pervez Ashraf 4 2,325,000 Inf Nov 2012 - Nov 2012
Shehbaz Sharif 3 656,000 Inf May 2022 - May 2022

Key finding: Highest total (Rs. 230M over 18 months) = ~Rs. 13M/month .

5.3 Bureaucrats: Recovered Categories

Table 3
Top 5 in Each Bureaucratic Category
Role Name Gifts Total (Rs.)
Minister Khawaja Muhammad Asif 12 58,343,788
Minister Brig. Waseem Iftikhar Cheema 7 6,445,000
Minister Senator Muhammad Ishaq Dar 4 6,350,000
Minister SP Rana Shoaib Mehmood 3 6,150,000
Minister Begum Prime Minister of Pakistan 4 5,859,000
Secretary Mr. Fawad Hasan Fawad 6 10,985,000
Secretary Sahebzada Ahmed Khan 6 5,282,750
Secretary Brig. Akmal Aziz 2 3,350,000
Secretary Brig Waseem Iftikhar Cheema 2 1,925,000
Secretary Sahebzada A. Khan 1 1,300,000
Ambassador Khan Hasham bin Saddique 4 5,910,000
Ambassador Mr. Khan Hasham bin Saddique 1 3,550,000
Ambassador Vice Admiral (R) Khan Hasham bin Saddique 2 2,450,000
Ambassador Mr. Manzoor ul Haq 1 1,500,000
Ambassador Vice Admiral (R) Khan Hasham Bin Saddique 1 980,000

Impact of fixing categorization order: 49 bureaucrats (13 Ambassadors, 36 Secretaries) correctly identified instead of miscategorized as “Ministers.”


6 Three Critical Decisions

6.1 1. Temporal Logic (The “Time Traveler” Solution)

Problem: One person held two roles at different times. Total: Rs. 231M.

Solution: Split by date—2015-16 gifts (Rs. 1.3M) → Ministers; 2017-18 gifts (Rs. 230M) → Prime Ministers

Why: Different roles = different protocols. Mixing creates false comparisons.

6.2 2. Name Standardization (47 Variations)

Built comprehensive pattern lists checking Wikipedia, news archives, government records. Result: Same person appearing 6 ways now correctly counted as one.

6.3 3. Categorization Order

Bug: Checked “Minister” before “Ambassador” → Ambassadors hidden
Fix: Check specific (Ambassador, Secretary) before general (Minister)


7 Limitations (Be Honest)

This analysis does NOT include:

  • Gifts fully deposited (not retained)
  • Gifts <Rs. 100K
  • Complete pre-2010 records
  • Inflation adjustments

This does NOT show:

  • Impropriety (all legal under Toshakhana rules)
  • Personal preferences (reflects protocols)
  • Policy violations (followed required payments)

Context matters: Longer tenure = more gifts; diplomatic calendars vary; gift-giving cultures differ by country.


8 Learning Takeaways

8.1 1. Real Data is Messy

Expect: inconsistent names, multiple values per cell, temporal complexity. Your job: systematically clean while documenting choices.

8.2 2. Verify Everything

Built into this analysis: row counts at each stage, spot-checks, duplicate detection, temporal verification, category cross-checks.

8.3 3. Context Changes Meaning

Rs. 230M shocking? Context: 26 gifts / 18 months / high diplomatic activity / Rs. 102M paid to retain / all legal.

8.4 4. Order Matters

Check specific patterns before general ones. Lesson cost: 49 miscategorized bureaucrats initially.


9 Conclusion

Transformed: 500+ messy rows → 393 verified gifts → 7 categories → Rs. 807M total

Key techniques: Systematic cleaning • Temporal logic • Name standardization • Verification systems • Honest limitations

What this teaches: Real data analysis = making defensible decisions with imperfect information while being transparent.


Resources

Code: Click “Show code” buttons throughout
Questions: zahid.asghar@gmail.com
Purpose: Educational demonstration of data cleaning techniques

This analysis demonstrates handling messy real-world data systematically.


Tutorial: 2025-10-01 | Tools: R, tidyverse, Quarto