Applied Econometrics Using R, Stata, and Python

Organiser: Pakistan Economic Forum

Author
Affiliation

Zahid Asghar

Quaid-i-Azam University

Published

February 10, 2025

This workshop on applied econometrics is reused by Prof. Zahid Asghar with permission from Prof. Hans H. Sievertsen, University of Bristol, for the R code; the Stata and Python code will be created by Zahid Asghar himself.

1 Research question & data

1.1 The research question

Our goal is to answer the following fictitious research question:

Does attending a summer school improve test scores?

The research question will be addressed using a fictitious simulated dataset.

1.2 The setting

The research question is inspired by papers such as Matsudaira (2007) and the survey on interventions for low-SES students by Dietrichson et al. (2017).

The fictitious setting is as follows:

  • In the summer break between year 5 and year 6 (roughly corresponding to age 10), there is an optional summer school.
  • The summer school could focus on the school curriculum, or it could focus on skills that lead to improved schooling outcomes (for example “grit” as in Alan et al. (2019)).
  • The summer school is free, but enrollment requires active involvement by parents.
  • We are interested in whether participation in the summer school improves child outcomes.

1.3 The data

We have three datasets to study the research question:

  1. school_data_1.csv
  • We use this as an example of how to load data stored in CSV format.
  • This dataset contains a person id, a school id, an indicator variable that takes the value 1 if the individual participated in the summer school, gender, parental income, parental schooling, and test scores in year 5 (before the treatment) and year 6.
  2. school_data_2.dta
  • We use this as an example of how to load data stored in Stata format.
  • This dataset contains the person id, which enables us to link it to the first dataset. We will use this to practice merging data.
  • The dataset also records whether the individual received a reminder letter.
  3. school_data_3.xlsx
  • We use this as an example of how to load data stored in Microsoft Excel format.
  • This dataset contains the person id, which enables us to link it to the first dataset.
  • The dataset also contains test scores from earlier years (before year 5) and later years (after year 6).
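Since the three files come in three different formats, each one needs a different reader. The sketch below shows one way to map a file extension to the matching pandas reader (pd.read_csv, pd.read_stata, pd.read_excel). The names READERS, reader_name and load_dataset are hypothetical helpers introduced only for this illustration, not part of pandas or of the workshop's own code.

```python
from pathlib import Path

# File extension -> name of the pandas function that reads that format.
READERS = {
    ".csv": "read_csv",     # comma-separated text
    ".dta": "read_stata",   # Stata format
    ".xlsx": "read_excel",  # Microsoft Excel workbook
}


def reader_name(path: str) -> str:
    """Return the name of the pandas reader for the file's extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader registered for '{suffix}' files")
    return READERS[suffix]


def load_dataset(path: str):
    """Load a file with the matching pandas reader (hypothetical helper)."""
    import pandas as pd  # imported here so reader_name() works without pandas

    return getattr(pd, reader_name(path))(path)


for file_name in ["school_data_1.csv", "school_data_2.dta", "school_data_3.xlsx"]:
    print(file_name, "->", reader_name(file_name))
```

Calling load_dataset("data/school_data_1.csv") would then be equivalent to calling pd.read_csv on that path, assuming the file sits in a data folder as in the rest of this workshop.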

Let’s get started!







2 Loading & merging the data

2.1 Installing and loading a “package”

The first item on our to-do list is to load the datasets. The first dataset is in CSV format. There are several ways to load a CSV file into R; I am going to use read_csv() from the readr package. Before we can use this package, we need to install it. We install a package with the install.packages() function, passing the package name as a quoted string inside the parentheses. This corresponds to running ssc install outreg to install outreg in Stata.

An important difference from Stata is that we also have to tell R to use the package in every new session, which we do with library(). We only have to install it once, however. The commands below load readr (through the tidyverse, which includes it) together with the other packages used in this workshop:

📌 Load Required Libraries

require("emo")
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("skimr")
library("readstata13")
library("sjPlot")
library("psych")

Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha
library("openxlsx")
library("dplyr")

# Set knitr options
knitr::opts_chunk$set(echo = FALSE)

📌 Load CSV Data

Rows: 3491 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): person_id, school_id, summercamp, female, parental_schooling, paren...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 6 × 8
  person_id school_id summercamp female parental_schooling parental_lincome
      <dbl>     <dbl>      <dbl>  <dbl>              <dbl>            <dbl>
1         1         5          0      1                 10             12.9
2         2        14          1      0                 11             14.7
3         3         7          1      0                 14             16.1
4         4         8          0      0                 12             14.6
5         5         9          1      0                 11             13.8
6         6        26          1      1                 11             14.7
# ℹ 2 more variables: test_year_5 <dbl>, test_year_6 <dbl>

📌 Load Stata Data

     person_id letter
3484      3484      0
3485      3485      0
3486      3486      1
3487      3487      0
3488      3488      1
3489      3489      0
3490      3490      0
3491      3491      0

📌 Load Excel Data

Rows: 3,491
Columns: 10
$ person_id    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
$ test_year_2  <dbl> 1.090117, 1.246309, 2.726472, 2.693032, 1.660545, 2.22377…
$ test_year_3  <dbl> 1.914594, 1.154470, 2.269011, 2.413203, 1.828067, 2.27566…
$ test_year_4  <dbl> 2.065805, 1.582455, 3.247252, 1.479452, 1.361972, 2.38510…
$ test_year_7  <dbl> 2.377697, 1.747376, 3.017764, 2.637954, 1.904636, 3.37613…
$ test_year_8  <dbl> 2.032904, 2.444041, 3.361646, 3.021940, 2.109774, 3.24542…
$ test_year_9  <dbl> 1.493803, 1.663050, 3.387020, 2.761513, 2.285818, 2.96503…
$ test_year_10 <dbl> 1.880512, 1.833769, 2.968617, 2.088086, 1.845694, 3.30819…
$ learnings    <dbl> 10.236394, 8.278911, 8.966529, 8.876466, 8.770518, 10.484…
$ school_id    <dbl> 5, 14, 7, 8, 9, 26, 13, 11, 23, 9, 25, 15, 3, 4, 17, 7, 1…

📌 Load Stata Dataset

* Import the file after setting the working directory
cd "D:_econ_with_r"

import delimited "D:_econ_with_r_data_1.csv", clear

📌 Import Required Libraries

import pandas as pd
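The install-once, load-every-session pattern described for R applies to Python as well: a package is installed once (for example with pip) and then imported in every new session. As a minimal sketch using only the standard library, the snippet below checks whether a package is available before it is imported. The helper name is_installed is an assumption made for this illustration. openpyxl is listed because pandas relies on it to read .xlsx files.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Packages this section relies on: pandas itself, plus openpyxl for Excel files.
for name in ["pandas", "openpyxl"]:
    if is_installed(name):
        status = "installed"
    else:
        status = "missing (install once, e.g. with pip)"
    print(f"{name}: {status}")
```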

📌 Load CSV Data in Python

# Load data
school_data_1 = pd.read_csv("data/school_data_1.csv")

# Show first 5 rows
school_data_1.head()

📌 Load Stata Data in Python

# Load Stata dataset
school_data_2 = pd.read_stata("data/school_data_2.dta")

# Show last 5 rows
school_data_2.tail()

📌 Load Excel Data in Python

# Load Excel dataset
school_data_3 = pd.read_excel("data/school_data_3.xlsx")

# Show dataset info
school_data_3.info()
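Because all three datasets share person_id, they can be linked on that column once loaded. The sketch below illustrates one possible approach with pandas, using small made-up frames in place of the real files so that it runs on its own. The choice of a left join, the validate="one_to_one" check, and dropping the duplicate school_id column from the third dataset are assumptions for this illustration, not necessarily the steps the workshop takes.

```python
import pandas as pd

# Toy stand-ins for the three files; all values below are made up.
school_data_1 = pd.DataFrame({
    "person_id": [1, 2, 3],
    "school_id": [5, 14, 7],
    "summercamp": [0, 1, 1],
})
school_data_2 = pd.DataFrame({
    "person_id": [1, 2, 3],
    "letter": [0, 1, 0],
})
school_data_3 = pd.DataFrame({
    "person_id": [1, 2, 3],
    "school_id": [5, 14, 7],  # also present in school_data_1
    "test_year_2": [1.1, 1.2, 2.7],
})

merged = (
    school_data_1
    # validate raises an error if person_id is duplicated on either side
    .merge(school_data_2, on="person_id", how="left", validate="one_to_one")
    # drop school_id first so the result has no school_id_x / school_id_y pair
    .merge(
        school_data_3.drop(columns="school_id"),
        on="person_id",
        how="left",
        validate="one_to_one",
    )
)

print(merged)
```

A left join keeps every row of the first dataset even when a person is missing from the others. Whether that is the right choice depends on how complete the three files are.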

# load data and assign it to an object with the name school_data_1
school_data_1<-read_csv("data/school_data_1.csv")
Rows: 3491 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (8): person_id, school_id, summercamp, female, parental_schooling, paren...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

import delimited "D:_econ_with_r_data_1.csv", clear

import excel "D:_econ_with_r_data_3.xlsx", sheet("Sheet 1") firstrow clear
