Teaching quantitative methods with a New Zealand synthetic unit record dataset

The 2018-2021 General Social Survey (GSS) Teaching Synthetic Unit Record File (SURF) is a dataset of 1,997 synthetic respondents. When its variables are aggregated, they produce population-level estimates for people aged 15 years and older in New Zealand in 2018 and 2021 that are similar to those a researcher would obtain from the GSS microdata.

The dataset request and coordination were led by Kate Prickett and Phillip Worthington at Victoria University of Wellington. Resourcing support for the 2018 data was provided by the libraries of Victoria University of Wellington, University of Auckland, Auckland University of Technology, University of Canterbury, and University of Otago. Resourcing support for the 2021 data was provided by the Victoria University of Wellington library.

The original dataset was developed by Statistics New Zealand (StatsNZ).

Please read this ‘Read Me’ file for more information on this dataset and supporting documents.

The following page contains the data file (2018-2021), along with workshop materials and assignments (geared towards the 2018 data only) that can be used with the GSS SURF to teach bivariate descriptive statistics and multivariable OLS regression accessibly in Excel.

These materials aim to teach the techniques so that students understand the underlying intuition of the models and can translate model output into plain-language findings.

 

Data files

The dataset consists of 1,997 unit records (997 in 2018 and 1,000 in 2021) and 40 variables. The variables cover a range of sociodemographic factors (e.g., age, sex, ethnicity, educational attainment, family structure, labour force status, household income, disability, home ownership, migrant status) and wellbeing indicators (e.g., life satisfaction, self-rated physical health, mental wellbeing, housing conditions, residential mobility, feelings of loneliness), as well as other behaviours and experiences (e.g., trust in government, voting behaviour, experience of discrimination, access to public transit and green space, being a victim of crime). A final survey frequency weight is included in the ‘simple’ file, whereas the ‘with missing’ and ‘original’ datasets also contain a set of replicate survey weights.
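
To illustrate why the frequency weight matters when aggregating unit records into population-level estimates, here is a minimal Python sketch comparing an unweighted and a weighted mean. The column names (life_satisfaction, final_weight) and all values are invented for illustration; check the codebooks for the actual variable names. The replicate weights are not used here.

```python
# Hypothetical toy records: the column names and values are invented for
# illustration and are not taken from the GSS SURF codebooks.
records = [
    {"life_satisfaction": 8, "final_weight": 1200.0},
    {"life_satisfaction": 6, "final_weight": 800.0},
    {"life_satisfaction": 9, "final_weight": 2000.0},
]


def weighted_mean(rows, value_key, weight_key):
    """Return the frequency-weighted mean of value_key over rows."""
    total_weight = sum(row[weight_key] for row in rows)
    weighted_sum = sum(row[value_key] * row[weight_key] for row in rows)
    return weighted_sum / total_weight


# Each record counts once in the unweighted mean, but counts in proportion
# to its frequency weight in the weighted mean.
unweighted = sum(r["life_satisfaction"] for r in records) / len(records)
weighted = weighted_mean(records, "life_satisfaction", "final_weight")

print(f"Unweighted mean: {unweighted:.2f}")
print(f"Weighted mean:   {weighted:.2f}")
```

In this toy example the weighted mean is higher than the unweighted mean because the record with the largest weight also reports the highest score.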

Datasets

  • GSS_SURF_2018_2021.csv/.dta: This is the best choice as an entry-level dataset. Some variables have been recoded for simplicity, and missing data, “don’t know” responses, and “refused to say” responses have been replaced with imputed values.

  • GSS_SURF_2018_2021_with missing.csv/.dta: This dataset is similar to the entry-level dataset, but retains “don’t know” and “refused to say” responses (i.e., does not impute these values).

  • GSS_SURF_2018_2021_ORIGINAL.txt: This is the original dataset provided by StatsNZ (the 2018 and 2021 survey years combined). It contains many non-numeric values that need to be converted to numeric values, as well as “don’t know” values and similar responses. This dataset suits more advanced students who would benefit from working with survey data as they typically look, including a set of replicate survey weights.
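
As a rough idea of the cleaning the original file calls for, the Python sketch below converts a text column to numbers and treats “don’t know” and “refused” responses as missing. The column names, response labels, and comma delimiter are assumptions made for illustration; the real file layout and labels should be taken from the codebooks.

```python
import csv
import io

# Hypothetical raw extract: column names, labels, and the comma delimiter are
# invented for illustration and may not match the StatsNZ file.
raw_text = """age_group,life_sat
25-34,8
35-44,Don't know
45-54,Refused
55-64,10
"""

# Response labels treated as missing (an assumption, not the official list).
MISSING_LABELS = {"don't know", "refused"}


def to_number(value):
    """Convert a text response to int, or None if it is a missing label."""
    text = value.strip()
    if text == "" or text.lower() in MISSING_LABELS:
        return None
    return int(text)


rows = list(csv.DictReader(io.StringIO(raw_text)))
life_sat = [to_number(row["life_sat"]) for row in rows]
n_missing = sum(value is None for value in life_sat)

print(life_sat)
print(f"Missing responses: {n_missing}")
```

Keeping missing responses as None, rather than a numeric code such as 99, prevents them from being averaged in by mistake.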

Dataset codebooks

Dataset construction code

  • GSS SURF – simple dataset creation.do: This is a Stata .do file that reads in the original StatsNZ SURF, conducts recoding (e.g., changing character values to numeric) and adds value labelling. Some variables are collapsed for ease of use (e.g., some top-coding, collapsing of categories), and “don’t know” and “refused to say” answer options are imputed with values. The file is annotated to explain the recoding decisions. The code creates the final working dataset “GSS_SURF_2018_2021”.

  • GSS SURF – simple dataset creation with missings.do: This Stata .do file creates the “with missings” data, retaining “don’t know” and “refused to say” responses (i.e., it does not impute these values).
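
The two simplifications named above, top-coding and collapsing categories, can be sketched in a few lines. This is a Python illustration rather than the Stata code in the .do files, and the variables, category labels, and cap of 5 are hypothetical.

```python
# Hypothetical raw values: the variable names, categories, and the cap of 5
# are invented for illustration and are not taken from the .do files.
household_size = [1, 2, 4, 7, 9]

# Hypothetical mapping from detailed categories to broader groups.
EDUCATION_GROUPS = {
    "No qualification": "School or less",
    "School qualification": "School or less",
    "Diploma": "Post-school",
    "Bachelor's degree": "Degree or higher",
    "Postgraduate degree": "Degree or higher",
}


def top_code(value, cap):
    """Replace any value above cap with cap, leaving lower values unchanged."""
    return min(value, cap)


def collapse(category, mapping):
    """Map a detailed category to its broader group."""
    return mapping[category]


household_size_tc = [top_code(v, 5) for v in household_size]
education_raw = ["Diploma", "No qualification", "Postgraduate degree"]
education_grp = [collapse(c, EDUCATION_GROUPS) for c in education_raw]

print(household_size_tc)
print(education_grp)
```

Both steps discard detail in exchange for fewer, larger categories, which is usually easier for entry-level students to tabulate.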

 

Teaching and assessment materials

These teaching materials are aimed at tertiary students who haven’t engaged with statistics in many years or who have a low statistical baseline to begin with. These students are also assumed to have limited experience using statistical formulas in Excel or creating charts from derived statistics. The focus is on application and policy translation of findings.

These materials were designed using only the 2018 survey wave, but can be adapted to include both the 2018 and 2021 survey waves.

Creating and presenting bivariate statistics

Conducting ANOVA/OLS linear regression

Conducting and presenting OLS multivariable linear regression
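
As one example of the intuition behind OLS output, the Python sketch below (the materials themselves use Excel) fits a one-predictor regression by hand and shows that, with a 0/1 predictor, the intercept is the mean of the reference group and the slope is the difference in group means. The variable names and values are invented, and survey weights are ignored for simplicity.

```python
# Hypothetical toy data: variable names and values are invented for
# illustration and are not drawn from the GSS SURF.
owns_home = [0, 0, 1, 1]  # 0 = does not own home, 1 = owns home
life_sat = [6, 8, 7, 9]   # life satisfaction score


def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = simple_ols(owns_home, life_sat)
mean_non_owners = mean([y for x, y in zip(owns_home, life_sat) if x == 0])
mean_owners = mean([y for x, y in zip(owns_home, life_sat) if x == 1])

print(f"Intercept: {intercept:.2f} (mean for non-owners: {mean_non_owners:.2f})")
print(f"Slope: {slope:.2f} (difference in means: {mean_owners - mean_non_owners:.2f})")
```

Reading the slope as "the average difference in life satisfaction between the two groups" is the kind of translation from model output to plain language that the materials emphasise.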