Welcome to my page for STATS 506 Statistical Computing
Assignment 1 STATA
- Problem 1a uses STATA to convert data between long and wide format. Problem 1b and Problem 1c use STATA to perform logistic and robust regression.
- Problem 2 uses STATA to analyze the RECS dataset. I will use other tools to perform the same analysis in subsequent assignments (dplyr in Assignment 2, data.table in Assignment 3).
- Problem 3 uses STATA to analyze the National Health and Nutrition Examination Survey (NHANES) dataset, in particular how audiometry test results differ across demographic groups. (Note: this dataset is used again in Assignment 4, but with SAS.)
Assignment 2 R: dplyr, multidimensional scaling, and Monte Carlo
- Problem 1 uses the dplyr package to analyze RECS data on the roof types of residential buildings in different states.
- Problem 2 performs text cleaning (fuzzy matching) and multidimensional scaling on data from a study of interactions between animals.
- Problem 3 uses Monte Carlo simulation to estimate pi.
- Problem 4 uses Monte Carlo simulation to compare two ways of computing a confidence interval, one via the bootstrap and one via a robust estimator.
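As an illustration of the idea behind Problem 3, here is a minimal Python sketch of estimating pi by Monte Carlo. It is not the assignment's R code; the function name `estimate_pi`, the sample size, and the seed are assumptions made for this example. It draws uniform points in the unit square and uses the fraction that fall inside the quarter circle of radius 1, which approximates pi/4.

```python
import random


def estimate_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi from the share of random points inside the unit quarter circle."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()  # uniform point in [0, 1) x [0, 1)
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter circle has area pi / 4, so scale the hit rate by 4.
    return 4.0 * inside / n_samples


if __name__ == "__main__":
    print(estimate_pi(200_000))
```

The standard error of this estimator shrinks like 1/sqrt(n), so each extra digit of accuracy costs roughly 100 times more samples.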
Assignment 3 R: data.table package and visualization
- Problem 1 uses the data.table package in R to accomplish the same task as Problem 1 of Assignment 2.
- Problem 2 uses the data.table package to analyze flight data for New York City's major airports via a split-apply-combine approach.
- Problem 3 uses the R rvest package for web scraping to get the distances between major airports in the US.
- Problem 4 uses ggmap (an R package built on ggplot2) to visualize flights between major airports in the US, and uses multidimensional scaling to compare the flight schedule patterns of major airlines.
Assignment 4 SAS, SQL and parallel computing
- Problem 1 uses SAS to analyze the National Health and Nutrition Examination Survey (NHANES) dataset, in particular how audiometry test results differ across demographic groups. (This is similar to Assignment 1 Problem 3.)
- Problem 2 uses the lme4 package in R to fit a mixed model to the NHANES data.
- Problem 3 uses SQL within SAS to analyze US Medicare payment data.
- Problem 4 uses parallel computing on Flux (the University of Michigan high-performance computing cluster) to estimate out-of-sample prediction error using cross-validation in R. PBS script1 PBS script2
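The approach in Problem 4 can be sketched in Python as follows. This is an assumption-laden illustration, not the R and PBS code used on Flux: the simulated data, the simple linear model, and the helper names (`make_data`, `fit_line`, `fold_error`, `cv_mse`) are all hypothetical. Each fold's error depends only on its own train/test split, so the folds can be evaluated independently and their squared errors combined at the end. A thread pool is used here only to keep the sketch self-contained; CPU-bound work would normally use separate processes or cluster jobs.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_data(n, seed=0):
    """Simulate y = 2x + 1 + N(0, 1) noise (hypothetical data for the sketch)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fold_error(task):
    """Fit on all rows outside the fold; return squared error on the fold."""
    xs, ys, test_idx = task
    held_out = set(test_idx)
    train_x = [x for i, x in enumerate(xs) if i not in held_out]
    train_y = [y for i, y in enumerate(ys) if i not in held_out]
    b0, b1 = fit_line(train_x, train_y)
    return sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in test_idx)


def cv_mse(xs, ys, k=5, seed=0, workers=4):
    """k-fold cross-validated mean squared prediction error."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint index sets
    tasks = [(xs, ys, fold) for fold in folds]
    # Folds are independent, so they can be mapped over a pool of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sse_per_fold = list(pool.map(fold_error, tasks))
    return sum(sse_per_fold) / len(xs)


if __name__ == "__main__":
    xs, ys = make_data(200)
    print(cv_mse(xs, ys))
```

Because the simulated noise has variance 1, the cross-validated error should land near 1 for this toy model.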
Individual Project: Analysis of Los Angeles Businesses
- Geographic Distribution of Businesses in Los Angeles report & code
- What are declining and evergreen businesses in LA? report & code
Group Project
Tutorial on K-means with visualization and performance evaluation in R, Python, and STATA