Introduction
US News reports that the United States has nearly 4000 degree-granting institutions as of 2020. In addition to degree-granting colleges and universities, there are hundreds of other post-secondary schools that provide students with certifications and professional licenses. With such a wide variety of post-secondary options, students have to pay close attention to the various distinctions that make these schools unique. Although the US Department of Education collects data on hundreds of institutional attributes, students only really consider a few of these variables. Some of the common considerations are ranking, size, diversity, student life, and program success.
One variable that stands out is whether or not a school is a single-sex institution. Given the history of mixed-sex education and the progression of women’s rights in the United States, single-sex education might seem like a relic of the past, but there are still a few dozen institutions that are committed to keeping genders separate.
With this in mind, this project will examine some of the key differences between co-ed and single-sex colleges. We will be using the US Department of Education’s college-scorecard data set, which contains a wide range of institution-level data.
This is a very long analysis, so for a summary of all the findings, skip to the conclusion.
Data Preparation
Here, let’s unpack the data set by selecting relevant variables.
setwd("C:/Users/laryl/Desktop/Data Sets/College Score Card")
library(data.table)
college_scorecard_raw <- fread("Most-Recent-Cohorts-All-Data-Elements.csv",
                               select = c('INSTNM', 'MENONLY', 'WOMENONLY','CURROPER', # Institutional Attributes
'LATITUDE','LONGITUDE', # Geography
'HBCU','PBI', 'ANNHI', 'TRIBAL', 'AANAPII',
'HSI','NANTI','RELAFFIL', # Categorical Demographics
'UGDS', 'UGDS_WHITE', 'UGDS_BLACK', 'UGDS_HISP','UGDS_ASIAN',
'UGDS_AIAN','UGDS_NHPI','UGDS_2MOR','UGDS_NRA','UGDS_UNKN', # Numerical Demographics
'ADM_RATE', 'ADM_RATE_ALL', 'SATVRMID', 'SATMTMID',
'SATWRMID', 'ACTCMMID', 'ACTENMID',
'ACTMTMID', 'ACTWRMID', 'SAT_AVG', 'SAT_AVG_ALL', # Admission Statistics
'PCIP01', 'PCIP03', 'PCIP04', 'PCIP05', 'PCIP09', 'PCIP10', 'PCIP11', 'PCIP12',
'PCIP13', 'PCIP14', 'PCIP15', 'PCIP16', 'PCIP19', 'PCIP22',
'PCIP23', 'PCIP24', 'PCIP25', 'PCIP26', 'PCIP27',
'PCIP29', 'PCIP30', 'PCIP31', 'PCIP38', 'PCIP39',
'PCIP40', 'PCIP41', 'PCIP42', 'PCIP43','PCIP44', 'PCIP45', 'PCIP46', 'PCIP47', 'PCIP48', 'PCIP49', 'PCIP50', 'PCIP51', 'PCIP52', 'PCIP54', # Percentage of Majors
'MN_EARN_WNE_P10', 'MD_EARN_WNE_P10', # Earnings After College
'AVGFACSAL', 'PFTFAC', # Faculty Employment
'C150_4','C150_4_WHITE', 'C150_4_BLACK', 'C150_4_HISP',
'C150_4_ASIAN', 'C150_4_AIAN', 'C150_4_NHPI',
'C150_4_2MOR', 'C150_4_NRA', 'C150_4_UNKN', # Completion
'RET_FT4_POOLED', #Retention
'MEDIAN_HH_INC', 'LN_MEDIAN_HH_INC'))# Household Income
str(college_scorecard_raw)
## Classes 'data.table' and 'data.frame': 6806 obs. of 90 variables:
## $ INSTNM : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
## $ MENONLY : chr "0" "0" "0" "0" ...
## $ WOMENONLY : chr "0" "0" "0" "0" ...
## $ CURROPER : int 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : chr "34.783368" "33.505697" "32.362609" "34.724557" ...
## $ LONGITUDE : chr "-86.568502" "-86.799345" "-86.17401" "-86.640449" ...
## $ HBCU : chr "1" "0" "0" "0" ...
## $ PBI : chr "0" "0" "1" "0" ...
## $ ANNHI : chr "0" "0" "0" "0" ...
## $ TRIBAL : chr "0" "0" "0" "0" ...
## $ AANAPII : chr "0" "0" "0" "0" ...
## $ HSI : chr "0" "0" "0" "0" ...
## $ NANTI : chr "0" "0" "0" "0" ...
## $ RELAFFIL : chr "NULL" "NULL" "74" "NULL" ...
## $ UGDS : chr "4990" "13186" "351" "7458" ...
## $ UGDS_WHITE : chr "0.0186" "0.5717" "0.2393" "0.7167" ...
## $ UGDS_BLACK : chr "0.912" "0.2553" "0.7151" "0.0969" ...
## $ UGDS_HISP : chr "0.0088" "0.0334" "0.0171" "0.0528" ...
## $ UGDS_ASIAN : chr "0.0018" "0.0633" "0.0057" "0.0381" ...
## $ UGDS_AIAN : chr "0.0022" "0.0034" "0.0057" "0.0095" ...
## $ UGDS_NHPI : chr "0.0016" "0.0002" "0" "0.0008" ...
## $ UGDS_2MOR : chr "0.0118" "0.0457" "0" "0.0296" ...
## $ UGDS_NRA : chr "0.007" "0.0213" "0" "0.0223" ...
## $ UGDS_UNKN : chr "0.0361" "0.0058" "0.0171" "0.0333" ...
## $ ADM_RATE : chr "0.8986" "0.9211" "NULL" "0.8087" ...
## $ ADM_RATE_ALL : chr "0.8986" "0.9211" "NULL" "0.8087" ...
## $ SATVRMID : chr "475" "555" "NULL" "630" ...
## $ SATMTMID : chr "465" "555" "NULL" "565" ...
## $ SATWRMID : chr "414" "NULL" "NULL" "NULL" ...
## $ ACTCMMID : chr "18" "25" "NULL" "28" ...
## $ ACTENMID : chr "17" "27" "NULL" "30" ...
## $ ACTMTMID : chr "17" "23" "NULL" "27" ...
## $ ACTWRMID : chr "NULL" "NULL" "NULL" "NULL" ...
## $ SAT_AVG : chr "957" "1220" "NULL" "1314" ...
## $ SAT_AVG_ALL : chr "957" "1220" "NULL" "1314" ...
## $ PCIP01 : chr "0.0394" "0" "0" "0" ...
## $ PCIP03 : chr "0.0237" "0" "0" "0" ...
## $ PCIP04 : chr "0.0039" "0" "0" "0" ...
## $ PCIP05 : chr "0" "0.0016" "0" "0" ...
## $ PCIP09 : chr "0" "0.0375" "0" "0.0194" ...
## $ PCIP10 : chr "0.0394" "0" "0" "0" ...
## $ PCIP11 : chr "0.0592" "0.0139" "0" "0.059" ...
## $ PCIP12 : chr "0" "0" "0" "0" ...
## $ PCIP13 : chr "0.071" "0.0717" "0" "0.0283" ...
## $ PCIP14 : chr "0.1183" "0.0813" "0" "0.2892" ...
## $ PCIP15 : chr "0.0197" "0" "0" "0" ...
## $ PCIP16 : chr "0" "0.004" "0" "0.017" ...
## $ PCIP19 : chr "0.0394" "0" "0.1846" "0" ...
## $ PCIP22 : chr "0" "0" "0" "0" ...
## $ PCIP23 : chr "0.0158" "0.0207" "0" "0.0153" ...
## $ PCIP24 : chr "0.0473" "0.0351" "0.0308" "0" ...
## $ PCIP25 : chr "0" "0" "0" "0" ...
## $ PCIP26 : chr "0.0927" "0.0876" "0" "0.0436" ...
## $ PCIP27 : chr "0.0059" "0.0112" "0" "0.0153" ...
## $ PCIP29 : chr "0" "0" "0" "0" ...
## $ PCIP30 : chr "0" "0" "0" "0.0008" ...
## $ PCIP31 : chr "0.002" "0" "0" "0.021" ...
## $ PCIP38 : chr "0" "0.0064" "0" "0.0024" ...
## $ PCIP39 : chr "0" "0" "0.2154" "0" ...
## $ PCIP40 : chr "0.0355" "0.0235" "0" "0.0307" ...
## $ PCIP41 : chr "0" "0.0008" "0" "0" ...
## $ PCIP42 : chr "0.0631" "0.0602" "0" "0.0202" ...
## $ PCIP43 : chr "0.0572" "0.0267" "0.1077" "0" ...
## $ PCIP44 : chr "0.0493" "0.0263" "0" "0" ...
## $ PCIP45 : chr "0.0355" "0.0315" "0" "0.0242" ...
## $ PCIP46 : chr "0" "0" "0" "0" ...
## $ PCIP47 : chr "0" "0" "0" "0" ...
## $ PCIP48 : chr "0" "0" "0" "0" ...
## $ PCIP49 : chr "0" "0" "0" "0" ...
## $ PCIP50 : chr "0.0237" "0.0339" "0" "0.038" ...
## $ PCIP51 : chr "0" "0.2255" "0" "0.1543" ...
## $ PCIP52 : chr "0.1578" "0.1908" "0.4615" "0.2108" ...
## $ PCIP54 : chr "0" "0.01" "0" "0.0105" ...
## $ MN_EARN_WNE_P10 : chr "35500" "48400" "47600" "52000" ...
## $ MD_EARN_WNE_P10 : chr "31000" "41200" "39600" "46700" ...
## $ AVGFACSAL : chr "7101" "10717" "4292" "9442" ...
## $ PFTFAC : chr "0.7411" "0.7766" "1" "0.6544" ...
## $ C150_4 : chr "0.2685" "0.5829" "0.4" "0.5187" ...
## $ C150_4_WHITE : chr "0.25" "0.5769" "0.6667" "0.5417" ...
## $ C150_4_BLACK : chr "0.2681" "0.5255" "0" "0.3956" ...
## $ C150_4_HISP : chr "0.25" "0.62" "NULL" "0.5714" ...
## $ C150_4_ASIAN : chr "NULL" "0.789" "NULL" "0.619" ...
## $ C150_4_AIAN : chr "NULL" "0.5" "NULL" "0.4286" ...
## $ C150_4_NHPI : chr "0" "1" "NULL" "NULL" ...
## $ C150_4_2MOR : chr "0.25" "0.6" "NULL" "0.2222" ...
## $ C150_4_NRA : chr "NULL" "0.7647" "NULL" "0.5" ...
## $ C150_4_UNKN : chr "0.375" "0.5" "0" "0.5882" ...
## $ RET_FT4_POOLED : chr "0.5978" "0.8303" "0.2143" "0.8269" ...
## $ MEDIAN_HH_INC : chr "49720.22" "55735.22" "53683.7" "58688.62" ...
## $ LN_MEDIAN_HH_INC: chr "10.75" "10.8599996566772" "10.8400001525878" "10.9300003051757" ...
## - attr(*, ".internal.selfref")=<externalptr>
Clearly, 90 variables is a lot to consider. But as we move through this analysis, some of the variables will be eliminated or fused with other variables. For now, our concern is fixing the data types of these attributes. The fread function read many of the numeric variables in the data set as character variables, most likely because missing values are stored in the file as the string “NULL”. So our first step will be to convert these attributes to their appropriate types.
# Convert columns 5:6 and 15:90 to numeric ("NULL" strings become NA)
college_scorecard_clean <- college_scorecard_raw[, c(5:6, 15:90) := lapply(.SD, as.numeric), .SDcols = c(5:6, 15:90)]
str(college_scorecard_clean)
## Classes 'data.table' and 'data.frame': 6806 obs. of 90 variables:
## $ INSTNM : chr "Alabama A & M University" "University of Alabama at Birmingham" "Amridge University" "University of Alabama in Huntsville" ...
## $ MENONLY : chr "0" "0" "0" "0" ...
## $ WOMENONLY : chr "0" "0" "0" "0" ...
## $ CURROPER : int 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 34.8 33.5 32.4 34.7 32.4 ...
## $ LONGITUDE : num -86.6 -86.8 -86.2 -86.6 -86.3 ...
## $ HBCU : chr "1" "0" "0" "0" ...
## $ PBI : chr "0" "0" "1" "0" ...
## $ ANNHI : chr "0" "0" "0" "0" ...
## $ TRIBAL : chr "0" "0" "0" "0" ...
## $ AANAPII : chr "0" "0" "0" "0" ...
## $ HSI : chr "0" "0" "0" "0" ...
## $ NANTI : chr "0" "0" "0" "0" ...
## $ RELAFFIL : chr "NULL" "NULL" "74" "NULL" ...
## $ UGDS : num 4990 13186 351 7458 3903 ...
## $ UGDS_WHITE : num 0.0186 0.5717 0.2393 0.7167 0.0167 ...
## $ UGDS_BLACK : num 0.912 0.2553 0.7151 0.0969 0.9352 ...
## $ UGDS_HISP : num 0.0088 0.0334 0.0171 0.0528 0.0095 0.0499 0.0239 0.0311 0.0117 0.0342 ...
## $ UGDS_ASIAN : num 0.0018 0.0633 0.0057 0.0381 0.0041 0.0116 0.0041 0.0073 0.0221 0.0236 ...
## $ UGDS_AIAN : num 0.0022 0.0034 0.0057 0.0095 0.0013 0.0035 0.0041 0.0143 0.004 0.0038 ...
## $ UGDS_NHPI : num 0.0016 0.0002 0 0.0008 0.0005 0.001 0 0.0011 0.0007 0.0005 ...
## $ UGDS_2MOR : num 0.0118 0.0457 0 0.0296 0.0102 0.0338 0.0058 0.0212 0.0393 0.0225 ...
## $ UGDS_NRA : num 0.007 0.0213 0 0.0223 0.0102 0.0183 0.0017 0 0.0488 0.049 ...
## $ UGDS_UNKN : num 0.0361 0.0058 0.0171 0.0333 0.0123 0.0045 0.0017 0.0322 0.0084 0.0036 ...
## $ ADM_RATE : num 0.899 0.921 NA 0.809 0.977 ...
## $ ADM_RATE_ALL : num 0.899 0.921 NA 0.809 0.977 ...
## $ SATVRMID : num 475 555 NA 630 480 590 NA NA 540 615 ...
## $ SATMTMID : num 465 555 NA 565 465 580 NA NA 525 615 ...
## $ SATWRMID : num 414 NA NA NA NA 540 NA NA NA 570 ...
## $ ACTCMMID : num 18 25 NA 28 18 27 NA NA 21 28 ...
## $ ACTENMID : num 17 27 NA 30 17 29 NA NA 21 29 ...
## $ ACTMTMID : num 17 23 NA 27 17 25 NA NA 19 26 ...
## $ ACTWRMID : num NA NA NA NA NA 8 NA NA NA 8 ...
## $ SAT_AVG : num 957 1220 NA 1314 972 ...
## $ SAT_AVG_ALL : num 957 1220 NA 1314 972 ...
## $ PCIP01 : num 0.0394 0 0 0 0 0 0 0 0 0.0414 ...
## $ PCIP03 : num 0.0237 0 0 0 0 0.0053 0 0 0.0217 0.0202 ...
## $ PCIP04 : num 0.0039 0 0 0 0 0 0 0 0 0.0181 ...
## $ PCIP05 : num 0 0.0016 0 0 0 0.0027 0 0 0 0 ...
## $ PCIP09 : num 0 0.0375 0 0.0194 0.0892 0.0973 0 0 0.0339 0.0566 ...
## $ PCIP10 : num 0.0394 0 0 0 0 0 0 0 0 0 ...
## $ PCIP11 : num 0.0592 0.0139 0 0.059 0.0585 ...
## $ PCIP12 : num 0 0 0 0 0 0 0.0092 0 0 0 ...
## $ PCIP13 : num 0.071 0.0717 0 0.0283 0.1169 ...
## $ PCIP14 : num 0.1183 0.0813 0 0.2892 0 ...
## $ PCIP15 : num 0.0197 0 0 0 0 0 0.0552 0 0 0 ...
## $ PCIP16 : num 0 0.004 0 0.017 0 0.0058 0 0 0.0041 0.0062 ...
## $ PCIP19 : num 0.0394 0 0.1846 0 0 ...
## $ PCIP22 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PCIP23 : num 0.0158 0.0207 0 0.0153 0.0123 0.0093 0 0.0369 0.0136 0.0102 ...
## $ PCIP24 : num 0.0473 0.0351 0.0308 0 0 ...
## $ PCIP25 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PCIP26 : num 0.0927 0.0876 0 0.0436 0.0831 ...
## $ PCIP27 : num 0.0059 0.0112 0 0.0153 0.0169 0.0076 0 0.0224 0.0041 0.0058 ...
## $ PCIP29 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PCIP30 : num 0 0 0 0.0008 0.02 0.0148 0.0529 0.0171 0.0217 0.02 ...
## $ PCIP31 : num 0.002 0 0 0.021 0.0108 0 0 0.0092 0.0474 0 ...
## $ PCIP38 : num 0 0.0064 0 0.0024 0 0.0016 0 0.0066 0 0.0021 ...
## $ PCIP39 : num 0 0 0.215 0 0 ...
## $ PCIP40 : num 0.0355 0.0235 0 0.0307 0.0231 0.0114 0 0.0013 0.0176 0.0092 ...
## $ PCIP41 : num 0e+00 8e-04 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 ...
## $ PCIP42 : num 0.0631 0.0602 0 0.0202 0.06 0.0376 0 0.0211 0.0379 0.0279 ...
## $ PCIP43 : num 0.0572 0.0267 0.1077 0 0.0938 ...
## $ PCIP44 : num 0.0493 0.0263 0 0 0.0646 0.0092 0 0 0 0.016 ...
## $ PCIP45 : num 0.0355 0.0315 0 0.0242 0.0138 0.0364 0 0.0211 0.0407 0.0337 ...
## $ PCIP46 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PCIP47 : num 0 0 0 0 0 0 0.0322 0 0 0 ...
## $ PCIP48 : num 0 0 0 0 0 ...
## $ PCIP49 : num 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 0e+00 2e-04 ...
## $ PCIP50 : num 0.0237 0.0339 0 0.038 0.0585 0.0244 0 0.0303 0.0298 0.0296 ...
## $ PCIP51 : num 0 0.226 0 0.154 0.168 ...
## $ PCIP52 : num 0.158 0.191 0.462 0.211 0.106 ...
## $ PCIP54 : num 0 0.01 0 0.0105 0.0046 0.0124 0 0.0079 0.0068 0.0087 ...
## $ MN_EARN_WNE_P10 : num 35500 48400 47600 52000 30600 51600 32400 42400 38000 56300 ...
## $ MD_EARN_WNE_P10 : num 31000 41200 39600 46700 27700 44500 27700 38700 33300 48800 ...
## $ AVGFACSAL : num 7101 10717 4292 9442 7754 ...
## $ PFTFAC : num 0.741 0.777 1 0.654 0.583 ...
## $ C150_4 : num 0.269 0.583 0.4 0.519 0.3 ...
## $ C150_4_WHITE : num 0.25 0.577 0.667 0.542 0.4 ...
## $ C150_4_BLACK : num 0.268 0.525 0 0.396 0.295 ...
## $ C150_4_HISP : num 0.25 0.62 NA 0.571 0.316 ...
## $ C150_4_ASIAN : num NA 0.789 NA 0.619 1 ...
## $ C150_4_AIAN : num NA 0.5 NA 0.429 0 ...
## $ C150_4_NHPI : num 0 1 NA NA 0 0.75 NA NA 0 NA ...
## $ C150_4_2MOR : num 0.25 0.6 NA 0.222 0.3 ...
## $ C150_4_NRA : num NA 0.765 NA 0.5 0.588 ...
## $ C150_4_UNKN : num 0.375 0.5 0 0.588 0.214 ...
## $ RET_FT4_POOLED : num 0.598 0.83 0.214 0.827 0.59 ...
## $ MEDIAN_HH_INC : num 49720 55735 53684 58689 46065 ...
## $ LN_MEDIAN_HH_INC: num 10.8 10.9 10.8 10.9 10.7 ...
## - attr(*, ".internal.selfref")=<externalptr>
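To illustrate why these columns arrive as text, here is a minimal base R sketch on a two-row excerpt (the CSV text below is a hand-made stand-in built from values shown above, not the real file). It assumes the character type comes from the literal string "NULL" in otherwise numeric columns, and shows both the after-the-fact conversion and the alternative of declaring "NULL" as a missing-value marker at read time. fread also accepts an na.strings argument, so the same idea should carry over, though that is not what this analysis does.

```r
# Hand-made excerpt: missing values are written as the literal string NULL
csv_text <- 'INSTNM,UGDS,ADM_RATE
"Alabama A & M University",4990,0.8986
"Amridge University",351,NULL'

# Read as-is: the NULL string forces ADM_RATE to be character
as_read <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_read$ADM_RATE)

# Converting afterwards turns "NULL" into NA (R emits a coercion warning)
converted <- suppressWarnings(as.numeric(as_read$ADM_RATE))
converted

# Alternative: declare "NULL" as a missing-value marker while reading
with_na <- read.csv(text = csv_text, na.strings = "NULL", stringsAsFactors = FALSE)
class(with_na$ADM_RATE)
```

Either route ends with numeric columns that use NA for missing values. The read-time option avoids the coercion warnings.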
Our current data set has two columns that tell us the sex-specific status of our institutions (“MENONLY” and “WOMENONLY”). Let’s take a look at what values these variables contain:
table(college_scorecard_clean$MENONLY)
##
## 0 1 NULL
## 6270 61 475
table(college_scorecard_clean$WOMENONLY)
##
## 0 1 NULL
## 6296 35 475
Although this is great information and we can see which colleges are male-only and female-only, there are a few problems. First, there are null values that are not beneficial to us. The second problem is the fact that we don’t need the variables to be separate from each other. That separation is going to make creating graphs later slightly more challenging. Lastly, there is no easy way to view co-ed colleges, as both columns only tell us whether a college is for a specific sex or not. This means there isn’t a clear way to quickly distinguish co-ed colleges from the other sex-specific colleges when using one of the columns.
Thus, our next block will eliminate the null values, fuse the MENONLY and WOMENONLY columns, and create a new category called “CO-ED.” We will also make sure the colleges are operational.
# Load data manipulation packages
library(dplyr)
library(tidyr)
# Get rid of nulls in both columns
college_scorecard_null_less <- college_scorecard_clean[MENONLY %in% c(0,1)][WOMENONLY %in% c(0,1)]
# Rename values so that each column names the sex it flags instead of 1, and 'NA' instead of 0
college_scorecard_rewritten <- college_scorecard_null_less %>%
  mutate(MENONLY = if_else(MENONLY == 0, 'NA', 'MENONLY')) %>%
  mutate(WOMENONLY = if_else(WOMENONLY == 0, 'NA', 'WOMENONLY'))
# Fuse columns
college_scorecard_fused <- unite(college_scorecard_rewritten, MENONLY, WOMENONLY, col = "SEX_SPECIFIC", sep = "_")
# Rewrite categories so that we have three distinct categories
college_scorecard_prepped <- college_scorecard_fused %>%
  mutate(SEX_SPECIFIC = dplyr::recode(SEX_SPECIFIC,
                                      'MENONLY_NA' = 'MEN ONLY',
                                      'NA_NA' = 'CO-ED',
                                      'NA_WOMENONLY'= 'WOMEN ONLY' ))
# Filter for colleges that are operational
college_scorecard_completely_prepped <- college_scorecard_prepped[CURROPER == 1]
table(college_scorecard_completely_prepped[["SEX_SPECIFIC"]])
##
## CO-ED MEN ONLY WOMEN ONLY
## 5950 61 35
As we can see, there are 3 distinct categories in a single variable. We can now begin to get some insights from the data.
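For readers who want to see the fusing logic in isolation, here is a minimal base R sketch on a made-up three-row data frame (the toy rows are an assumption for illustration). It derives the same three labels in one step with nested ifelse() calls rather than the rename, unite, and recode sequence used above.

```r
# Toy stand-in for the two flag columns (values are strings, as in the data set)
toy <- data.frame(
  MENONLY   = c("0", "1", "0"),
  WOMENONLY = c("0", "0", "1"),
  stringsAsFactors = FALSE
)

# A school is MEN ONLY if its men-only flag is set, WOMEN ONLY if its
# women-only flag is set, and CO-ED otherwise
toy$SEX_SPECIFIC <- ifelse(toy$MENONLY == "1", "MEN ONLY",
                    ifelse(toy$WOMENONLY == "1", "WOMEN ONLY", "CO-ED"))

table(toy$SEX_SPECIFIC)
```

The multi-step version in the analysis has the advantage of exposing any unexpected flag combination as its own category, which the nested ifelse() would silently fold into one of the three labels.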
At a Glance
Before going any further, we should get a visual overview of our data set. This may reveal information that is not as apparent when looking at raw statistics. I think this is the perfect opportunity to create a map that shows the location of the colleges, grouped by sex-specificity.
#Create three data sets, one for each sex-specific category
Coed <- college_scorecard_prepped[SEX_SPECIFIC == "CO-ED"]
Men_only <- college_scorecard_prepped[SEX_SPECIFIC == "MEN ONLY"]
Women_only <- college_scorecard_prepped[SEX_SPECIFIC == "WOMEN ONLY"]
Let’s create the map:
#Load map packages
library(leaflet)
library(leaflet.extras)
library(htmltools)
#Create Map
map <- leaflet() %>%
  addProviderTiles("CartoDB") %>%
addCircleMarkers(data = Men_only,
radius = 1,
color = "#1298d0",
label = ~htmlEscape(Men_only[["INSTNM"]]),
group = "MEN ONLY") %>%
addCircleMarkers(data = Women_only,
radius = 1,
color = "#cc8834",
label = ~htmlEscape(Women_only[["INSTNM"]]),
group = "WOMEN ONLY") %>%
addCircleMarkers(data = Coed,
radius = 1,
color = "#f1cb35",
label = ~htmlEscape(Coed[["INSTNM"]]),
group = "CO-ED") %>%
addLayersControl(overlayGroups = c("MEN ONLY", "WOMEN ONLY", "CO-ED"))%>%
setView(lat = 39.8282, lng = -98.5795, zoom = 4)
#Display Map
map
This interactive map displays quite a bit of information. First, we can see the difference in the number of colleges there are in each of the categories by changing the layers in the top right corner. Additionally, we can see that most of the colleges that are sex-specific are concentrated along the eastern side of the United States. Moreover, as you hover over the male-only colleges you can see that many of them are religiously affiliated.
Key Variables
Now, we can begin to analyze some of the variables that might distinguish the sex-specific categories.
The variables are:
- Demographics (Categorically and Numerically)
- Admission Statistics
- Percentage of Degrees Awarded in Academic Divisions
- Completion Rates by Ethnicities
- Retention Rates
- Faculty Employment Information
- Household Income
Demographics
Starting with the categorical demographic variables, there are 8 columns of interest:
- HBCU (Flag for Historically Black College or University)
- PBI (Flag for predominantly black institution)
- ANNHI (Flag for Alaska Native Hawaiian serving institution)
- TRIBAL (Flag for tribal college and university)
- AANAPII (Flag for Asian American, Native American, and Pacific Islander-serving institution)
- HSI (Flag for Hispanic-serving institution)
- NANTI (Flag for Native American non-tribal institution)
- RELAFFIL (Religious affiliation of the institution)
Let’s take a look at what values one of these columns has.
table(college_scorecard_completely_prepped$PBI)
##
## 0 1 NULL
## 5949 94 3
Like the sex-specific variables above, we have null values which don’t give us useful information. Moreover, because there are so many categories, we don’t want to create a chart for each one. Instead, we will be fusing the first 7 variables that are flagged for particular ethnicities, just like we did for the sex-specific categories.
Additionally, because the religious column has dozens of categories, creating a chart for all of them does not make sense. Instead, we are going to collapse all of the categories into religious and non-religious.
# Data Manipulations
library(forcats)
# Change values so that they are strings
college_scorecard_cat_demographics_rw <- college_scorecard_completely_prepped %>%
  mutate(HBCU = dplyr::recode(HBCU,
'0' = 'Not HBCU',
'1' = 'HBCU',
'NULL'= 'Not HBCU' )) %>%
mutate(PBI = dplyr::recode(PBI,
'0' = 'Not PBI',
'1' = 'PBI',
'NULL'= 'Not PBI' )) %>%
mutate(ANNHI = dplyr::recode(ANNHI,
'0' = 'Not ANNHI',
'1' = 'ANNHI',
'NULL'= 'Not ANNHI' )) %>%
mutate(TRIBAL = dplyr::recode(TRIBAL,
'0' = 'Not TRIBAL',
'1' = 'TRIBAL',
'NULL'= 'Not TRIBAL' )) %>%
mutate(AANAPII = dplyr::recode(AANAPII,
'0' = 'Not AANAPII',
'1' = 'AANAPII',
'NULL'= 'Not AANAPII' )) %>%
mutate(HSI = dplyr::recode(HSI,
'0' = 'Not HSI',
'1' = 'HSI',
'NULL'= 'Not HSI' )) %>%
mutate(NANTI = dplyr::recode(NANTI,
'0' = 'Not NANTI',
'1' = 'NANTI',
'NULL'= 'Not NANTI' )) %>%
# Collapse categories so that we have No Affiliation and Religious Affiliation
mutate(RELAFFIL = fct_other(RELAFFIL,
keep = "NULL"))%>%
mutate(RELAFFIL = dplyr::recode(RELAFFIL,
"NULL" = "No Affiliation",
"Other" = "Religious Affiliation"))
# Fuse categories into one variable
college_scorecard_cat_demographics_fused <- unite(college_scorecard_cat_demographics_rw, HBCU, PBI, ANNHI, TRIBAL, AANAPII, HSI, NANTI, col = "DEMOGRAPHICS", sep = "/")
table(college_scorecard_cat_demographics_fused$DEMOGRAPHICS)
##
## HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI
## 99
## Not HBCU/Not PBI/ANNHI/Not TRIBAL/AANAPII/Not HSI/Not NANTI
## 17
## Not HBCU/Not PBI/ANNHI/Not TRIBAL/Not AANAPII/HSI/NANTI
## 1
## Not HBCU/Not PBI/ANNHI/Not TRIBAL/Not AANAPII/Not HSI/NANTI
## 9
## Not HBCU/Not PBI/ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI
## 2
## Not HBCU/Not PBI/ANNHI/TRIBAL/AANAPII/Not HSI/Not NANTI
## 1
## Not HBCU/Not PBI/ANNHI/TRIBAL/Not AANAPII/Not HSI/Not NANTI
## 5
## Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/AANAPII/HSI/Not NANTI
## 80
## Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/AANAPII/Not HSI/Not NANTI
## 60
## Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/HSI/NANTI
## 1
## Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/HSI/Not NANTI
## 367
## Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/NANTI
## 17
## Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI
## 5264
## Not HBCU/Not PBI/Not ANNHI/TRIBAL/Not AANAPII/Not HSI/Not NANTI
## 29
## Not HBCU/PBI/Not ANNHI/Not TRIBAL/AANAPII/HSI/Not NANTI
## 1
## Not HBCU/PBI/Not ANNHI/Not TRIBAL/AANAPII/Not HSI/Not NANTI
## 1
## Not HBCU/PBI/Not ANNHI/Not TRIBAL/Not AANAPII/HSI/Not NANTI
## 8
## Not HBCU/PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI
## 84
As we can see from the various categories that were created, there are many schools that fit into multiple categories. Again, having many categories does not give us a clear picture. The solution to this problem is to collapse the categories that have multiple flags into a single one called “Multiple.”
#Clean up categories
college_scorecard_cat_demographics_complete <- college_scorecard_cat_demographics_fused %>%
  mutate(DEMOGRAPHICS = dplyr::recode(DEMOGRAPHICS,
'HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI' = 'HBCU',
'Not HBCU/Not PBI/ANNHI/Not TRIBAL/Not AANAPII/HSI/NANTI' = 'ANNHI/HSI/NANTI',
'Not HBCU/Not PBI/ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI' = 'ANNHI',
'Not HBCU/Not PBI/ANNHI/TRIBAL/Not AANAPII/Not HSI/Not NANTI' = 'ANNHI/TRIBAL',
'Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/AANAPII/Not HSI/Not NANTI' = 'AANAPII',
'Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/HSI/Not NANTI' = 'HSI',
'Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI' = 'None',
'Not HBCU/PBI/Not ANNHI/Not TRIBAL/AANAPII/HSI/Not NANTI'= 'PBI/AANAPII/HSI',
'Not HBCU/PBI/Not ANNHI/Not TRIBAL/Not AANAPII/HSI/Not NANTI' = 'PBI/HSI',
'Not HBCU/Not PBI/ANNHI/Not TRIBAL/AANAPII/Not HSI/Not NANTI' = 'ANNHI/AANAPII',
'Not HBCU/Not PBI/ANNHI/Not TRIBAL/Not AANAPII/Not HSI/NANTI' = 'ANNHI/NANTI',
'Not HBCU/Not PBI/ANNHI/TRIBAL/AANAPII/Not HSI/Not NANTI' = 'ANNHI/TRIBAL/AANAPII',
'Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/AANAPII/HSI/Not NANTI' = 'AANAPII/HSI',
'Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/HSI/NANTI'= 'HSI/NANTI',
'Not HBCU/Not PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/NANTI' = 'NANTI',
'Not HBCU/Not PBI/Not ANNHI/TRIBAL/Not AANAPII/Not HSI/Not NANTI' = 'TRIBAL',
'Not HBCU/PBI/Not ANNHI/Not TRIBAL/AANAPII/Not HSI/Not NANTI' = 'PBI/AANAPII',
'Not HBCU/PBI/Not ANNHI/Not TRIBAL/Not AANAPII/Not HSI/Not NANTI' = 'PBI')) %>%
# Collapse all multi-categories into a single one
mutate(DEMOGRAPHICS = fct_collapse(DEMOGRAPHICS,
"Multiple" = c( 'ANNHI/HSI/NANTI', 'ANNHI/TRIBAL', 'PBI/AANAPII/HSI',
'PBI/HSI', 'ANNHI/AANAPII', 'ANNHI/NANTI',
'ANNHI/TRIBAL/AANAPII', 'AANAPII/HSI',
'HSI/NANTI', 'PBI/AANAPII'
)))
table(college_scorecard_cat_demographics_complete$DEMOGRAPHICS)
##
## AANAPII Multiple ANNHI HBCU HSI NANTI None PBI
## 60 124 2 99 367 17 5264 84
## TRIBAL
## 29
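The collapse into “Multiple” can also be expressed by counting flags per school instead of listing every combination. The following is a minimal base R sketch of that idea on a made-up table with only three of the seven flags (the toy rows and the reduced column set are assumptions for illustration, not the approach used in this analysis).

```r
# Toy flag table: 1 = flag set, 0 = not set
flags <- data.frame(
  HBCU = c(1, 0, 0, 0),
  PBI  = c(0, 0, 1, 0),
  HSI  = c(0, 1, 1, 0)
)

# Number of flags set for each school
n_flags <- rowSums(flags == 1)

# Name of the first flag that is set (only meaningful when exactly one is set)
first_flag <- names(flags)[max.col(as.matrix(flags), ties.method = "first")]

# None if no flag, Multiple if more than one, otherwise the single flag's name
demographics <- ifelse(n_flags == 0, "None",
                ifelse(n_flags > 1, "Multiple", first_flag))
demographics
```

This version would not need updating if a new combination of flags appeared in a later release of the data, whereas the explicit recode makes every combination visible.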
The 18 categories have been reduced to 9. We can now proceed to the next phase, which is to produce a plot that will test if there is a relationship between sex-specificity and demographics. One of the options we can use to plot a categorical variable against another categorical variable is a mosaic plot.
library(vcd)
tbl_status <- xtabs(~ DEMOGRAPHICS + SEX_SPECIFIC, college_scorecard_cat_demographics_complete)
ftable(tbl_status)
## SEX_SPECIFIC CO-ED MEN ONLY WOMEN ONLY
## DEMOGRAPHICS
## AANAPII 59 0 1
## Multiple 122 0 2
## ANNHI 2 0 0
## HBCU 96 1 2
## HSI 365 0 2
## NANTI 17 0 0
## None 5176 60 28
## PBI 84 0 0
## TRIBAL 29 0 0
mosaic(tbl_status,
labeling_args = list(rot_labels = c(top = 90, left = 0),
gp_labels=gpar(fontsize= 9),
offset_varnames = c(top = 2, left = 2), offset_labels = c(left = 0.5, top =2)),
spacing = spacing_increase(start = unit(0.45, "lines"), rate = 1),
margins = c(top = 0.25, bottom = 0.5),
gp = shading_hcl,
legend = T)
In this plot, each rectangle represents the intersection of 1 category
from one variable and another category from another variable. For
example, the largest rectangle represents the number of colleges that
are both co-ed and have no ethnic flag. The larger the area of the
rectangle, the more observations are in it. If there is no intersection
between the variable categories, a circle with a line is used.
The Pearson residual colors, essentially, tell us whether we have more or fewer colleges than expected if we were to assume that demographics and sex-specificity are independent (that is, there is no relationship). The presence of blue and red tells us that the variables are not independent, meaning they have some kind of relationship. As we can see in the plot, all of the rectangles are grey, which means this plot gives us no evidence of a relationship between the two variables.
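The residuals behind the shading can be computed directly. Here is a minimal base R sketch that re-enters the counts from the contingency table above and uses chisq.test() to obtain the Pearson residuals, (observed - expected) / sqrt(expected), under independence. The comparison against 2 reflects my understanding of the default cutoff at which vcd's shading starts to color a cell, so treat that threshold as an assumption.

```r
# Counts copied from the DEMOGRAPHICS x SEX_SPECIFIC table above
counts <- matrix(
  c(  59,  0,  1,
     122,  0,  2,
       2,  0,  0,
      96,  1,  2,
     365,  0,  2,
      17,  0,  0,
    5176, 60, 28,
      84,  0,  0,
      29,  0,  0),
  ncol = 3, byrow = TRUE,
  dimnames = list(
    DEMOGRAPHICS = c("AANAPII", "Multiple", "ANNHI", "HBCU", "HSI",
                     "NANTI", "None", "PBI", "TRIBAL"),
    SEX_SPECIFIC = c("CO-ED", "MEN ONLY", "WOMEN ONLY")
  )
)

# Many expected counts are small, so R warns about the chi-squared
# approximation; we only need the residuals here
independence_test <- suppressWarnings(chisq.test(counts))
pearson <- independence_test$residuals

round(pearson, 2)

# Largest residual in absolute value
max(abs(pearson))
```

Because so many cells have very small expected counts, the residuals for the sparse categories should be read with caution.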
Now, if we take a look at religious affiliation and plot it against sex-specificity, we might get something different.
tbl_status2 <- xtabs(~ RELAFFIL + SEX_SPECIFIC, college_scorecard_cat_demographics_complete)
ftable(tbl_status2)
## SEX_SPECIFIC CO-ED MEN ONLY WOMEN ONLY
## RELAFFIL
## No Affiliation 5107 28 19
## Religious Affiliation 843 33 16
mosaic(tbl_status2,
labeling_args = list(rot_labels = c(top = 90, left = 0),
gp_labels=gpar(fontsize= 8),
offset_varnames = c(top = 2, left = 4), offset_labels = c(left = 3, top =2)),
spacing = spacing_increase(start = unit(0.45, "lines"), rate = 1),
margins = c(top = 0.25, bottom = 0.5),
gp = shading_hcl,
legend = T)
Now, we have some color! It looks like, for men-only and women-only colleges, there are more religiously affiliated colleges than we would expect under independence. This matches some of the observations we made when we first looked at the map.
Let’s take a less confusing look by plotting the relationship between these two variables using an interactive stacked bar chart.
library(extrafont)
library(plotly)
plot_data <- college_scorecard_cat_demographics_complete %>%
  count(RELAFFIL, SEX_SPECIFIC) %>%
  add_count(SEX_SPECIFIC, wt = n) %>%
  mutate(percentage = (n/nn))
plot_ly(x = plot_data[['SEX_SPECIFIC']], y = plot_data[['percentage']], color = plot_data[['RELAFFIL']], colors = c("#f1cb35",
"#cc8834")) %>%
add_bars() %>%
layout(barmode = "stack",
font = list(family = "Georgia"),
yaxis = list(title = "Percentage of Colleges", tickformat = "%"),
xaxis = list(title= "College Type" ))
From this chart, we can clearly see that around half of both men-only and women-only colleges are religious, whereas about 14% of co-ed colleges are religious.
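These shares can be checked by hand from the contingency table printed earlier. The sketch below re-enters those counts in base R and uses prop.table() to compute, within each college type, the proportion of schools in each affiliation group.

```r
# Counts copied from the RELAFFIL x SEX_SPECIFIC table above
relaffil_counts <- matrix(
  c(5107, 28, 19,
     843, 33, 16),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    RELAFFIL = c("No Affiliation", "Religious Affiliation"),
    SEX_SPECIFIC = c("CO-ED", "MEN ONLY", "WOMEN ONLY")
  )
)

# Column-wise proportions: each college type sums to 1
shares <- prop.table(relaffil_counts, margin = 2)

religious_share <- round(shares["Religious Affiliation", ], 3)
religious_share
```

The religious share comes out to roughly 0.14 for co-ed colleges, 0.54 for men-only colleges, and 0.46 for women-only colleges.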
To continue our examination of the relationship between sex-specificity and demographics, we will be using a logistic regression framework. Specifically, we want to answer the question “does the proportion of ethnic groups help predict sex-specificity?”
From our data set there are a number of numeric demographic variables to examine:
- UGDS_WHITE (proportion of students that are white)
- UGDS_BLACK (proportion of students that are black)
- UGDS_HISP (proportion of students that are Hispanic)
- UGDS_ASIAN (proportion of students that are Asian)
- UGDS_AIAN (proportion of students that are American Indian/Alaska Native)
- UGDS_NHPI (proportion of students that are Native Hawaiian/Pacific Islander)
- UGDS_2MOR (proportion of students that are 2 or more ethnicities)
- UGDS_NRA (proportion of students that are non-resident aliens [I may refer to them as international])
- UGDS_UNKN (proportion of students with unknown race)
With so many variables to consider, the easiest way to investigate their relationship with sex-specificity is to create a logistic regression summary table and see which variables have the most significant relationships. Because we need the variables to be binary, we will be using one of the data sets from earlier on in our project. There are a few clean-up steps like making sure the MENONLY and WOMENONLY columns are numeric and have no null values.
# Convert the MENONLY and WOMENONLY character columns to numeric
college_scorecard_full_num_demographics <- college_scorecard_clean[, c(2:3) := lapply(.SD, as.numeric), .SDcols = c(2:3)]
college_scorecard_dem_glance <- college_scorecard_full_num_demographics %>%
  select(UGDS, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN,
         UGDS_AIAN, UGDS_NHPI, UGDS_2MOR, UGDS_NRA, UGDS_UNKN)
# Visualize Missing Values
library(visdat)
vis_miss(college_scorecard_dem_glance)+
theme(axis.text.x = element_text(angle = 90, hjust = 0.65),
text = element_text(family = "Georgia"))
Now our goal is to get rid of the missing rows so that we can run our logistic regression models.
# Keep only rows with a known, non-zero undergraduate enrollment
college_scorecard_full_num_demographics2 <- college_scorecard_full_num_demographics %>%
  filter(!is.na(UGDS), UGDS != 0)
college_scorecard_dem_glance2 <- college_scorecard_full_num_demographics2 %>%
  select(UGDS, UGDS_WHITE, UGDS_BLACK, UGDS_HISP, UGDS_ASIAN,
         UGDS_AIAN, UGDS_NHPI, UGDS_2MOR, UGDS_NRA, UGDS_UNKN)
vis_miss(college_scorecard_dem_glance2)+
theme(axis.text.x = element_text(angle = 90, hjust = 0.65),
text = element_text(family = "Georgia"))
#Exclude non-operational colleges
college_scorecard_full_num_demographics_final <- college_scorecard_full_num_demographics2[CURROPER == 1]
One important thing to address is the number of colleges that were removed by these clean-up steps. We went from 6806 colleges to 5751. Because the missing values were consistent across these numeric demographics, the most likely reason they were left blank is that the researchers did not have access to the information.
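Part of that drop depends on exactly how the enrollment filter is written. As a small illustration on a made-up enrollment vector (the values are assumptions), the sketch below contrasts a plain !is.na() test with a comparison of the form x != is.na(x). The latter removes missing values but also removes zeros, because is.na() returns FALSE for known values and FALSE is treated as 0 in the comparison.

```r
# Made-up enrollment values: one zero and one missing
ugds <- c(4990, 0, NA, 351)

# Plain missing-value test: keeps the zero-enrollment entry
keep_not_na <- !is.na(ugds)

# Comparison against is.na(): TRUE, FALSE (0 != FALSE is FALSE), NA, TRUE
keep_compare <- ugds != is.na(ugds)

# which() ignores NA, mirroring how dplyr::filter() drops rows whose
# condition evaluates to NA
which(keep_not_na)
which(keep_compare)
```

Schools with zero undergraduates have no meaningful race proportions, so excluding them is reasonable for this analysis. It is still worth knowing that they are excluded along with the missing rows.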
First, we will create a table that shows the relationship between each explanatory numeric demographic variable and the MENONLY response variable.
library(broom)
#Create logistic regression model for predicting male-only colleges
demographics_mod_male1 <- tidy(glm(MENONLY ~ UGDS_WHITE, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male2 <- tidy(glm(MENONLY ~ UGDS_BLACK, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male3 <- tidy(glm(MENONLY ~ UGDS_HISP, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male4 <- tidy(glm(MENONLY ~ UGDS_ASIAN, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male5 <- tidy(glm(MENONLY ~ UGDS_AIAN, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male6 <- tidy(glm(MENONLY ~ UGDS_NHPI, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male7 <- tidy(glm(MENONLY ~ UGDS_2MOR, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male8 <- tidy(glm(MENONLY ~ UGDS_NRA, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_male9 <- tidy(glm(MENONLY ~ UGDS_UNKN, family = binomial, data = college_scorecard_full_num_demographics_final))
# Stack the nine single-predictor summaries into one table
demographics_mod_male_main <- bind_rows(demographics_mod_male1, demographics_mod_male2, demographics_mod_male3,
                                        demographics_mod_male4, demographics_mod_male5, demographics_mod_male6,
                                        demographics_mod_male7, demographics_mod_male8, demographics_mod_male9)
demographics_mod_male_main
Because we cannot visualize a logistic regression with more than 4 dimensions (3 explanatory variables and 1 response variable), we need to select the explanatory variables that are the most significant for our final model. There is also an extra incentive to get rid of some of these demographic variables because they are closely related and could create numerical instability. From this table, we can see that most of the explanatory variables independently have significant relationships with the MENONLY variable. The UGDS_NRA and UGDS_NHPI variables seem to have the least significant relationships with MENONLY, so they are dropped first.
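The nine single-predictor fits above differ only in the right-hand side of the formula, so they can also be generated in a loop. The following is a minimal sketch rather than the code used for this analysis: it runs on simulated data (the data frame `toy` and its values are made up for illustration) and uses base R's `reformulate()` and `coef(summary())` in place of `broom::tidy()`.

```r
# Simulated stand-in for the scorecard data (all values are made up).
set.seed(1)
n <- 500
toy <- data.frame(
  UGDS_WHITE = runif(n),
  UGDS_BLACK = runif(n),
  UGDS_HISP  = runif(n)
)
# Binary response loosely tied to UGDS_WHITE so the fits are well behaved.
toy$MENONLY <- rbinom(n, 1, plogis(-2 + 2 * toy$UGDS_WHITE))

predictors <- c("UGDS_WHITE", "UGDS_BLACK", "UGDS_HISP")

# Fit one logistic regression per predictor and keep only the slope row.
univariate_table <- do.call(rbind, lapply(predictors, function(p) {
  fit <- glm(reformulate(p, response = "MENONLY"), family = binomial, data = toy)
  est <- coef(summary(fit))[p, ]
  data.frame(term = p,
             estimate = est[["Estimate"]],
             std.error = est[["Std. Error"]],
             statistic = est[["z value"]],
             p.value = est[["Pr(>|z|)"]])
}))
univariate_table
```

Adding or removing a predictor then only requires editing the `predictors` vector.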
Now we will create a multivariate logistic model that takes into account all of the remaining variables to see if we can eliminate more explanatory variables.
demographics_mod_male_main2 <- glm(MENONLY ~ UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN +
                                     UGDS_AIAN + UGDS_2MOR + UGDS_UNKN, family = binomial, data = college_scorecard_full_num_demographics_final)
summary(demographics_mod_male_main2)
##
## Call:
## glm(formula = MENONLY ~ UGDS_WHITE + UGDS_BLACK + UGDS_HISP +
## UGDS_ASIAN + UGDS_AIAN + UGDS_2MOR + UGDS_UNKN, family = binomial,
## data = college_scorecard_full_num_demographics_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6941 -0.0479 -0.0128 -0.0023 5.4707
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4389 0.8093 -1.778 0.075415 .
## UGDS_WHITE 0.1384 0.8362 0.166 0.868534
## UGDS_BLACK -10.8152 3.0340 -3.565 0.000364 ***
## UGDS_HISP -4.9057 1.5689 -3.127 0.001767 **
## UGDS_ASIAN -7.5415 5.6959 -1.324 0.185497
## UGDS_AIAN -201.6041 114.6773 -1.758 0.078746 .
## UGDS_2MOR -73.5400 20.1115 -3.657 0.000256 ***
## UGDS_UNKN -65.3860 28.0082 -2.335 0.019568 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.63 on 5750 degrees of freedom
## Residual deviance: 364.70 on 5743 degrees of freedom
## AIC: 380.7
##
## Number of Fisher Scoring iterations: 13
Here, some of these variables have very large standard errors. The larger the standard error, the less precise the estimate, meaning the sample may not closely represent the population. A large standard error can be caused by a small sample size, a large standard deviation, or a combination of both. To keep things simple, we will also remove the AIAN, 2MOR, and UNKN variables because of their very large standard errors.
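One thing worth keeping in mind when reading these standard errors is that they depend on the scale of the predictor. The UGDS variables are proportions between 0 and 1, so a coefficient describes the change in log-odds for a move from 0% to 100%. For a group whose share rarely exceeds a few percent, that tends to produce large estimates and large standard errors. The sketch below uses simulated data (not the scorecard data; `toy`, `prop`, and `pct` are made-up names) to show that re-expressing a proportion in percentage points divides both the estimate and its standard error by 100 while leaving the z value unchanged.

```r
set.seed(42)
n <- 2000
# A predictor concentrated near zero, like a small demographic share.
toy <- data.frame(prop = rbeta(n, 1, 40))
toy$pct <- toy$prop * 100                       # same information, in percentage points
toy$y <- rbinom(n, 1, plogis(-1 - 20 * toy$prop))

fit_prop <- glm(y ~ prop, family = binomial, data = toy)
fit_pct  <- glm(y ~ pct,  family = binomial, data = toy)

# Compare the slope rows: estimate and std. error differ by a factor of 100,
# the z value and p-value are the same.
coef(summary(fit_prop))["prop", ]
coef(summary(fit_pct))["pct", ]
```

So a large standard error on a proportion-scale coefficient is not, by itself, evidence of a bad fit; the z value is the scale-free quantity to compare.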
demographics_mod_male3 <- glm(MENONLY ~ UGDS_WHITE + UGDS_BLACK + UGDS_HISP + UGDS_ASIAN, family = binomial, data = college_scorecard_full_num_demographics_final)
summary(demographics_mod_male3)
##
## Call:
## glm(formula = MENONLY ~ UGDS_WHITE + UGDS_BLACK + UGDS_HISP +
## UGDS_ASIAN, family = binomial, data = college_scorecard_full_num_demographics_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5521 -0.1057 -0.0300 -0.0063 6.6810
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.932 2.089 -3.318 0.000907 ***
## UGDS_WHITE 5.128 2.153 2.382 0.017219 *
## UGDS_BLACK -16.316 5.175 -3.153 0.001618 **
## UGDS_HISP -2.536 3.505 -0.724 0.469364
## UGDS_ASIAN -21.548 12.738 -1.692 0.090718 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.63 on 5750 degrees of freedom
## Residual deviance: 464.62 on 5746 degrees of freedom
## AIC: 474.62
##
## Number of Fisher Scoring iterations: 10
Now, our table shows that UGDS_ASIAN is not only statistically insignificant but also has a large standard error compared to the other three variables. UGDS_HISP also appears to be statistically insignificant. With this in mind, our final model will only incorporate two variables (UGDS_WHITE and UGDS_BLACK).
But before we do the finalizing, we should perform a couple of steps. First, we should check if there is collinearity between our two variables so that it does not create instability in our model.
#Collinearity Check
college_scorecard_full_num_demographics_final %>%
  plot_ly(y = ~UGDS_BLACK, x = ~UGDS_WHITE, colors = "#44668b", text = ~INSTNM) %>%
  add_markers()
#Correlation Check
cor(college_scorecard_full_num_demographics_final$UGDS_WHITE, college_scorecard_full_num_demographics_final$UGDS_BLACK)
## [1] -0.4719334
So it looks like the variables are moderately negatively correlated (r ≈ -0.47), but not so strongly that they should create instability.
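For a model with two predictors, the variance inflation factor follows directly from the correlation: VIF = 1 / (1 - r^2). As a rough check using the correlation reported above (the cutoff of 5 is a common rule of thumb, not something taken from this analysis):

```r
r <- -0.4719334             # correlation between UGDS_WHITE and UGDS_BLACK from above
vif <- 1 / (1 - r^2)        # variance inflation factor for a two-predictor model
vif
vif < 5                     # rule-of-thumb threshold (an assumption)
```

The result is roughly 1.3, meaning the coefficient variances are inflated by about 30% relative to uncorrelated predictors.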
Second, we should compare our model (with proportion of black and white students) to a model without black students so that we can see if the standard errors drastically change.
We are just going to rename the UGDS_BLACK variable so that our final plot is easier to read.
# Rename variables
college_scorecard_full_num_demographics_final2 <- college_scorecard_full_num_demographics_final %>%
  rename(Black_Students = UGDS_BLACK)
# Final Male Models
demographics_mod_male_final <- glm(MENONLY ~ UGDS_WHITE + Black_Students, family = binomial,
                                   data = college_scorecard_full_num_demographics_final2) # White and Black Students
demographics_mod_male_final2 <- glm(MENONLY ~ UGDS_WHITE, family = binomial,
                                    data = college_scorecard_full_num_demographics_final2) # White Students Only
summary(demographics_mod_male_final)
##
## Call:
## glm(formula = MENONLY ~ UGDS_WHITE + Black_Students, family = binomial,
## data = college_scorecard_full_num_demographics_final2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5547 -0.1080 -0.0323 -0.0089 6.8577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.667 1.343 -7.199 6.05e-13 ***
## UGDS_WHITE 7.873 1.434 5.492 3.98e-08 ***
## Black_Students -14.706 5.223 -2.816 0.00487 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.63 on 5750 degrees of freedom
## Residual deviance: 468.47 on 5748 degrees of freedom
## AIC: 474.47
##
## Number of Fisher Scoring iterations: 10
summary(demographics_mod_male_final2)
##
## Call:
## glm(formula = MENONLY ~ UGDS_WHITE, family = binomial, data = college_scorecard_full_num_demographics_final2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5465 -0.1174 -0.0404 -0.0084 5.1060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.072 1.108 -11.796 <2e-16 ***
## UGDS_WHITE 11.246 1.246 9.026 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 648.63 on 5750 degrees of freedom
## Residual deviance: 478.54 on 5749 degrees of freedom
## AIC: 482.54
##
## Number of Fisher Scoring iterations: 10
The standard errors between models don’t differ much, so we are free to use our first one (with both black and white students).
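Beyond the standard errors, the two nested models can be compared with a likelihood-ratio test built from the residual deviances printed above. This is a supplementary check, not part of the original procedure; the same comparison should be obtainable from the fitted objects with `anova(demographics_mod_male_final2, demographics_mod_male_final, test = "Chisq")`.

```r
# Residual deviances and degrees of freedom copied from the two summaries above.
dev_white_only  <- 478.54; df_white_only  <- 5749
dev_white_black <- 468.47; df_white_black <- 5748

lr_stat <- dev_white_only - dev_white_black       # drop in deviance
lr_df   <- df_white_only - df_white_black         # number of extra parameters
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(statistic = lr_stat, df = lr_df, p.value = p_value)
```

The deviance drops by about 10 on one degree of freedom, and the AIC is also lower for the two-variable model (474.47 versus 482.54), which is consistent with keeping Black_Students.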
Our final model shows that there is a significant relationship between being a men-only institution and particular proportions of certain ethnicities. Specifically, we can see that there is a positive relationship between men-only institutions and white students as well as a negative relationship between men-only institutions and black students. This finding does not seem too surprising because our map above showed that many of these institutions were seminary schools, which one can safely assume are mostly white men. But for the sake of transparency let’s take a quick look at a table of these institutions:
library(knitr)
Men_only_glance <- Men_only %>%
  select(INSTNM, UGDS_WHITE, UGDS_BLACK)
kable(Men_only_glance, col.names = c('Name', 'Prop. White', 'Prop. Black'), align = 'cc', caption = "Table 1.1 The Proportion of White and Black Students Per Institution")
Name | Prop. White | Prop. Black |
---|---|---|
Yeshiva Ohr Elchonon Chabad West Coast Talmudical Seminary | 0.9935 | 0.0000 |
St. John Vianney College Seminary | 0.3333 | 0.0476 |
Talmudic College of Florida | 0.8095 | 0.0000 |
Morehouse College | 0.0032 | 0.9433 |
Telshe Yeshiva-Chicago | 0.9870 | 0.0000 |
Saint Meinrad School of Theology | NA | NA |
Wabash College | 0.7446 | 0.0568 |
Saint Joseph Seminary College | 0.6170 | 0.0071 |
Ner Israel Rabbinical College | 0.9036 | 0.0000 |
Pope St John XXIII National Seminary | NA | NA |
Saint John’s Seminary | 0.5417 | 0.0000 |
Saint Johns University | 0.7804 | 0.0456 |
Conception Seminary College | 0.7000 | 0.0000 |
Beth Medrash Govoha | 0.9683 | 0.0000 |
Rabbinical College of America | 0.8438 | 0.0000 |
Talmudical Academy-New Jersey | 0.9655 | 0.0000 |
Beth Hamedrash Shaarei Yosher Institute | 1.0000 | 0.0000 |
Central Yeshiva Tomchei Tmimim Lubavitz | 0.8374 | 0.0000 |
Kehilath Yakov Rabbinical Seminary | 1.0000 | 0.0000 |
Machzikei Hadath Rabbinical College | 0.9933 | 0.0000 |
Mesivta Torah Vodaath Rabbinical Seminary | 1.0000 | 0.0000 |
Mesivta of Eastern Parkway-Yeshiva Zichron Meilech | 1.0000 | 0.0000 |
Mesivtha Tifereth Jerusalem of America | 0.9767 | 0.0000 |
Mirrer Yeshiva Cent Institute | 0.9323 | 0.0000 |
Ohr Hameir Theological Seminary | 0.9255 | 0.0000 |
Rabbinical Academy Mesivta Rabbi Chaim Berlin | 0.9655 | 0.0000 |
Rabbinical College Bobover Yeshiva Bnei Zion | 1.0000 | 0.0000 |
Rabbinical College Beth Shraga | 1.0000 | 0.0000 |
Rabbinical College of Long Island | 0.9872 | 0.0000 |
Rabbinical Seminary of America | 1.0000 | 0.0000 |
Sh’or Yoshuv Rabbinical College | 0.9493 | 0.0000 |
Talmudical Seminary Oholei Torah | 0.9032 | 0.0000 |
Talmudical Institute of Upstate New York | 1.0000 | 0.0000 |
Torah Temimah Talmudical Seminary | 0.9524 | 0.0000 |
United Talmudical Seminary | 0.9527 | 0.0000 |
Yeshiva Karlin Stolin | 1.0000 | 0.0000 |
Yeshiva Derech Chaim | 0.8182 | 0.0000 |
Yeshiva of Nitra Rabbinical College | 0.9433 | 0.0000 |
Yeshiva Shaar Hatorah | 1.0000 | 0.0000 |
Yeshivath Viznitz | 1.0000 | 0.0000 |
Yeshivath Zichron Moshe | 0.8933 | 0.0000 |
Pontifical College Josephinum | 0.7444 | 0.0222 |
Rabbinical College Telshe | 1.0000 | 0.0000 |
Mount Angel Seminary | 0.2500 | 0.0000 |
Saint Charles Borromeo Seminary-Overbrook | 0.6290 | 0.0968 |
Talmudical Yeshiva of Philadelphia | 0.9593 | 0.0000 |
Yeshivath Beth Moshe | 1.0000 | 0.0000 |
Hampden-Sydney College | 0.8563 | 0.0485 |
Sacred Heart Seminary and School of Theology | NA | NA |
Bais Medrash Elyon | 1.0000 | 0.0000 |
Yeshiva Gedolah of Greater Detroit | 1.0000 | 0.0000 |
Yeshivah Gedolah Rabbinical College | 0.9118 | 0.0000 |
Yeshiva Gedolah Imrei Yosef D’spinka | 1.0000 | 0.0000 |
Yeshivas Novominsk | 0.9811 | 0.0000 |
Rabbinical College of Ohr Shimon Yisroel | 1.0000 | 0.0000 |
Yeshiva D’monsey Rabbinical College | 1.0000 | 0.0000 |
Yeshiva of the Telshe Alumni | 1.0000 | 0.0000 |
Yeshiva College of the Nations Capital | 1.0000 | 0.0000 |
Yeshiva Shaarei Torah of Rockland | 0.9359 | 0.0000 |
Beis Medrash Heichal Dovid | 1.0000 | 0.0000 |
Uta Mesivta of Kiryas Joel | 0.9928 | 0.0000 |
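Returning to the final men-only model, the coefficients are easier to interpret as odds ratios. Because the predictors are proportions, a one-unit change means going from 0% to 100%, so the sketch below rescales to a 10-percentage-point change. The coefficients are copied from the summary of `demographics_mod_male_final`; the 10-point step is an arbitrary choice for illustration.

```r
b_white <- 7.873      # UGDS_WHITE coefficient from the final men-only model
b_black <- -14.706    # Black_Students coefficient from the final men-only model
step <- 0.10          # 10 percentage points (illustrative choice)

# Multiplicative change in the odds of being men-only per 10-point increase,
# holding the other predictor fixed.
odds_ratios <- exp(c(UGDS_WHITE = b_white, Black_Students = b_black) * step)
round(odds_ratios, 2)
```

Under this model, each additional 10 points of white enrollment roughly doubles the odds of a college being men-only, while each additional 10 points of Black enrollment cuts the odds to roughly a quarter.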
For women-only colleges, we will follow a similar procedure.
#Create logistic regression model for predicting female-only colleges
demographics_mod_female1 <- tidy(glm(WOMENONLY ~ UGDS_WHITE, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female2 <- tidy(glm(WOMENONLY ~ UGDS_BLACK, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female3 <- tidy(glm(WOMENONLY ~ UGDS_HISP, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female4 <- tidy(glm(WOMENONLY ~ UGDS_ASIAN, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female5 <- tidy(glm(WOMENONLY ~ UGDS_AIAN, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female6 <- tidy(glm(WOMENONLY ~ UGDS_NHPI, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female7 <- tidy(glm(WOMENONLY ~ UGDS_2MOR, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female8 <- tidy(glm(WOMENONLY ~ UGDS_NRA, family = binomial, data = college_scorecard_full_num_demographics_final))
demographics_mod_female9 <- tidy(glm(WOMENONLY ~ UGDS_UNKN, family = binomial, data = college_scorecard_full_num_demographics_final))
# Stack the nine single-predictor summaries into one table
demographics_mod_female_main <- bind_rows(demographics_mod_female1, demographics_mod_female2, demographics_mod_female3,
                                          demographics_mod_female4, demographics_mod_female5, demographics_mod_female6,
                                          demographics_mod_female7, demographics_mod_female8, demographics_mod_female9)
demographics_mod_female_main
Based on some of the standard errors in this table, we will be getting rid of UGDS_AIAN and UGDS_NHPI. Additionally, there appear to be only two variables that are significant (UGDS_NRA and UGDS_2MOR).
demographics_mod_female_main <- glm(WOMENONLY ~ UGDS_2MOR + UGDS_NRA, family = binomial, data = college_scorecard_full_num_demographics_final)
summary(demographics_mod_female_main)
##
## Call:
## glm(formula = WOMENONLY ~ UGDS_2MOR + UGDS_NRA, family = binomial,
## data = college_scorecard_full_num_demographics_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5849 -0.1113 -0.1052 -0.0994 3.2663
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.3470 0.2078 -25.737 <2e-16 ***
## UGDS_2MOR 4.5186 2.4720 1.828 0.0676 .
## UGDS_NRA 2.8175 1.2308 2.289 0.0221 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.91 on 5750 degrees of freedom
## Residual deviance: 421.33 on 5748 degrees of freedom
## AIC: 427.33
##
## Number of Fisher Scoring iterations: 8
Because UGDS_2MOR did not reach our p-value threshold, we will exclude it from the final model.
demographics_mod_female_final <- glm(WOMENONLY ~ UGDS_NRA, family = binomial, data = college_scorecard_full_num_demographics_final)
summary(demographics_mod_female_final)
##
## Call:
## glm(formula = WOMENONLY ~ UGDS_NRA, family = binomial, data = college_scorecard_full_num_demographics_final)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4166 -0.1084 -0.1057 -0.1057 3.2217
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.1841 0.1791 -28.94 <2e-16 ***
## UGDS_NRA 2.7833 1.2046 2.31 0.0209 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 426.91 on 5750 degrees of freedom
## Residual deviance: 423.62 on 5749 degrees of freedom
## AIC: 427.62
##
## Number of Fisher Scoring iterations: 8
Based on this table, we can conclude that there is a significant relationship between proportion of nonresident alien students and women-only colleges. This finding was unexpected. I think a look at what is going on would be helpful:
library(knitr)
Women_only_glance <- Women_only %>%
  select(INSTNM, UGDS_NRA)
kable(Women_only_glance, col.names = c('Name', 'Prop. Nonresident Aliens'), align = 'cc', caption = "Table 1.2 The Proportion of Nonresident Students Per Institution")
Name | Prop. Nonresident Aliens |
---|---|
Judson College | 0.0000 |
Mills College | 0.0119 |
Mount Saint Mary’s University | 0.0009 |
Scripps College | 0.0487 |
Trinity Washington University | 0.0051 |
Agnes Scott College | 0.0652 |
Brenau University | 0.0498 |
Spelman College | 0.0074 |
Wesleyan College | 0.0916 |
Saint Mary’s College | 0.0097 |
Notre Dame of Maryland University | 0.0132 |
Bay Path University | 0.0047 |
Mount Holyoke College | 0.2717 |
Simmons University | 0.0497 |
Smith College | 0.1378 |
Wellesley College | 0.1355 |
College of Saint Benedict | 0.0438 |
St Catherine University | 0.0065 |
Cottey College | 0.1218 |
Stephens College | 0.0018 |
College of Saint Mary | 0.0111 |
Barnard College | 0.0974 |
Bennett College | 0.0000 |
Meredith College | 0.0141 |
Salem College | 0.0000 |
Ursuline College | 0.0209 |
Bryn Mawr College | 0.2151 |
Cedar Crest College | 0.0890 |
Moore College of Art and Design | 0.0161 |
Converse College | 0.0571 |
Hollins University | 0.0661 |
Mary Baldwin University | 0.0072 |
Sweet Briar College | 0.0256 |
Alverno College | 0.0017 |
Mount Mary University | 0.0132 |
nra_avges <- college_scorecard_completely_prepped %>%
  select(SEX_SPECIFIC, UGDS_NRA) %>%
group_by(SEX_SPECIFIC) %>%
summarize( UGDS_NRA_AVG= mean(UGDS_NRA, na.rm = TRUE))
kable(nra_avges, col.names = c('Institution Type', 'Mean Proportion Nonresident Aliens'), align = 'cc', caption = "Table 1.3 The Average Proportion of Nonresident Students Per Type of Institution")
Institution Type | Mean Proportion Nonresident Aliens |
---|---|
CO-ED | 0.0216359 |
MEN ONLY | 0.0446603 |
WOMEN ONLY | 0.0488971 |
Although an average of 4.9% of the student body being international does not seem like much, it is more than double the co-ed average of 2.2%, so women-only colleges do appear to enroll proportionally more international students. Note that men-only colleges are close behind at 4.5%.
library(visreg)
visreg(demographics_mod_male_final, "UGDS_WHITE", by = "Black_Students",
rug = FALSE ,
band= FALSE,
xlab = "Proportion of White Students",
ylab = "Probability of Men-Only College",
line= list(col ="#1298d0"),
scale="response",
gg = TRUE)+
theme_bw()+
labs(title = "Logistic Regression Prediction of Being Men-Only College")+
theme(legend.position = "none",
text = element_text(family = "Georgia"),
plot.title = element_text(size = 15, margin = margin(b = 10), hjust = 0.5, family = "Georgia"),
strip.background = element_rect(
color="Black", fill="#1298d0", linetype="solid"))
visreg(demographics_mod_female_final, "UGDS_NRA",
rug = FALSE ,
band= FALSE,
xlab = "Proportion of Nonresident Aliens",
ylab = "Probability of Being Women-Only College",
line= list(col ="#cc8834"),
scale="response",
gg = TRUE)+
theme_bw()+
labs(title = "Logistic Regression Prediction of Women-Only College")+
theme(legend.position = "none",
text = element_text(family = "Georgia"),
plot.title = element_text(size = 15, margin = margin(b = 10), hjust = 0.5, family = "Georgia") )
As we can see from the graphs, the relationships between these
demographic variables and sex-specificity are still fairly small despite
the statistical significance of the models. Testing the accuracy or
recall of the models does not seem appropriate given how weak the
relationships are.
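To put numbers on how small these effects are, the fitted probabilities can be computed directly from the coefficients with the inverse logit, `plogis()`. The coefficients below are copied from the summary of `demographics_mod_female_final`; the three nonresident shares are arbitrary illustration points (0.27 is close to the largest value in Table 1.2).

```r
intercept <- -5.1841             # from demographics_mod_female_final
b_nra     <- 2.7833              # UGDS_NRA coefficient from the same model
nra_share <- c(0, 0.05, 0.27)    # illustrative proportions of nonresident students

# Predicted probability of being a women-only college at each share.
pred <- plogis(intercept + b_nra * nra_share)
data.frame(nra_share, predicted_probability = round(pred, 4))
```

Even at the highest share, the predicted probability stays near 1%, which reflects how rare women-only colleges are in the data.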
Admission Statistics
In the previous section, the numeric variables were the explanatory variables and sex-specificity (a categorical variable) was the response variable. In this section and for the rest of the sections, sex-specificity becomes the explanatory variable. This switch was made, firstly, because creating logistic regression models for each one of these questions would be too much of a pain. Secondly, because the variables under investigation seem to have, for the most part, small relationships with gender, I feel a logistic regression framework is best reserved for the most significant variables. What would probably work best for an exploratory analysis like this one is an ANOVA framework that lets us get a quick glance at the data using bar charts.
Our next step is to answer, “how do admission standards differ across the sex-specific colleges?”
Here are the admission variables we will look at:
- ADM_RATE (Admission Rate)
- SATVRMID (SAT Verbal Median)
- SATMTMID (SAT Math Median)
- SATWRMID (SAT Writing Median)
- SAT_AVG (SAT Overall Average)
- ACTCMMID (ACT Cumulative Median)
- ACTENMID (ACT English Median)
- ACTMTMID (ACT Math Median)
- ACTWRMID (ACT Writing Median)
Because there is a strong possibility that these variables will be collinear, it would be a good idea to take a look at the overall averages for each type of institution:
college_scorecard_completely_prepped %>%
  select(SEX_SPECIFIC,
         ADM_RATE,
         SATVRMID, SATMTMID, SAT_AVG,
         ACTCMMID, ACTENMID,
         ACTMTMID, ACTWRMID) %>%
  group_by(SEX_SPECIFIC) %>%
  summarize(ADM_RATE = mean(ADM_RATE, na.rm = TRUE),
            SATVRMID = mean(SATVRMID, na.rm = TRUE), SATMTMID = mean(SATMTMID, na.rm = TRUE), SAT_AVG = mean(SAT_AVG, na.rm = TRUE),
            ACTCMMID = mean(ACTCMMID, na.rm = TRUE), ACTENMID = mean(ACTENMID, na.rm = TRUE),
            ACTMTMID = mean(ACTMTMID, na.rm = TRUE), ACTWRMID = mean(ACTWRMID, na.rm = TRUE)) %>%
  kable(align = 'cc', caption = "Table 1.4 The Average Admission Statistics Per Type of Institution")
SEX_SPECIFIC | ADM_RATE | SATVRMID | SATMTMID | SAT_AVG | ACTCMMID | ACTENMID | ACTMTMID | ACTWRMID |
---|---|---|---|---|---|---|---|---|
CO-ED | 0.6768418 | 565.117 | 560.4017 | 1141.760 | 23.55411 | 23.28066 | 22.67185 | 7.688889 |
MEN ONLY | 0.8322167 | 575.000 | 580.0000 | 1174.200 | 24.40000 | 23.60000 | 23.80000 | 7.666667 |
WOMEN ONLY | 0.6096457 | 584.037 | 557.0370 | 1151.571 | 24.03571 | 24.48000 | 22.44000 | 8.600000 |
Women-only colleges being more selective on average is not surprising. With only 35 of them, including very well-known selective institutions like Scripps College and Barnard College, these findings were more or less expected. However, what came as a bit of a surprise was the apparent selectivity of men-only colleges. Normally, as admission rates go down, scores go up. Notice that the admission rate for men-only colleges (about 83%) is much higher than for co-ed and women-only colleges, yet their average test scores are among the highest. If men-only colleges were really more selective, we would expect an admission rate under 60 percent. Additionally, since none of the men-only colleges are as selective as the most selective women-only colleges, what is driving these averages up so high?
Let’s take a look:
admissions_table <- function(sex) {
  college_scorecard_completely_prepped %>%
    filter(SEX_SPECIFIC == sex) %>%
    select(INSTNM,
           ADM_RATE,
           SAT_AVG, SATVRMID, SATMTMID,
           ACTCMMID, ACTENMID,
           ACTMTMID, ACTWRMID) %>%
    kable(align = 'cc', caption = "Tables 1.4 Admission Statistics for Institution Type")
}
admissions_table("MEN ONLY")
INSTNM | ADM_RATE | SAT_AVG | SATVRMID | SATMTMID | ACTCMMID | ACTENMID | ACTMTMID | ACTWRMID |
---|---|---|---|---|---|---|---|---|
Yeshiva Ohr Elchonon Chabad West Coast Talmudical Seminary | 0.7113 | NA | NA | NA | NA | NA | NA | NA |
St. John Vianney College Seminary | 1.0000 | NA | NA | NA | NA | NA | NA | NA |
Talmudic College of Florida | NA | NA | NA | NA | NA | NA | NA | NA |
Morehouse College | 0.5788 | 1120 | 560 | 550 | 23 | 23 | 22 | NA |
Telshe Yeshiva-Chicago | 1.0000 | NA | NA | NA | NA | NA | NA | NA |
Saint Meinrad School of Theology | NA | NA | NA | NA | NA | NA | NA | NA |
Wabash College | 0.6497 | 1223 | 600 | 615 | 26 | 24 | 26 | 8 |
Saint Joseph Seminary College | NA | NA | NA | NA | NA | NA | NA | NA |
Ner Israel Rabbinical College | 0.8409 | NA | NA | NA | NA | NA | NA | NA |
Pope St John XXIII National Seminary | NA | NA | NA | NA | NA | NA | NA | NA |
Saint John’s Seminary | NA | NA | NA | NA | NA | NA | NA | NA |
Saint Johns University | 0.7970 | 1209 | 560 | 585 | 25 | 24 | 25 | 8 |
Conception Seminary College | 1.0000 | 1155 | NA | NA | 23 | 23 | 23 | 7 |
Beth Medrash Govoha | NA | NA | NA | NA | NA | NA | NA | NA |
Rabbinical College of America | 0.8732 | NA | NA | NA | NA | NA | NA | NA |
Talmudical Academy-New Jersey | 0.8333 | NA | NA | NA | NA | NA | NA | NA |
Beth Hamedrash Shaarei Yosher Institute | 0.7879 | NA | NA | NA | NA | NA | NA | NA |
Central Yeshiva Tomchei Tmimim Lubavitz | 1.0000 | NA | NA | NA | NA | NA | NA | NA |
Kehilath Yakov Rabbinical Seminary | NA | NA | NA | NA | NA | NA | NA | NA |
Machzikei Hadath Rabbinical College | 0.9000 | NA | NA | NA | NA | NA | NA | NA |
Mesivta Torah Vodaath Rabbinical Seminary | 0.6250 | NA | NA | NA | NA | NA | NA | NA |
Mesivta of Eastern Parkway-Yeshiva Zichron Meilech | 0.8750 | NA | NA | NA | NA | NA | NA | NA |
Mesivtha Tifereth Jerusalem of America | 0.9048 | NA | NA | NA | NA | NA | NA | NA |
Mirrer Yeshiva Cent Institute | 0.7500 | NA | NA | NA | NA | NA | NA | NA |
Ohr Hameir Theological Seminary | NA | NA | NA | NA | NA | NA | NA | NA |
Rabbinical Academy Mesivta Rabbi Chaim Berlin | 1.0000 | NA | NA | NA | NA | NA | NA | NA |
Rabbinical College Bobover Yeshiva Bnei Zion | NA | NA | NA | NA | NA | NA | NA | NA |
Rabbinical College Beth Shraga | 1.0000 | NA | NA | NA | NA | NA | NA | NA |
Rabbinical College of Long Island | 0.9778 | NA | NA | NA | NA | NA | NA | NA |
Rabbinical Seminary of America | 0.9873 | NA | NA | NA | NA | NA | NA | NA |
Sh’or Yoshuv Rabbinical College | 0.6667 | NA | NA | NA | NA | NA | NA | NA |
Talmudical Seminary Oholei Torah | 0.9429 | NA | NA | NA | NA | NA | NA | NA |
Talmudical Institute of Upstate New York | NA | NA | NA | NA | NA | NA | NA | NA |
Torah Temimah Talmudical Seminary | 0.8000 | NA | NA | NA | NA | NA | NA | NA |
United Talmudical Seminary | NA | NA | NA | NA | NA | NA | NA | NA |
Yeshiva Karlin Stolin | 0.9574 | NA | NA | NA | NA | NA | NA | NA |
Yeshiva Derech Chaim | 0.4861 | NA | NA | NA | NA | NA | NA | NA |
Yeshiva of Nitra Rabbinical College | NA | NA | NA | NA | NA | NA | NA | NA |
Yeshiva Shaar Hatorah | 0.8056 | NA | NA | NA | NA | NA | NA | NA |
Yeshivath Viznitz | 0.9909 | NA | NA | NA | NA | NA | NA | NA |
Yeshivath Zichron Moshe | 0.4800 | NA | NA | NA | NA | NA | NA | NA |
Pontifical College Josephinum | 0.8333 | NA | NA | NA | NA | NA | NA | NA |
Rabbinical College Telshe | 0.8400 | NA | NA | NA | NA | NA | NA | NA |
Mount Angel Seminary | 1.0000 | NA | NA | NA | NA | NA | NA | NA |
Saint Charles Borromeo Seminary-Overbrook | 0.9130 | NA | NA | NA | NA | NA | NA | NA |
Talmudical Yeshiva of Philadelphia | 0.8444 | NA | NA | NA | NA | NA | NA | NA |
Yeshivath Beth Moshe | 0.9091 | NA | NA | NA | NA | NA | NA | NA |
Hampden-Sydney College | 0.5903 | 1164 | 580 | 570 | 25 | 24 | 23 | NA |
Sacred Heart Seminary and School of Theology | NA | NA | NA | NA | NA | NA | NA | NA |
Bais Medrash Elyon | 0.8000 | NA | NA | NA | NA | NA | NA | NA |
Yeshiva Gedolah of Greater Detroit | 0.9375 | NA | NA | NA | NA | NA | NA | NA |
Yeshivah Gedolah Rabbinical College | NA | NA | NA | NA | NA | NA | NA | NA |
Yeshiva Gedolah Imrei Yosef D’spinka | NA | NA | NA | NA | NA | NA | NA | NA |
Yeshivas Novominsk | 0.9833 | NA | NA | NA | NA | NA | NA | NA |
Rabbinical College of Ohr Shimon Yisroel | NA | NA | NA | NA | NA | NA | NA | NA |
Yeshiva D’monsey Rabbinical College | 0.5313 | NA | NA | NA | NA | NA | NA | NA |
Yeshiva of the Telshe Alumni | 0.7857 | NA | NA | NA | NA | NA | NA | NA |
Yeshiva College of the Nations Capital | NA | NA | NA | NA | NA | NA | NA | NA |
Yeshiva Shaarei Torah of Rockland | NA | NA | NA | NA | NA | NA | NA | NA |
Beis Medrash Heichal Dovid | 0.7636 | NA | NA | NA | NA | NA | NA | NA |
Uta Mesivta of Kiryas Joel | NA | NA | NA | NA | NA | NA | NA | NA |
admissions_table("WOMEN ONLY")
INSTNM | ADM_RATE | SAT_AVG | SATVRMID | SATMTMID | ACTCMMID | ACTENMID | ACTMTMID | ACTWRMID |
---|---|---|---|---|---|---|---|---|
Judson College | 0.4820 | 1054 | 568 | 530 | 20 | 21 | 19 | NA |
Mills College | 0.8554 | 1158 | 577 | 548 | 25 | 26 | 23 | NA |
Mount Saint Mary’s University | 0.8382 | 1031 | 530 | 500 | 20 | NA | NA | NA |
Scripps College | 0.2424 | 1409 | 700 | 690 | 32 | 34 | 29 | 9 |
Trinity Washington University | 0.9512 | NA | NA | NA | NA | NA | NA | NA |
Agnes Scott College | 0.7040 | NA | NA | NA | NA | NA | NA | NA |
Brenau University | 0.5688 | 1031 | 525 | 495 | 20 | 20 | 19 | NA |
Spelman College | 0.3934 | 1165 | 595 | 555 | 24 | 24 | 23 | NA |
Wesleyan College | 0.4787 | 1030 | 535 | 485 | 20 | 20 | 18 | NA |
Saint Mary’s College | 0.8190 | 1186 | 595 | 560 | 25 | 26 | 24 | NA |
Notre Dame of Maryland University | 0.8763 | 1069 | 535 | 500 | 24 | NA | NA | NA |
Bay Path University | 0.6041 | NA | NA | NA | NA | NA | NA | NA |
Mount Holyoke College | 0.5091 | 1395 | 680 | 715 | 31 | 33 | 29 | NA |
Simmons University | 0.6972 | 1228 | 620 | 595 | 27 | 28 | 25 | NA |
Smith College | 0.3095 | NA | NA | NA | NA | NA | NA | NA |
Wellesley College | 0.1954 | 1435 | 705 | 720 | 32 | 34 | 30 | 9 |
College of Saint Benedict | 0.8338 | 1195 | 560 | 540 | 25 | 25 | 24 | 8 |
St Catherine University | 0.7325 | 1124 | 619 | 572 | 22 | 21 | 21 | NA |
Cottey College | 0.3804 | 1096 | 535 | 505 | 22 | 22 | 20 | NA |
Stephens College | 0.5886 | 1138 | 575 | 515 | 23 | 23 | 20 | NA |
College of Saint Mary | 0.5210 | NA | NA | NA | NA | NA | NA | NA |
Barnard College | 0.1392 | 1422 | 705 | 710 | 32 | 34 | 30 | NA |
Bennett College | 0.9634 | 911 | 465 | 435 | 17 | 16 | 17 | NA |
Meredith College | 0.6257 | 1117 | 560 | 540 | 23 | 21 | 22 | NA |
Salem College | 0.3947 | 1183 | 585 | 570 | 25 | 27 | 23 | NA |
Ursuline College | 0.9034 | 1102 | 550 | 540 | 22 | 21 | 21 | 8 |
Bryn Mawr College | 0.3408 | NA | NA | NA | NA | NA | NA | 9 |
Cedar Crest College | 0.5719 | 1065 | 540 | 515 | 22 | 22 | 21 | NA |
Moore College of Art and Design | 0.5195 | 1194 | 615 | 565 | 27 | NA | NA | NA |
Converse College | 0.5852 | 1129 | 570 | 550 | 23 | 23 | 21 | NA |
Hollins University | 0.6392 | 1204 | 610 | 565 | 27 | 28 | 22 | NA |
Mary Baldwin University | 0.9991 | 1053 | 540 | 500 | 21 | 21 | 20 | NA |
Sweet Briar College | 0.7569 | 1120 | 575 | 525 | 23 | 24 | 22 | NA |
Alverno College | 0.6978 | NA | NA | NA | NA | NA | NA | NA |
Mount Mary University | 0.6198 | 1000 | NA | NA | 19 | 18 | 18 | NA |
I knew it! Something was indeed going on here! Because most of the men-only colleges have NAs for their admission test statistics, only 5 of 61 contributed their numbers to those averages. This means that while most of the women-only colleges were included in their averages, only the more selective men-only colleges were included in theirs. This also makes sense: if most of your colleges are seminary schools, how could your average be so high?
This finding actually makes our job easier because the only valid variable to use is admission rate given the scarcity of data for the admission tests.
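The pitfall here is that `mean(..., na.rm = TRUE)` silently averages over whatever values are present, so a group mean can rest on very few institutions. A simple guard is to report the number of non-missing values next to each mean. The sketch below uses a small made-up data frame (`toy`; the scores are invented) and base R only.

```r
# Invented example: one of four men-only rows reports a score.
toy <- data.frame(
  SEX_SPECIFIC = c("MEN ONLY", "MEN ONLY", "MEN ONLY", "MEN ONLY",
                   "WOMEN ONLY", "WOMEN ONLY", "WOMEN ONLY"),
  SAT_AVG = c(1200, NA, NA, NA, 1050, 1150, NA)
)

# For each group: total rows, rows with a reported score, and the NA-dropping mean.
coverage <- do.call(rbind, lapply(split(toy, toy$SEX_SPECIFIC), function(g) {
  data.frame(SEX_SPECIFIC = g$SEX_SPECIFIC[1],
             n_total = nrow(g),
             n_reported = sum(!is.na(g$SAT_AVG)),
             SAT_AVG_mean = mean(g$SAT_AVG, na.rm = TRUE))
}))
coverage
```

In a dplyr pipeline like the ones used here, the same idea amounts to adding something like `n_reported = sum(!is.na(SAT_AVG))` inside `summarize()`.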
summary( lm( ADM_RATE ~ SEX_SPECIFIC, data= college_scorecard_completely_prepped ))
##
## Call:
## lm(formula = ADM_RATE ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.67684 -0.12459 0.02196 0.16496 0.38945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.676842 0.004976 136.027 < 2e-16 ***
## SEX_SPECIFICMEN ONLY 0.155375 0.033817 4.595 4.61e-06 ***
## SEX_SPECIFICWOMEN ONLY -0.067196 0.036978 -1.817 0.0693 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2168 on 1972 degrees of freedom
## (4071 observations deleted due to missingness)
## Multiple R-squared: 0.0124, Adjusted R-squared: 0.01139
## F-statistic: 12.38 on 2 and 1972 DF, p-value: 4.556e-06
Here, we can see that men-only institutions have a significantly higher admission rate than co-ed institutions (about 15.5 percentage points higher, p < 0.001). Women-only colleges did not cross the 0.05 significance threshold (p = 0.069), but the estimate points to a lower admission rate (about 6.7 percentage points lower than co-ed institutions).
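With a single categorical predictor, this linear model is a one-way ANOVA in disguise: the intercept is the mean of the reference group (CO-ED), and each coefficient is the difference between a group's mean and that reference. As a check, adding the coefficients from the summary above recovers the ADM_RATE column of Table 1.4:

```r
intercept <- 0.676842                                   # mean admission rate for CO-ED
coefs <- c("MEN ONLY" = 0.155375, "WOMEN ONLY" = -0.067196)

# Group means implied by the regression coefficients.
group_means <- c("CO-ED" = intercept, intercept + coefs)
round(group_means, 4)
```

The implied means match the table (0.6768, 0.8322, and 0.6096), so the regression adds standard errors and p-values to the same comparison rather than a different estimate.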
Majors
For this section, I had two interesting questions:
- Does attending a sex-specific school affect degree outcomes?
- Does going to a women-only college affect STEM degree outcomes?
Because the US Department of Education collects data on the percentage of students that graduate with particular majors, there are a lot of variables to work with!
However, since we are only interested in major academic divisions, we should be able to fuse most of these percentages and compare them that way. I decided to group them by three major divisions: Liberal Arts, STEM, and Professional/Vocational/Other.
Disclaimer: Many of these majors could have fit into a couple of groups, but in the interest of keeping things simple, I based my decisions mostly on this Wikipedia page about academic fields.
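Here is the breakdown of the variables:
Liberal Arts
- PCIP05 (area, ethnic, and cultural studies)
- PCIP13 (education)
- PCIP16 (foreign languages)
- PCIP23 (English)
- PCIP24 (general liberal arts and humanities)
- PCIP38 (philosophy and religious studies)
- PCIP39 (theology)
- PCIP42 (psychology)
- PCIP45 (social sciences)
- PCIP50 (visual and performing arts)
- PCIP54 (history)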
STEM
- PCIP01 (agriculture)
- PCIP03 (Conservation)
- PCIP11 (computer science)
- PCIP14 (engineering)
- PCIP15 (engineering tech)
- PCIP26 (bio)
- PCIP27 (math)
- PCIP29 (applied science)
- PCIP40 (physics)
- PCIP41 (tech science)
- PCIP51 (health)
Professions, Vocations, and Other (everything I couldn't classify in the other two)
- PCIP04 (architecture)
- PCIP09 (journalism/communication)
- PCIP10 (communications technology)
- PCIP12 (culinary)
- PCIP19 (consumer/human sciences)
- PCIP22 (legal)
- PCIP25 (library)
- PCIP30 (interdisciplinary)
- PCIP31 (recreation and fitness)
- PCIP43 (law enforcement/protective services)
- PCIP44 (social service)
- PCIP46 (construction)
- PCIP47 (mechanic)
- PCIP48 (precision production)
- PCIP49 (transportation/logistics)
- PCIP52 (business)
college_scorecard_majors <- college_scorecard_completely_prepped %>%
transmute(INSTNM,
SEX_SPECIFIC,
LIB_ARTS = PCIP05 + PCIP13 + PCIP16 + PCIP23 + PCIP24 + PCIP38 + PCIP39 + PCIP42 + PCIP45 + PCIP50 + PCIP54,
STEM = PCIP01 + PCIP03 + PCIP11 + PCIP14 + PCIP15 + PCIP26 + PCIP27 + PCIP29 + PCIP40 + PCIP41 + PCIP51,
PROF_VOC_OTHERS = PCIP04 + PCIP09 + PCIP10 + PCIP12 + PCIP19 + PCIP22 + PCIP25 +
PCIP30 + PCIP31 + PCIP43 + PCIP44 + PCIP46 + PCIP47 + PCIP48 + PCIP49 + PCIP52)
kable(col.names = c("Name", 'Institution Type', 'Liberal Arts', 'STEM', "Professional & Vocation"), head(college_scorecard_majors),
align = 'cc', caption = "Table 1.5: Sample Proportion of Degrees Conferred Per Institution")
Name | Institution Type | Liberal Arts | STEM | Professional & Vocation |
---|---|---|---|---|
Alabama A & M University | CO-ED | 0.2564 | 0.3944 | 0.3490 |
University of Alabama at Birmingham | CO-ED | 0.2751 | 0.4438 | 0.2813 |
Amridge University | CO-ED | 0.2462 | 0.0000 | 0.7538 |
University of Alabama in Huntsville | CO-ED | 0.1559 | 0.5921 | 0.2520 |
Alabama State University | CO-ED | 0.2661 | 0.3493 | 0.3846 |
The University of Alabama | CO-ED | 0.1945 | 0.2968 | 0.5086 |
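If the three groups are meant to cover every reported field of study, each row should add up to roughly 1 (the first sample row gives 0.2564 + 0.3944 + 0.3490 = 0.9998). The sketch below shows one way to check this; `majors_toy`, its values, and the 0.01 tolerance are all assumptions made for illustration.

```r
# Hypothetical rows in the shape of college_scorecard_majors (values are made up).
majors_toy <- data.frame(
  INSTNM          = c("College A", "College B", "College C"),
  LIB_ARTS        = c(0.25, 0.10, NA),
  STEM            = c(0.40, 0.30, 0.50),
  PROF_VOC_OTHERS = c(0.35, 0.30, 0.50)
)

share_cols <- c("LIB_ARTS", "STEM", "PROF_VOC_OTHERS")

# rowSums() returns NA for a row when any of its shares is NA.
majors_toy$TOTAL <- rowSums(majors_toy[, share_cols])

# Flag rows whose shares are missing or do not add up to about 1.
# The 0.01 tolerance is an arbitrary choice for this sketch.
majors_toy$NEEDS_REVIEW <- is.na(majors_toy$TOTAL) | abs(majors_toy$TOTAL - 1) > 0.01

majors_toy[, c("INSTNM", "TOTAL", "NEEDS_REVIEW")]
```

Rows flagged this way are worth a look before modelling, since a missing share in any single field makes the whole group sum missing.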
summary(lm(LIB_ARTS ~ SEX_SPECIFIC, data = college_scorecard_majors))
##
## Call:
## lm(formula = LIB_ARTS ~ SEX_SPECIFIC, data = college_scorecard_majors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9298 -0.2005 -0.1551 0.1407 0.7996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.200534 0.003495 57.374 < 2e-16 ***
## SEX_SPECIFICMEN ONLY 0.729223 0.034722 21.002 < 2e-16 ***
## SEX_SPECIFICWOMEN ONLY 0.263843 0.044608 5.915 3.52e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2631 on 5756 degrees of freedom
## (287 observations deleted due to missingness)
## Multiple R-squared: 0.0761, Adjusted R-squared: 0.07578
## F-statistic: 237.1 on 2 and 5756 DF, p-value: < 2.2e-16
summary(lm(STEM ~ SEX_SPECIFIC, data = college_scorecard_majors))
##
## Call:
## lm(formula = STEM ~ SEX_SPECIFIC, data = college_scorecard_majors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34488 -0.34488 -0.06808 0.17039 0.65512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.344879 0.004443 77.621 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -0.327114 0.044139 -7.411 1.44e-13 ***
## SEX_SPECIFICWOMEN ONLY -0.011948 0.056706 -0.211 0.833
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3344 on 5756 degrees of freedom
## (287 observations deleted due to missingness)
## Multiple R-squared: 0.009455, Adjusted R-squared: 0.009111
## F-statistic: 27.47 on 2 and 5756 DF, p-value: 1.335e-12
summary(lm(PROF_VOC_OTHERS ~ SEX_SPECIFIC, data = college_scorecard_majors))
##
## Call:
## lm(formula = PROF_VOC_OTHERS ~ SEX_SPECIFIC, data = college_scorecard_majors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4500 -0.2691 -0.0723 0.3030 0.5501
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.449999 0.004682 96.114 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -0.432007 0.046512 -9.288 < 2e-16 ***
## SEX_SPECIFICWOMEN ONLY -0.247279 0.059754 -4.138 3.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3524 on 5756 degrees of freedom
## (287 observations deleted due to missingness)
## Multiple R-squared: 0.01755, Adjusted R-squared: 0.0172
## F-statistic: 51.4 on 2 and 5756 DF, p-value: < 2.2e-16
The liberal arts model shows that both men-only and women-only institutions have a positive relationship with proportion of liberal arts degrees. This makes sense because many of these institutions (especially men-only) award religious and humanities degrees. For the STEM model, both men-only and women-only colleges have a negative relationship with percentage of STEM degrees. However, only the men-only relationship with STEM degrees is significant. Lastly, both men-only and women-only institutions have significant negative relationships with the percentage of professional and vocational degrees (men-only more significant).
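Because each model has only the college type as a predictor, the coefficients can be turned back into implied average shares per college type by adding each estimate to the intercept. The sketch below does that arithmetic with the estimates copied from the three summaries above; the object names are just illustrative.

```r
# Estimates copied from the three model summaries above.
intercepts <- c(LIB_ARTS = 0.200534, STEM = 0.344879, PROF_VOC_OTHERS = 0.449999)
men_only   <- c(LIB_ARTS = 0.729223, STEM = -0.327114, PROF_VOC_OTHERS = -0.432007)
women_only <- c(LIB_ARTS = 0.263843, STEM = -0.011948, PROF_VOC_OTHERS = -0.247279)

# Intercept = co-ed average; intercept + coefficient = average for that college type.
implied_means <- rbind(
  "CO-ED"      = intercepts,
  "MEN ONLY"   = intercepts + men_only,
  "WOMEN ONLY" = intercepts + women_only
)

round(implied_means, 3)
```

The implied liberal arts share comes out at roughly 0.93 for men-only colleges and 0.46 for women-only colleges, against about 0.20 for co-ed institutions.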
Let’s take a look at what this looks like visually:
college_scorecard_majors %>%
group_by(SEX_SPECIFIC) %>%
summarize(mean_stem = mean(STEM, na.rm= TRUE), mean_lb_arts = mean(LIB_ARTS, na.rm= TRUE),mean_prof_voc = mean(PROF_VOC_OTHERS, na.rm= TRUE)) %>%
plot_ly(x = ~SEX_SPECIFIC, y = ~mean_stem, type= 'bar', name= 'STEM Degrees', marker = list(color = "#44668b")) %>%
add_trace(y = ~mean_lb_arts, name= 'Liberal Arts Degrees', marker = list(color = "#f1cb35"))%>%
add_trace(y = ~mean_prof_voc, name= 'Professional and Vocational Degrees', marker = list(color = "#bdb8b8"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Proportion of Degrees Conferred"),
font = list(family = "Georgia"))
I think it is pretty safe to say that if you are attending one of these male-only colleges, you are probably going to end up with a liberal arts degree.
One limitation of these methods is that we could not get data showing degrees conferred by gender at co-ed institutions. Although co-ed institutions award a slightly higher share of STEM degrees than women-only colleges (a difference that is not statistically significant), we need to keep in mind that this is the average across both genders. Because men generally earn more STEM degrees than women, we could hypothesize that the proportion of STEM degrees conferred to women at co-ed colleges is lower than what is seen above.
Earning Outcomes
Here, we want to know if earning outcomes are affected by sex-specific schools.
Variables:
- MN_EARN_WNE_P10 (Mean earnings of students working and not enrolled 10 years after entry)
- MD_EARN_WNE_P10 (Median earnings of students working and not enrolled 10 years after entry)
earning_outcomes_table <- bind_rows(
tidy(lm(MN_EARN_WNE_P10 ~ SEX_SPECIFIC, data= college_scorecard_completely_prepped)),
tidy(lm(MD_EARN_WNE_P10~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
)
earning_outcomes_table
From this table, it looks like both mean and median earning outcomes are positively associated with women-only colleges, albeit only slightly. Both measurements are negatively associated with men-only colleges, but neither relationship is statistically significant. One thing to note is that the median outcome for men-only colleges comes close to the significance threshold.
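When the coefficient tables of two models are stacked, it is easy to lose track of which rows belong to the mean model and which to the median model. The following base-R sketch shows one way to label each row with its outcome; the `toy` data and the helper `coef_table()` are hypothetical and only stand in for the real data set and the tidy-based table.

```r
# Toy data: hypothetical earnings, NOT from the College Scorecard.
toy <- data.frame(
  SEX_SPECIFIC    = factor(rep(c("CO-ED", "MEN ONLY", "WOMEN ONLY"), each = 3)),
  MN_EARN_WNE_P10 = c(40000, 45000, 50000, 38000, 41000, 44000, 47000, 52000, 54000),
  MD_EARN_WNE_P10 = c(36000, 40000, 44000, 35000, 37000, 39000, 42000, 46000, 47000)
)

# Hypothetical helper: fit outcome ~ SEX_SPECIFIC and return the coefficient
# table as a data frame with an extra column naming the outcome.
coef_table <- function(outcome, data) {
  fit <- lm(reformulate("SEX_SPECIFIC", response = outcome), data = data)
  est <- coef(summary(fit))  # matrix: Estimate, Std. Error, t value, Pr(>|t|)
  data.frame(outcome = outcome, term = rownames(est), est,
             row.names = NULL, check.names = FALSE)
}

earning_outcomes_toy <- rbind(
  coef_table("MN_EARN_WNE_P10", toy),
  coef_table("MD_EARN_WNE_P10", toy)
)

earning_outcomes_toy
```

Each model contributes three rows (intercept, men-only, women-only), and the `outcome` column keeps the two sets apart.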
college_scorecard_completely_prepped %>%
group_by(SEX_SPECIFIC) %>%
summarize(avg_mean = mean(MN_EARN_WNE_P10, na.rm= TRUE),
avg_md = mean(MD_EARN_WNE_P10, na.rm= TRUE)) %>%
plot_ly(x = ~SEX_SPECIFIC, y = ~avg_mean, type= 'bar', name= 'Mean Earning ', marker = list(color = "#44668b")) %>%
add_trace(y = ~avg_md, name= 'Median Earnings', marker = list(color = "#f1cb35"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Average Earnings After 10 Years"),
font = list(family = "Georgia"))
Completion
Next, does being at a sex-specific school affect completion rates for different ethnic groups?
Variables (each is the completion rate for first-time, full-time students at four-year institutions, measured at 150% of expected time to completion, for the group named):
- C150_4_WHITE (white students)
- C150_4_BLACK (black students)
- C150_4_HISP (Hispanic students)
- C150_4_ASIAN (Asian students)
- C150_4_AIAN (American Indian/Alaska Native students)
- C150_4_NHPI (Native Hawaiian/Pacific Islander students)
- C150_4_2MOR (students of two or more races)
- C150_4_NRA (non-resident alien students)
- C150_4_UNKN (students whose race is unknown)
summary(lm(C150_4_WHITE~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_WHITE ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64224 -0.14239 0.01976 0.15054 0.61240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.542386 0.004963 109.294 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -0.154786 0.031111 -4.975 7.04e-07 ***
## SEX_SPECIFICWOMEN ONLY 0.099857 0.039238 2.545 0.011 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2236 on 2113 degrees of freedom
## (3930 observations deleted due to missingness)
## Multiple R-squared: 0.01481, Adjusted R-squared: 0.01387
## F-statistic: 15.88 on 2 and 2113 DF, p-value: 1.433e-07
summary(lm(C150_4_BLACK~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_BLACK ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52134 -0.18285 -0.02405 0.16015 0.59495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.405052 0.005715 70.875 < 2e-16 ***
## SEX_SPECIFICMEN ONLY 0.184268 0.112312 1.641 0.10102
## SEX_SPECIFICWOMEN ONLY 0.116285 0.042778 2.718 0.00662 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2508 on 1963 degrees of freedom
## (4080 observations deleted due to missingness)
## Multiple R-squared: 0.005079, Adjusted R-squared: 0.004065
## F-statistic: 5.01 on 2 and 1963 DF, p-value: 0.006753
summary(lm(C150_4_HISP~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_HISP ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58758 -0.16029 -0.00389 0.16911 0.52788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.476089 0.005599 85.024 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -0.003972 0.071672 -0.055 0.95581
## SEX_SPECIFICWOMEN ONLY 0.111494 0.042211 2.641 0.00832 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2475 on 1998 degrees of freedom
## (4045 observations deleted due to missingness)
## Multiple R-squared: 0.003483, Adjusted R-squared: 0.002486
## F-statistic: 3.492 on 2 and 1998 DF, p-value: 0.03063
summary(lm(C150_4_ASIAN~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_ASIAN ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65546 -0.17798 0.03312 0.23202 0.52488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.566881 0.007266 78.020 <2e-16 ***
## SEX_SPECIFICMEN ONLY -0.091765 0.122158 -0.751 0.453
## SEX_SPECIFICWOMEN ONLY 0.088580 0.054137 1.636 0.102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2987 on 1724 degrees of freedom
## (4319 observations deleted due to missingness)
## Multiple R-squared: 0.001888, Adjusted R-squared: 0.0007303
## F-statistic: 1.631 on 2 and 1724 DF, p-value: 0.1961
summary(lm(C150_4_AIAN~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_AIAN ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43186 -0.43186 -0.04886 0.31814 0.64444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43186 0.01007 42.887 <2e-16 ***
## SEX_SPECIFICMEN ONLY 0.17529 0.26375 0.665 0.506
## SEX_SPECIFICWOMEN ONLY -0.07630 0.09676 -0.789 0.430
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3727 on 1384 degrees of freedom
## (4659 observations deleted due to missingness)
## Multiple R-squared: 0.0007709, Adjusted R-squared: -0.0006731
## F-statistic: 0.5339 on 2 and 1384 DF, p-value: 0.5865
Here is the summary of the summary tables:
White Student Completion Rate:
- significant negative relationship for men-only
- slightly significant positive relationship for women-only
Black Student Completion Rate:
- moderately significant positive relationship for women-only
Hispanic Student Completion Rate:
- moderately significant positive relationship for women-only
Asian Student Completion Rate:
- no significant relationship for either men-only or women-only colleges
American Indian/Alaska Native Student Completion Rate:
- no significant relationship for either men-only or women-only colleges
Here is what it looks like visually:
college_scorecard_completely_prepped %>%
group_by(SEX_SPECIFIC) %>%
summarize(avg_completion_w = mean(C150_4_WHITE, na.rm= TRUE),
avg_completion_b = mean(C150_4_BLACK, na.rm= TRUE),
avg_completion_h = mean(C150_4_HISP, na.rm= TRUE),
avg_completion_as = mean(C150_4_ASIAN, na.rm= TRUE),
avg_completion_ai = mean(C150_4_AIAN, na.rm= TRUE)) %>%
plot_ly(x = ~SEX_SPECIFIC, y = ~avg_completion_w, type= 'bar', name= 'White', marker = list(color = "#44668b")) %>%
add_trace(y = ~avg_completion_b, name= 'Black', marker = list(color = "#f1cb35"))%>%
add_trace(y = ~avg_completion_h, name= 'Hispanic', marker = list(color = "#1298d0"))%>%
add_trace(y = ~avg_completion_as, name= 'Asian', marker = list(color = "#cc8834"))%>%
add_trace(y = ~avg_completion_ai, name= 'American Indian/ Alaska Native', marker = list(color = "#bdb8b8"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Average Completion Rate"),
font = list(family = "Georgia"))
summary(lm(C150_4_NHPI~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_NHPI ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61904 -0.44404 -0.04404 0.55596 0.55596
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.44404 0.01384 32.076 <2e-16 ***
## SEX_SPECIFICWOMEN ONLY 0.17501 0.15521 1.128 0.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.409 on 878 degrees of freedom
## (5166 observations deleted due to missingness)
## Multiple R-squared: 0.001446, Adjusted R-squared: 0.0003085
## F-statistic: 1.271 on 1 and 878 DF, p-value: 0.2598
summary(lm(C150_4_2MOR~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_2MOR ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55539 -0.20622 0.02708 0.19378 0.52708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.472925 0.006913 68.414 <2e-16 ***
## SEX_SPECIFICMEN ONLY 0.227075 0.198793 1.142 0.2535
## SEX_SPECIFICWOMEN ONLY 0.082466 0.049396 1.669 0.0952 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.281 on 1684 degrees of freedom
## (4359 observations deleted due to missingness)
## Multiple R-squared: 0.002413, Adjusted R-squared: 0.001228
## F-statistic: 2.037 on 2 and 1684 DF, p-value: 0.1308
summary(lm(C150_4_NRA~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_NRA ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.64515 -0.17095 0.03585 0.21235 0.71789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.587646 0.007393 79.491 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -0.305537 0.062997 -4.850 1.36e-06 ***
## SEX_SPECIFICWOMEN ONLY 0.057507 0.054682 1.052 0.293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2867 on 1550 degrees of freedom
## (4493 observations deleted due to missingness)
## Multiple R-squared: 0.01575, Adjusted R-squared: 0.01448
## F-statistic: 12.4 on 2 and 1550 DF, p-value: 4.551e-06
summary(lm(C150_4_UNKN~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = C150_4_UNKN ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59045 -0.22123 0.00957 0.21547 0.50957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.490429 0.007379 66.459 <2e-16 ***
## SEX_SPECIFICMEN ONLY -0.144262 0.171801 -0.840 0.4012
## SEX_SPECIFICWOMEN ONLY 0.100025 0.058769 1.702 0.0889 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2973 on 1649 degrees of freedom
## (4394 observations deleted due to missingness)
## Multiple R-squared: 0.002189, Adjusted R-squared: 0.0009787
## F-statistic: 1.809 on 2 and 1649 DF, p-value: 0.1642
Another summary of the summaries:
Native Hawaiian/Pacific Islander Completion Rate:
- no significant relationship for women-only colleges (the model has no men-only coefficient because no men-only college reports this rate)
2 or More Races Completion Rate:
- no significant relationship for either men-only or women-only colleges
Nonresident Aliens Completion Rate:
- very significant negative relationship for men-only
Unknown Completion Rate:
- no significant relationship for either men-only or women-only colleges
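Several of the wide standard errors in these models, and the missing men-only row in the Native Hawaiian/Pacific Islander output, likely come down to how few single-sex colleges report a given rate. A quick way to check is to count non-missing values per college type for each outcome. The sketch below does this on made-up data; the `toy` data frame and its values are assumptions for illustration only.

```r
# Toy data: hypothetical completion rates with missing values (NOT scorecard data).
toy <- data.frame(
  SEX_SPECIFIC = factor(c("CO-ED", "CO-ED", "CO-ED",
                          "MEN ONLY", "MEN ONLY",
                          "WOMEN ONLY", "WOMEN ONLY")),
  C150_4_WHITE = c(0.55, 0.60, NA, 0.40, 0.45, 0.65, 0.70),
  C150_4_NHPI  = c(0.50, NA, 0.40, NA, NA, 0.60, NA)
)

outcomes <- c("C150_4_WHITE", "C150_4_NHPI")

# For each outcome, count the institutions of each college type that report a value.
n_reported <- sapply(outcomes, function(v) {
  sapply(split(!is.na(toy[[v]]), toy$SEX_SPECIFIC), sum)
})

n_reported
```

A count of zero for a college type means `lm()` has nothing to estimate for that level, and a count of one or two means any estimate will be very noisy.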
Here is the visual:
college_scorecard_completely_prepped %>%
group_by(SEX_SPECIFIC) %>%
summarize(avg_completion_nhpi = mean(C150_4_NHPI, na.rm= TRUE),
avg_completion_2M = mean(C150_4_2MOR, na.rm= TRUE),
avg_completion_nra = mean(C150_4_NRA, na.rm= TRUE),
avg_completion_u = mean(C150_4_UNKN, na.rm= TRUE))%>%
plot_ly(x = ~SEX_SPECIFIC, y = ~avg_completion_nhpi, type= 'bar', name= 'Native Hawaiian/Pacific Islander', marker = list(color = "#44668b")) %>%
add_trace(y = ~avg_completion_2M, name= '2 or More Races', marker = list(color = "#1298d0"))%>%
add_trace(y = ~avg_completion_nra, name= 'Nonresident Aliens', marker = list(color = "#f1cb35"))%>%
add_trace(y = ~avg_completion_u, name= 'Unknown', marker = list(color = "#cc8834"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Average Completion Rate"),
font = list(family = "Georgia"))
Retention
Do sex-specific schools affect retention?
Here is the variable we are looking at:
RET_FT4_POOLED (First-time, full-time student retention rate at four-year institutions)
summary(lm(RET_FT4_POOLED~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = RET_FT4_POOLED ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72061 -0.07969 0.02494 0.10271 0.27939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.720612 0.003528 204.247 <2e-16 ***
## SEX_SPECIFICMEN ONLY 0.033652 0.022028 1.528 0.127
## SEX_SPECIFICWOMEN ONLY 0.047545 0.027238 1.746 0.081 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1598 on 2137 degrees of freedom
## (3906 observations deleted due to missingness)
## Multiple R-squared: 0.002461, Adjusted R-squared: 0.001528
## F-statistic: 2.636 on 2 and 2137 DF, p-value: 0.07186
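It looks like there is no significant relationship: neither coefficient crosses the 0.05 threshold, although women-only colleges come closest (p ≈ 0.081).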
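Another way to see the same result is through approximate 95% confidence intervals, which can be rebuilt from the estimates and standard errors printed in the summary. The sketch below uses the numbers copied from the retention output above; the object names are illustrative.

```r
# Estimates and standard errors copied from the retention model summary above.
est <- c("MEN ONLY" = 0.033652, "WOMEN ONLY" = 0.047545)
se  <- c("MEN ONLY" = 0.022028, "WOMEN ONLY" = 0.027238)
df_resid <- 2137                      # residual degrees of freedom from the summary

t_crit <- qt(0.975, df = df_resid)    # two-sided 95% critical value

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)

round(ci, 4)
```

Both intervals include zero, which is consistent with neither coefficient being significant at the 0.05 level.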
Faculty Employment
Next, we will be investigating the question: “Is there a difference in faculty salary and employment at different colleges?”
Variables:
- AVGFACSAL (Average faculty salary)
- PFTFAC (Proportion of faculty that is full-time)
summary(lm(AVGFACSAL~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = AVGFACSAL ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6928.9 -1657.4 -289.9 1336.6 13555.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6928.88 40.76 170.008 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -2217.14 325.38 -6.814 1.09e-11 ***
## SEX_SPECIFICWOMEN ONLY 1020.66 428.12 2.384 0.0172 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2521 on 3920 degrees of freedom
## (2123 observations deleted due to missingness)
## Multiple R-squared: 0.01322, Adjusted R-squared: 0.01271
## F-statistic: 26.25 on 2 and 3920 DF, p-value: 4.715e-12
summary(lm(PFTFAC~ SEX_SPECIFIC, data= college_scorecard_completely_prepped))
##
## Call:
## lm(formula = PFTFAC ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71468 -0.24868 -0.02378 0.28112 0.40862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.591377 0.004878 121.228 < 2e-16 ***
## SEX_SPECIFICMEN ONLY 0.248299 0.050582 4.909 9.57e-07 ***
## SEX_SPECIFICWOMEN ONLY 0.012831 0.049129 0.261 0.794
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2892 on 3580 degrees of freedom
## (2463 observations deleted due to missingness)
## Multiple R-squared: 0.006699, Adjusted R-squared: 0.006144
## F-statistic: 12.07 on 2 and 3580 DF, p-value: 5.958e-06
The first summary table shows a significant negative relationship between average faculty salary and men-only institutions: about $2,217 less per month than the co-ed average of $6,929. In contrast, women-only colleges pay about $1,021 more per month, although that relationship is only weakly significant (p ≈ 0.017). The second table shows a very significant positive relationship between the proportion of full-time faculty and men-only institutions (about 25 percentage points above the co-ed average of 59%), and no significant relationship for women-only colleges.
faculty_data <- college_scorecard_completely_prepped %>%
group_by(SEX_SPECIFIC) %>%
summarize(avg_fascal = mean(AVGFACSAL, na.rm= TRUE),
avg_pftfac = mean(PFTFAC, na.rm= TRUE))
faculty_data %>%
plot_ly(x = ~SEX_SPECIFIC, y = ~avg_fascal, type= 'bar', marker = list(color = "#1298d0"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Average Faculty Salary (Monthly)"),
font = list(family = "Georgia"))
faculty_data %>%
plot_ly(x = ~SEX_SPECIFIC, y = ~avg_pftfac, type= 'bar', marker = list(color = "#cc8834"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Average Proportion of Full Time Faculty"),
font = list(family = "Georgia"))
So men-only colleges pay you less on average but expect you to stick around.
Household Income
How is household income related to sex-specific schools?
Our Variable:
- MEDIAN_HH_INC (Median household income)
summary(hh_income_mod_male <- lm( MEDIAN_HH_INC ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped))
##
## Call:
## lm(formula = MEDIAN_HH_INC ~ SEX_SPECIFIC, data = college_scorecard_completely_prepped)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42219 -8572 -293 8398 42861
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 58009.9 200.2 289.706 < 2e-16 ***
## SEX_SPECIFICMEN ONLY -4385.3 1807.9 -2.426 0.0153 *
## SEX_SPECIFICWOMEN ONLY 8879.9 2240.8 3.963 7.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13200 on 4434 degrees of freedom
## (1609 observations deleted due to missingness)
## Multiple R-squared: 0.004888, Adjusted R-squared: 0.004439
## F-statistic: 10.89 on 2 and 4434 DF, p-value: 1.914e-05
college_scorecard_completely_prepped %>%
group_by(SEX_SPECIFIC) %>%
summarize(avg_median_hh = mean(MEDIAN_HH_INC, na.rm= TRUE)) %>%
plot_ly(x = ~SEX_SPECIFIC, y = ~avg_median_hh, type= 'bar', marker = list(color = "#1298d0"))%>%
layout(barmode = "group",
xaxis = list(title = "College Type"),
yaxis = list(title = "Average Median Household Income"),
font = list(family = "Georgia"))
It looks like there is a significant positive relationship between median household income and attending a women-only college (about $8,880 above the co-ed average of $58,010). Conversely, there is a weaker but still significant negative relationship for men-only colleges (about $4,385 below the co-ed average, p ≈ 0.015).
Conclusion
Overall, this project was an opportunity to use a variety of new tools to answer some interesting questions. I made use of spatial tools, data manipulation tools, and interactive graphical tools. Considering the large number of findings in this project, I think it is safe to say that there are key features that distinguish men-only and women-only colleges. Here is a summary table of all variables that had significant relationships with the particular school types:
MEN ONLY | WOMEN ONLY |
---|---|
Religious Affiliation (positive) | Religious Affiliation (positive) |
Proportion of White Students (strong positive) | Proportion of Nonresident Aliens (weak positive) |
Proportion of Black Students (moderate negative) | Percentage Liberal Arts Degrees (strong positive) |
Admission Rate (strong positive) | Percentage Professional & Vocational Degrees (strong negative) |
Percentage Liberal Arts Degrees (strong positive) | Mean & Median Earning Outcomes (weak positive) |
Percentage STEM Degrees (strong negative) | White Student Completion Rate (weak positive) |
Percentage Professional & Vocational Degrees (strong negative) | Black Student Completion Rate (moderate positive) |
White Student Completion Rate (strong negative) | Hispanic Student Completion Rate (moderate positive) |
Nonresident Alien Completion Rate (strong negative) | Faculty Salary (weak positive) |
Faculty Salary (strong negative) | Median Household Income (strong positive) |
Proportion of Full Time Faculty (strong positive) | |
Median Household Income (weak negative) | |
With all of these results in mind, I think the next step would be to use some of the variables in this table to create more complex multivariate logistic regression models that could predict sex-specific institutions.
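As a rough idea of what that next step could look like, here is a minimal sketch of a logistic regression with two predictors. Everything in it is an assumption for illustration: the data are simulated, the outcome `SINGLE_SEX` collapses men-only and women-only into one class, and the choice of `LIB_ARTS` and `ADM_RATE` as predictors is arbitrary. Telling men-only, women-only, and co-ed apart would need a multinomial model instead.

```r
set.seed(42)  # reproducible toy data

# Simulated institutions: all values are made up for illustration.
n <- 200
toy <- data.frame(
  LIB_ARTS = runif(n),  # share of liberal arts degrees
  ADM_RATE = runif(n)   # admission rate
)

# Hypothetical rule: a higher liberal arts share makes single-sex status more likely.
p_single_sex <- plogis(-2 + 3 * toy$LIB_ARTS)
toy$SINGLE_SEX <- rbinom(n, size = 1, prob = p_single_sex)

# Logistic regression with two predictors.
fit <- glm(SINGLE_SEX ~ LIB_ARTS + ADM_RATE, data = toy, family = binomial)
summary(fit)$coefficients

# Predicted probability of being a single-sex institution.
toy$PRED <- predict(fit, type = "response")
head(toy)
```

With the real data, the very small number of single-sex institutions relative to co-ed ones would be the main thing to watch, since a model can look accurate simply by predicting co-ed every time.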
Source
The data for this project comes from the US Department of Education’s College Score Card: https://collegescorecard.ed.gov/data/
The academic division groupings came from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_academic_fields#Professions_and_applied_sciences
US College Information: https://www.usnews.com/education/best-colleges/articles/how-many-universities-are-in-the-us-and-why-that-number-is-changing
Mixed Sex Education: https://en.wikipedia.org/wiki/Mixed-sex_education