Introduction
Using 2014-2015 data gathered from the California Department of Education, we will use R, Python, and SQL to gain insight into what factors are associated with high school closures in the state of California.
The purpose of this brief analysis is to address the question above, generate ideas for more in-depth analyses, and practice using R, Python, and SQL together.
Packages, Modules, and SQL Database Connection
Load R packages.
library(DBI) # Load the DBI package
library(tidyverse)
library(data.table)
library(lubridate)
library(reticulate)
library(extrafont)
library(knitr)
library(lattice)
library(gridExtra)
library(fuzzyjoin)
library(formattable)
library(leaflet)
library(leaflet.extras)
library(htmltools)
use_python('C:/Users/laryl/AppData/Local/Programs/Python/Python311', required = TRUE)
Load Python Modules.
import pandas as pd
import numpy as np
import pprint as pp
from IPython.display import display
import pingouin as pg
from scipy.stats import t
Load SQLite connection and see the tables within the database.
library(RSQLite)
schools <- "C:/Users/laryl/Documents/data_work/data/cdeschools.sqlite"
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = schools)
tables <- dbListTables(con) # Build a vector of the list of table names: tables
tables
## [1] "frpm" "satscores" "schools"
Data Preparation
Now we are going to query all of the data we need from the SQLite database, taking care to add and clean the columns we need.
-- Join School table with Free-or-Reduced-Price Meal and SAT Score tables while cleaning the data
WITH combined_tables AS (
SELECT School, StatusType, County, City, District,
       Latitude, Longitude, OpenDate, ClosedDate,
       CAST(ClosedDate AS DATE) - CAST(OpenDate AS DATE) AS YearsOpen,
       COALESCE(FundingType, 'No Funding Info') AS FundingType,
       [Academic Year], [Enrollment (K-12)], [Percent (%) Eligible Free (Ages 5-17)], [High Grade],
       enroll12, NumTstTakr, AvgScrRead, AvgScrWrite, AvgScrMath, PctGE1500
FROM schools AS sl
LEFT JOIN frpm AS fm
  ON School = [School Name] AND District = [District Name] AND County = [County Name]
LEFT JOIN satscores AS ss
  ON School = sname AND District = dname AND County = cname)
-- Take table from above and add combined SAT score column, a difference from California Schools SAT Average column, and a region column
SELECT *, (AvgScrRead + AvgScrWrite + AvgScrMath) AS TotalScr,
       (AvgScrRead + AvgScrWrite + AvgScrMath) - (SELECT AVG(AvgScrRead + AvgScrWrite + AvgScrMath) FROM satscores) AS ScrDiff,
       CASE WHEN County IN ('Butte', 'Colusa', 'El Dorado', 'Glenn', 'Lassen', 'Modoc', 'Nevada', 'Placer', 'Plumas', 'Sacramento', 'Shasta', 'Sierra', 'Siskiyou', 'Sutter', 'Tehama', 'Yolo', 'Yuba') THEN 'Superior California'
WHEN County IN ('Del Norte', 'Humboldt', 'Lake', 'Mendocino', 'Napa', 'Sonoma', 'Trinity') THEN 'North Coast'
WHEN County IN ('Alameda', 'Contra Costa', 'Marin', 'San Francisco', 'San Mateo', 'Santa Clara', 'Solano') THEN 'San Francisco Bay Area'
WHEN County IN ('Alpine', 'Amador', 'Calaveras', 'Madera', 'Mariposa', 'Merced', 'Mono', 'San Joaquin', 'Stanislaus', 'Tuolumne') THEN 'Northern San Joaquin Valley'
WHEN County IN ('Monterey', 'San Benito', 'San Luis Obispo', 'Santa Barbara', 'Santa Cruz', 'Ventura') THEN 'Central Coast'
WHEN County IN ('Fresno', 'Inyo', 'Kern', 'Kings', 'Tulare') THEN 'Southern San Joaquin Valley'
WHEN County IN ('Riverside', 'San Bernardino') THEN 'Inland Empire'
WHEN County IN ('Los Angeles') THEN 'Los Angeles'
WHEN County IN ('Orange') THEN 'Orange'
WHEN County IN ('Imperial', 'San Diego') THEN 'San Diego - Imperial' END
AS Region
FROM combined_tables
We will use Python to filter the data for high schools (where grade 12 is the highest grade offered).
ca_schools = r.ca_schools_data.replace(-2147483648, np.NaN)
ca_schools_filtered = ca_schools[ca_schools["High Grade"] == '12']
ca_schools_filtered.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 2569 entries, 1 to 17685
## Data columns (total 24 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 School 2569 non-null object
## 1 StatusType 2569 non-null object
## 2 County 2569 non-null object
## 3 City 2569 non-null object
## 4 District 2569 non-null object
## 5 Latitude 2546 non-null float64
## 6 Longitude 2546 non-null float64
## 7 OpenDate 2569 non-null object
## 8 ClosedDate 2569 non-null object
## 9 YearsOpen 90 non-null float64
## 10 FundingType 2569 non-null object
## 11 Academic Year 2569 non-null object
## 12 Enrollment (K-12) 2569 non-null float64
## 13 Percent (%) Eligible Free (Ages 5-17) 2533 non-null float64
## 14 High Grade 2569 non-null object
## 15 enroll12 1649 non-null float64
## 16 NumTstTakr 1649 non-null float64
## 17 AvgScrRead 1189 non-null float64
## 18 AvgScrWrite 1189 non-null float64
## 19 AvgScrMath 1189 non-null float64
## 20 PctGE1500 1189 non-null float64
## 21 TotalScr 1189 non-null float64
## 22 ScrDiff 1189 non-null float64
## 23 Region 2569 non-null object
## dtypes: float64(13), object(11)
## memory usage: 501.8+ KB
Exploratory Analysis
Summarizing the Data
During the 2014-2015 school year there were 2,479 active high schools and 86 closed high schools. We are focused on the StatusType column, which has 2 main categories: Active and Closed. The other 2 categories were filtered out because there were too few (or no) high schools with those labels. Here is an overall summary of the average differences:
status_summary = ca_schools_filtered.groupby('StatusType')[['FundingType']].value_counts(normalize=True).unstack().reset_index()
status_summary_filtered = status_summary[status_summary["StatusType"] != "Merged"][['StatusType', 'Directly funded', 'Locally funded', 'No Funding Info']]
funding_counts = ca_schools_filtered['StatusType'].value_counts().to_frame()

ca_pivot_table = ca_schools_filtered.pivot_table(values=['Enrollment (K-12)', 'enroll12', 'Percent (%) Eligible Free (Ages 5-17)',
                                                         'NumTstTakr', 'AvgScrRead', 'AvgScrWrite', 'AvgScrMath', 'TotalScr', 'ScrDiff', 'PctGE1500'],
                                                 index='StatusType', aggfunc=[np.mean]).reset_index()
ca_summary = ca_pivot_table[ca_pivot_table["StatusType"] != "Merged"].merge(status_summary_filtered, on = "StatusType", how = "left")
## <string>:1: FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left, 1 on the right)
ca_summary.drop(ca_summary.columns[[1]], axis=1, inplace=True)
py$ca_summary
From this table we can see clear differences in SAT score-related variables, enrollment-related variables, and funding-related variables.
Are These Significant Differences?
Our next step in this exploratory analysis is to choose some of the variables and test how significant the differences are. We test 1 variable from each of the 3 types of variables because variables within a type are probably correlated, so checking all of them would produce relatively similar results. Finding the most significant variables is beyond the scope of this brief analysis. To keep things simple, our alpha value will be 0.05.
To check if there is a relationship between high school status and funding type, we are going to use a chi-square test of independence.
expected, observed, stats = pg.chi2_independence(ca_schools_filtered[ca_schools_filtered["StatusType"] != "Merged"],
                                                 x = 'StatusType',
                                                 y = 'FundingType',
                                                 correction = False)
## C:\Users\laryl\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\pingouin\contingency.py:150: UserWarning: Low count on observed frequencies.
## warnings.warn("Low count on {} frequencies.".format(name))
## C:\Users\laryl\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\pingouin\contingency.py:150: UserWarning: Low count on expected frequencies.
## warnings.warn("Low count on {} frequencies.".format(name))
## C:\Users\laryl\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\scipy\stats\_stats_py.py:7416: RuntimeWarning: divide by zero encountered in power
## terms = f_obs * ((f_obs / f_exp)**lambda_ - 1)
## C:\Users\laryl\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\scipy\stats\_stats_py.py:7416: RuntimeWarning: invalid value encountered in multiply
## terms = f_obs * ((f_obs / f_exp)**lambda_ - 1)
## C:\Users\laryl\AppData\Local\Programs\Python\PYTHON~1\Lib\site-packages\scipy\stats\_stats_py.py:7413: RuntimeWarning: divide by zero encountered in divide
## terms = 2.0 * special.xlogy(f_exp, f_exp / f_obs)
stats
## test lambda chi2 dof pval cramer power
## 0 pearson 1.000000 17.433032 3.0 0.000576 0.082441 0.953058
## 1 cressie-read 0.666667 16.266950 3.0 0.001000 0.079636 0.938077
## 2 log-likelihood 0.000000 14.388828 3.0 0.002421 0.074898 0.904738
## 3 freeman-tukey -0.500000 NaN 3.0 NaN NaN NaN
## 4 mod-log-likelihood -1.000000 inf 3.0 0.000000 inf NaN
## 5 neyman -2.000000 NaN 3.0 NaN NaN NaN
Looking at the pval column of this table, there appears to be a significant association between status and funding.
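Since pingouin warns about low observed counts, a quick cross-check of the Pearson result with SciPy's `chi2_contingency` on the raw contingency table is worthwhile. The counts below are hypothetical placeholders, not the actual status-by-funding table (which would come from `pd.crosstab` on `ca_schools_filtered`):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical status-by-funding contingency table (illustration only);
# the real counts would come from
# pd.crosstab(ca_schools_filtered['StatusType'], ca_schools_filtered['FundingType']).
observed = np.array([[1200, 1279],   # Active: directly vs. locally funded
                     [  55,   31]])  # Closed
chi2, pval, dof, expected = chi2_contingency(observed, correction=False)
print('chi2 =', round(chi2, 2), 'dof =', dof, 'p =', round(pval, 4))
```

`correction=False` matches the pingouin call above; with a 2x2 table the Yates continuity correction would otherwise be applied by default.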
Next, we are interested in performing a t-test to examine whether there is a statistically significant association between status and SAT score total.
x_bar_active = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Active"]["TotalScr"].mean()
x_bar_closed = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Closed"]["TotalScr"].mean()

n_active = len(ca_schools_filtered[ca_schools_filtered['StatusType'] == "Active"])
n_closed = len(ca_schools_filtered[ca_schools_filtered['StatusType'] == "Closed"])

s_active = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Active"]["TotalScr"].std()
s_closed = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Closed"]["TotalScr"].std()

numerator = x_bar_active - x_bar_closed
denominator = np.sqrt(s_active ** 2 / n_active + s_closed ** 2 / n_closed)
t_stat = numerator / denominator
deg_freedom = n_active + n_closed - 2
print('p-value =', 1 - t.cdf(t_stat, df = deg_freedom))
## p-value = 0.014652696251728048
Once again, since the p-value falls below our significance threshold, there is evidence to support an association between status and SAT scores.
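Note that the hand-rolled statistic above pairs an unpooled (Welch-style) standard error with pooled degrees of freedom. As a sanity check, `scipy.stats.ttest_ind` with `equal_var=False` runs Welch's t-test with the Welch-Satterthwaite degrees of freedom (the `alternative` argument requires SciPy >= 1.6). The arrays below are synthetic stand-ins for the two `TotalScr` groups, not the real data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for TotalScr in active vs. closed schools
# (illustration only; the real columns contain NaNs to drop first).
active_scores = rng.normal(1485, 200, size=1150)
closed_scores = rng.normal(1430, 200, size=39)

# Welch's t-test, one-sided: do active schools score higher?
t_stat, pval = ttest_ind(active_scores, closed_scores,
                         equal_var=False, alternative='greater')
print('t =', round(t_stat, 3), 'p =', round(pval, 4))
```

With very unequal group sizes like these, Welch's version is generally the safer default because it does not assume equal variances.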
Lastly, we are going to perform another t-test, this time, focusing on the relationship between status and enrollment of 12th graders.
x_bar_active_takers = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Active"]["enroll12"].mean()
x_bar_closed_takers = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Closed"]["enroll12"].mean()

s_active_takers = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Active"]["enroll12"].std()
s_closed_takers = ca_schools_filtered[ca_schools_filtered['StatusType'] == "Closed"]["enroll12"].std()

numerator = x_bar_active_takers - x_bar_closed_takers
denominator = np.sqrt(s_active_takers ** 2 / n_active + s_closed_takers ** 2 / n_closed)
t_stat_takers = numerator / denominator
deg_freedom = n_active + n_closed - 2
print('p-value =', 1 - t.cdf(t_stat_takers, df = deg_freedom))
## p-value = 0.0
Once again, the p-value falls below 0.05, supporting an association between status and 12th-grade enrollment.
Recap and Way Forward
Funding, SAT scores, and enrollment are all associated with high school status. Compared with active schools, the average closed school is more likely to be directly funded, has an average total SAT score about 53 points below the overall California average, and enrolls fewer students. Closed schools stayed open for about 13 years on average.
The time constraints of this project prevented deeper dives into the data. For example, education type (like SOCType and Educational Option Type) could be another factor that affects school status. The education types were not explored because they have numerous categories and would have taken much more time to sort through. Additionally, school locations could be associated with school closures as well. Here is a map of all of the school closures in red.
active_ca_schools <- py$ca_schools_filtered %>%
  filter(StatusType == "Active")
closed_ca_schools <- py$ca_schools_filtered %>%
  filter(StatusType == "Closed")

map <- leaflet() %>%
  addProviderTiles("CartoDB") %>%
  addCircleMarkers(data = active_ca_schools,
                   radius = 1,
                   color = "#8C92AC",
                   label = ~htmlEscape(active_ca_schools[["School"]])) %>%
  addCircleMarkers(data = closed_ca_schools,
                   radius = 1,
                   color = "#ff5349",
                   label = ~htmlEscape(closed_ca_schools[["School"]])) %>%
  setView(lat = 37.56258, lng = -121.9656, zoom = 5.5)
map
Here is a table summarizing the active and closed schools by regions (created using the counties from the original data).
#ca_schools_filtered[ca_schools_filtered['StatusType'] == "Closed"]['Region'].value_counts()
#ca_schools_filtered[ca_schools_filtered['StatusType'] == "Active"]['Region'].value_counts()
ca_schools_filtered[ca_schools_filtered['StatusType'].isin(["Closed", "Active"])].groupby('Region')[['StatusType']].value_counts().unstack().sort_values("Active")
## StatusType Active Closed
## Region
## Orange 107 2
## North Coast 149 4
## Central Coast 154 2
## San Diego - Imperial 192 10
## Northern San Joaquin Valley 222 4
## Southern San Joaquin Valley 235 6
## Inland Empire 239 4
## San Francisco Bay Area 327 18
## Superior California 339 19
## Los Angeles 515 17
Although regions with more schools tend to have more closures, particular regions could be more susceptible to closures than others. Moreover, region as a variable may be too broad to reveal more granular county, city, or district patterns. It would be interesting to combine this data with wealth data gathered from external sources.
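One way to separate region size from susceptibility is to normalize the counts above into closure rates. This sketch rebuilds the printed table and ranks regions by the share of schools that closed:

```python
import pandas as pd

# Region counts copied from the table above.
counts = pd.DataFrame({
    "Active": [107, 149, 154, 192, 222, 235, 239, 327, 339, 515],
    "Closed": [2, 4, 2, 10, 4, 6, 4, 18, 19, 17],
}, index=["Orange", "North Coast", "Central Coast", "San Diego - Imperial",
          "Northern San Joaquin Valley", "Southern San Joaquin Valley",
          "Inland Empire", "San Francisco Bay Area", "Superior California",
          "Los Angeles"])

# Closure rate = closed / (active + closed), controlling for region size.
closure_rate = counts["Closed"] / counts.sum(axis=1)
print(closure_rate.sort_values(ascending=False).round(3))
```

By this measure, Superior California and the San Francisco Bay Area close schools at somewhat higher rates than Los Angeles, despite Los Angeles having the most schools.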
Finally, using the findings from above, an interesting future project could attempt to better identify which specific variables among the 3 categories are the most statistically significant. These variables could then be used to create a multiple logistic regression model that could help predict school closures for future years.
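As a rough sketch of that future project, the pipeline below fits a logistic regression on synthetic data whose columns loosely mimic funding type, ScrDiff, and 12th-grade enrollment. Everything here (the feature construction, the coefficients, and scikit-learn as the modeling library) is an assumption for illustration, not part of this analysis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 500
# Synthetic features standing in for funding, SAT score difference, and enrollment.
X = np.column_stack([
    rng.integers(0, 2, n),    # directly funded (0/1)
    rng.normal(0, 100, n),    # ScrDiff stand-in
    rng.normal(300, 100, n),  # enroll12 stand-in
])
# Synthetic closure labels loosely tied to the features (illustration only).
logits = 1.2 * X[:, 0] - 0.01 * X[:, 1] - 0.01 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Scale features so the coefficients are comparable across units.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict_proba(X[:3])[:, 1])  # predicted closure probabilities
```

On the real data, the same pipeline could be fit on `ca_schools_filtered` after encoding FundingType and dropping rows with missing scores, with the predicted probabilities flagging schools at risk of closure.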