I wanted to start learning how to visualize data with Seaborn, and this seemed like the right opportunity to try out some code. This is by no means a serious attempt to answer the question; it is only the exploratory phase of a data analysis project.
# Import basic data manipulation and plotting packages
import numpy as np
import pandas as pd
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
#Read in csv file and take a look at structure and data types
tv = pd.read_csv(r"C:/Users/laryl/Desktop/Data Sets/tv_shows.csv")
tv.info()  # info() prints its summary itself, so wrapping it in print() only adds a stray "None"
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5611 entries, 0 to 5610
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Unnamed: 0       5611 non-null   int64
 1   Title            5611 non-null   object
 2   Year             5611 non-null   int64
 3   Age              3165 non-null   object
 4   IMDb             4450 non-null   float64
 5   Rotten Tomatoes  1011 non-null   object
 6   Netflix          5611 non-null   int64
 7   Hulu             5611 non-null   int64
 8   Prime Video      5611 non-null   int64
 9   Disney+          5611 non-null   int64
 10  type             5611 non-null   int64
dtypes: float64(1), int64(7), object(3)
memory usage: 416.5+ KB
#String manipulation to remove percent symbol
tv["Rotten Tomatoes"] = tv["Rotten Tomatoes"].str.strip("%")
#Check result of string manipulation
print(tv["Rotten Tomatoes"].head(5))
# Change data type from object to float
tv["Rotten Tomatoes"] = tv["Rotten Tomatoes"].astype('float')
print(tv.dtypes)
0    96
1    93
2    91
3    78
4    97
Name: Rotten Tomatoes, dtype: object
Unnamed: 0           int64
Title               object
Year                 int64
Age                 object
IMDb               float64
Rotten Tomatoes    float64
Netflix              int64
Hulu                 int64
Prime Video          int64
Disney+              int64
type                 int64
dtype: object
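The conversion above works because every non-missing entry is a clean percentage. As a minimal sketch of a more forgiving variant, assuming some entries might not parse, `pd.to_numeric` with `errors="coerce"` turns anything non-numeric into NaN instead of raising. The sample values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values; the real column is tv["Rotten Tomatoes"]
raw = pd.Series(["96%", "93%", None, "n/a", "78%"], name="Rotten Tomatoes")

# Drop a trailing percent sign, then coerce anything non-numeric to NaN
scores = pd.to_numeric(raw.str.rstrip("%"), errors="coerce")

print(scores)
print("entries that could not be parsed or were missing:", scores.isna().sum())
```

Counting the NaN values afterwards shows how many rows were lost to missing or malformed entries.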
At this point, I just wanted to understand how the age categories differed in their scores.
# Visualize Age Groups by IMDb Score
plt.style.use("dark_background")
# sns.set_palette() returns None, so the palette name is passed to barplot directly
sns.set_palette("Blues")
sns.barplot(x="Age",
            y="IMDb",
            data=tv,
            ci=None,  # on seaborn 0.12 and later, errorbar=None replaces ci=None
            order=["all", "18+", "16+", "13+", "7+"],
            palette="Blues")
sns.set_context("notebook")
plt.show()
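# Visualize Age Groups by Rotten Tomatoes Score
# sns.set_palette() returns None, so the palette name is passed to barplot directly
sns.set_palette("Reds")
sns.barplot(x="Age",
            y="Rotten Tomatoes",
            data=tv,
            ci=None,  # on seaborn 0.12 and later, errorbar=None replaces ci=None
            order=["all", "18+", "16+", "13+", "7+"],
            palette="Reds")
plt.show()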
Although the two scores show different results, the plots suggest a marked dislike for TV shows targeted toward early teenage audiences. In my opinion, I could see why this would be the case (early teen shows are cringy). But I wanted to see why the difference was so large, so I decided to visualize the number of shows in each category using a countplot.
plt.style.use("dark_background")
# sns.set_palette() returns None, so the palette name is passed to countplot directly
sns.set_palette("Greens")
plots = sns.countplot(x="Age",
                      data=tv,
                      order=["all", "18+", "16+", "13+", "7+"],
                      palette="Greens")
# Label each bar with its count (whole numbers, so no decimals)
for bar in plots.patches:
    plots.annotate(format(bar.get_height(), '.0f'),
                   (bar.get_x() + bar.get_width() / 2,
                    bar.get_height()), ha='center', va='center',
                   size=9, xytext=(0, 4),
                   textcoords='offset points')
plt.show()
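The bar heights in the countplot can also be tabulated next to the mean scores, which makes it easier to see when an average rests on very few shows. This is a rough sketch on a made-up toy DataFrame; the values and the `min_shows` threshold are assumptions for illustration, and with the real data `toy` would be replaced by `tv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the tv DataFrame
toy = pd.DataFrame({
    "Age": ["18+", "18+", "18+", "16+", "16+", "13+"],
    "IMDb": [8.1, 7.4, 6.9, 7.8, 7.0, 5.2],
})

# Number of rated shows and mean score per age group
summary = toy.groupby("Age")["IMDb"].agg(["count", "mean"])
print(summary)

# Flag groups whose mean is based on fewer shows than an arbitrary threshold
min_shows = 2
small = summary[summary["count"] < min_shows].index.tolist()
print(small)  # ['13+']
```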
With only 4 shows in that category, the sample size is not big enough to draw conclusions. Because bar charts can only tell a limited story, why not see what the distributions of these categories and their scores look like using a beeswarm plot?
sns.swarmplot(x="Age",
              y="IMDb",
              data=tv,
              size=3)
plt.show()  # finish this figure so the next plot gets its own axes
sns.swarmplot(y="Age",
              x="Rotten Tomatoes",
              data=tv,
              size=3)
plt.show()
Interestingly, although the IMDb score distributions are similarly shaped across age categories, the Rotten Tomatoes scores look quite different. For some reason, TV shows that are for everyone rarely have a Rotten Tomatoes score, and when they do, it is either fairly high or really low. Without analyzing this too critically, this pattern could be because Rotten Tomatoes prioritizes TV shows for more specific audiences. The author of the data set did not provide much information about it, so a more rigorous analysis of why the data behaves this way would require much more thorough research.
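The observation that shows for everyone rarely have a Rotten Tomatoes score could be checked by computing, for each age category, the share of shows that have a score at all. This is only a sketch on made-up toy values, not results from the dataset; with the real data `toy` would be replaced by `tv`.

```python
import pandas as pd

# Hypothetical toy data; None marks a show without a Rotten Tomatoes score
toy = pd.DataFrame({
    "Age": ["all", "all", "all", "all", "18+", "18+"],
    "Rotten Tomatoes": [None, None, None, 95.0, 88.0, None],
})

# notna() gives True/False per show; the group mean is the share that is rated
rated_share = toy["Rotten Tomatoes"].notna().groupby(toy["Age"]).mean()
print(rated_share)
```

In this toy example a quarter of the "all" shows and half of the "18+" shows have a score. Rows with a missing `Age` are dropped by `groupby`, which matters for the real data since that column has many missing entries.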
From this exploratory look alone, there appears to be no significant difference between the age categories in terms of ratings, though answering the question properly would take a more thorough approach. Because of the time constraints for this project, more robust statistical methods could not be used to provide a more convincing answer. In the next few weeks, I plan on coming back to this project to complete the analysis. I am really interested in doing more research on the differences between these rating platforms (partially because many people believe these platforms are biased).
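As one possible starting point for that follow-up, a Kruskal-Wallis test compares score distributions across several groups without assuming normality. This was not part of the analysis above; the sketch below uses `scipy.stats.kruskal` on made-up toy scores, and the choice of test is an assumption rather than a recommendation from the data set's author.

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical toy scores; with the real data, use tv[["Age", "IMDb"]] instead
toy = pd.DataFrame({
    "Age": ["18+"] * 4 + ["16+"] * 4 + ["7+"] * 4,
    "IMDb": [8.1, 7.4, 6.9, 7.7,
             7.8, 7.0, 6.5, 7.2,
             6.8, 7.1, 7.5, 6.2],
})

# One array of non-missing scores per age category
groups = [g["IMDb"].dropna().to_numpy() for _, g in toy.groupby("Age")]

# Null hypothesis: all groups come from the same distribution
result = kruskal(*groups)
print("H statistic:", result.statistic)
print("p-value:", result.pvalue)
```

A small p-value would suggest that at least one category's scores differ from the others. Categories with only a handful of shows would still give the test very little to work with.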
The TV shows dataset was acquired from Kaggle user Ruchi Bhatia: https://www.kaggle.com/ruchi798/tv-shows-on-netflix-prime-video-hulu-and-disney
The solution for annotating the graphs seen above was taken from this source: https://www.geeksforgeeks.org/how-to-annotate-bars-in-barplot-with-matplotlib-in-python/