Introduction & Motivation
Research Questions
Data Collection
Data Description
User Score
Preliminary Analysis
Methodology & Diagnostics
Results
Discussion
World Gross
Preliminary Analysis
Methodology & Diagnostics
Results
Discussion
Conclusion
Sources
Film has become a major form of art, inspiring writers and directors to tell all sorts of riveting stories. At the same time, films can be regarded as a form of investment that is extremely dependent on capital and industrial standards. Film has also become a major form of entertainment, consuming countless hours of many people's lives. With how influential films are to creators, investors, and casual movie-goers, it is important to ask: what makes a film "successful"? Naturally, a question like this is difficult to answer directly, since "successful" can be defined in many different ways. To address this issue, we decided to associate a movie's "success" with its user score and world gross for our analysis. With "success" explicitly defined, we were able to narrow our focus and ask the following questions.
What are the significant factors that impact whether or not a movie gets a good user score?
What are the significant factors that impact the world gross of a movie?
All of our data comes from the following websites: IMDB Top 1000 Movies, IMDB Bottom 1000 Movies, The Numbers Budget & Financial Performance, Oscar Winners, and Insider Top 27 Movie Franchises. We used web scraping to extract the data we wanted from each website and then combined the different datasets into one final dataset. We cleaned the data by dropping rows with missing values, recategorizing several variables such as genre, and making sure the different datasets matched up well with each other. We used several functions to assist us in creating our final dataset.
Using web scraping, we wrote a function that builds a data frame containing the release date, movie title, production budget, and worldwide gross for 6345 movies listed on The Numbers website. The data frame captures every movie on the site, with adjustments for data-entry errors such as missing values or unknown release dates. During data cleaning, we used additional self-defined functions to standardize movie titles by removing titles with non-English characters, stray symbols, and misspellings, and to match similar titles across data frames so that information for the same movies could be combined.
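To illustrate the approach, here is a minimal sketch of this kind of scraper, assuming the budget figures sit in a plain HTML table; the URL, the column positions, and the `scrape_budget_page` helper are illustrative rather than our exact code.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_budget_page(url):
    """Parse one page of a budget/gross table into a DataFrame.
    The column positions below are assumptions about the page layout."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    rows = []
    for tr in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 6:
            continue  # skip header rows and malformed rows
        rows.append({"release_date": cells[1], "title": cells[2],
                     "budget": cells[3], "gross_wor": cells[5]})
    df = pd.DataFrame(rows)
    # strip "$" and "," so the money columns become numeric;
    # unparseable entries turn into NaN and can be dropped later
    for col in ("budget", "gross_wor"):
        df[col] = pd.to_numeric(df[col].str.replace(r"[$,]", "", regex=True),
                                errors="coerce")
    return df

# illustrative URL only; the real site paginates, so pages would be looped over
numbers_df = scrape_budget_page("https://www.the-numbers.com/movie/budgets/all")
```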
By web scraping the IMDB pages, we extracted information about the top 1000 and bottom 1000 movies, including movie title, meta score, user score, year, genre, movie rating, duration in minutes, director, and lead actor. After extracting these 2000 rows, we dropped any rows with missing values and ended up with 1663 rows. Because each movie lists several genres, we simplified the genre column by assigning each movie its most dominant genre, which left us with 9 genre categories. We also added several new dummy-variable columns, such as director_one, which indicates whether or not a movie had only one director.
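As a rough illustration of the genre simplification and the dummy columns, the sketch below collapses a multi-genre string into one dominant genre and then expands it into indicator columns; the priority order, the `RAW_TO_SIMPLE` mapping, and the `imdb_df` data frame with its column names are assumptions for the example, not our exact rules.

```python
import pandas as pd

# Assumed priority order for picking a single dominant genre.
GENRE_PRIORITY = ["animation", "horror", "bio", "comedy",
                  "romance", "fantasy_sci", "action_adv", "drama"]

# Hypothetical mapping from raw IMDB genre words to our simplified labels.
RAW_TO_SIMPLE = {"animation": "animation", "horror": "horror",
                 "biography": "bio", "comedy": "comedy", "romance": "romance",
                 "sci-fi": "fantasy_sci", "fantasy": "fantasy_sci",
                 "action": "action_adv", "adventure": "action_adv",
                 "drama": "drama"}

def dominant_genre(raw_genres):
    """Collapse a string like 'Horror, Action, Thriller' to one label."""
    found = {RAW_TO_SIMPLE[g.strip().lower()]
             for g in raw_genres.split(",")
             if g.strip().lower() in RAW_TO_SIMPLE}
    if {"comedy", "drama"} <= found:        # both present -> comedy-drama
        return "comedy_drama"
    for genre in GENRE_PRIORITY:
        if genre in found:
            return genre
    return "drama"                          # fallback when nothing matches

imdb_df["genre_simple"] = imdb_df["genre"].apply(dominant_genre)
# one 0/1 indicator per simplified genre, e.g. genre_horror, genre_drama, ...
imdb_df = pd.get_dummies(imdb_df, columns=["genre_simple"], prefix="genre")

# example dummy variable: 1 when exactly one director is listed
# ("directors" as a comma-separated string is an assumption)
imdb_df["director_one"] = (imdb_df["directors"].str.count(",") == 0).astype(int)
```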
In addition to this, we utilized an undocumented API when extracting information from the Oscars website. More specifically, we extracted all Oscar winners from 1934 to 2021 and then filtered for winners who were either actors or directors. Afterward, we created functions that produce new binary variables identifying the movies with an Oscar-winning lead and the movies with an Oscar-winning director.
We also web scraped the Insider website, extracting the movies that belong to the top franchises selected by movie critics. Notable franchises include Spider-Man, Batman, Star Wars, and James Bond. After extracting these titles, we created functions that produce a new indicator variable marking the movies in our dataset that are part of an established franchise/IP.
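The Oscar and franchise indicators from the last two steps follow the same simple pattern, sketched below; the set contents, the `movies` data frame, and its column names (`title`, `lead_actor`, `director`) are hypothetical placeholders.

```python
# Hypothetical lookup sets built from the Oscars and Insider scrapes.
oscar_actors = {"Frances McDormand", "Anthony Hopkins"}        # illustrative
oscar_directors = {"Bong Joon-ho", "Guillermo del Toro"}       # illustrative
franchise_titles = {"Spider-Man", "Batman Begins", "Skyfall"}  # illustrative

def flag(series, lookup):
    """1 if the value appears in the lookup set, otherwise 0."""
    return series.isin(lookup).astype(int)

movies["oscar_lead"] = flag(movies["lead_actor"], oscar_actors)
movies["oscar_director"] = flag(movies["director"], oscar_directors)
movies["ip"] = flag(movies["title"], franchise_titles)
```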
 | year | gross_wor | budget | score_user | score_user_good | score_meta | ip | oscar_lead | director_one | oscar_director | ... | rating_r | genre_action_adv | genre_animation | genre_bio | genre_comedy | genre_comedy_drama | genre_drama | genre_horror | genre_romance | genre_fantasy_sci |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008 | 2.690657e+08 | 105000000.0 | 5.1 | 0 | 34.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1995 | 1.688415e+08 | 29000000.0 | 8.0 | 1 | 74.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 2013 | 1.807651e+08 | 20000000.0 | 8.1 | 1 | 96.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2019 | 3.891404e+08 | 100000000.0 | 8.2 | 1 | 78.0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2003 | 5.966762e+07 | 20000000.0 | 7.6 | 1 | 70.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
930 | 2007 | 8.308008e+07 | 85000000.0 | 7.7 | 1 | 78.0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
931 | 2011 | 1.708055e+08 | 80000000.0 | 5.2 | 0 | 30.0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
932 | 2016 | 5.534869e+07 | 50000000.0 | 4.7 | 0 | 34.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
933 | 2006 | 1.250619e+07 | 35000000.0 | 4.3 | 0 | 26.0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
934 | 2016 | 1.004630e+09 | 150000000.0 | 8.0 | 1 | 78.0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
935 rows × 27 columns
In the end, our analysis uses the dataset shown above, with 935 observations and 27 variables:

- year: the year a movie was released.
- gross_wor: the worldwide gross of a movie, in USD.
- budget: the production budget of a movie, in USD.
- score_user and score_meta: the user rating (scaled from 0 to 10) and the critic rating (scaled from 0 to 100), with the dummy variable score_user_good indicating whether the user rating is at least 7.
- ip: a dummy variable indicating whether a movie is part of an established IP, such as Spider-Man, Batman, Star Wars, or James Bond.
- oscar_lead and oscar_director: dummy variables indicating whether a movie has an Oscar-winning lead actor and an Oscar-winning director, respectively.
- director_one: a dummy variable indicating whether a movie has exactly one director.
- mins: the duration of a movie, in minutes.
- log_year, log_gross_wor, log_budget, and log_mins: the log of year, gross_wor, budget, and mins, respectively.
- rating_pg, rating_pg_13, and rating_r: indicators for whether a movie has a particular rating.
- genre_action_adv, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, genre_horror, genre_romance, and genre_fantasy_sci: indicators for whether a movie belongs to a particular genre.
Before applying any methodology, we conducted some preliminary analysis on score_user. We started by creating different scatterplots with score_user as our response variable and year, gross_wor, budget, mins, and score_meta as separate explanatory variables. However, rather than following a linear or dispersed pattern, the points in each plot separate into two clear clusters, one below score_user = 7 and one above. This indicates that score_user behaves less like a continuous response and more like a binary response, with one category where score_user is less than 7 ("bad") and another where score_user is at least 7 ("good").
This binary characteristic of score_user is also highlighted in its distribution. As shown above, the distribution is bimodal, with one center around score_user = 5 and another around score_user = 8. Again, this indicates that we can divide score_user into two groups, one representing "bad" user scores and the other representing "good" user scores.
Given the "binary" characteristic revealed by the scatterplots and distribution, we believed that it would be most appropriate to apply a logististic regression. However, with this technique, the response of each observation must independently follow a Bernoulli distribution. Now, it is quite difficult for us to confirm independence since we lack information on the specific people writing movie reviews (i.e. the score of a horror film will likely not be independent from the score of a romance film if they are both reviewed by people who are hardcore horror fans). And so, for the purposes of our analysis, we assume that there is independence.
With that said, we started by creating a new feature called score_user_good, which converts score_user into a Bernoulli variable equal to 1 if score_user is at least 7 ("good") and 0 if it is less than 7 ("bad").
With the addition of score_user_good, we can conduct our preliminary analysis with a new perspective. Instead of considering how score_user is correlated with other continuous variables, we can compare different distributions to gain some understanding of how different values for a continuous input affect the probability of a movie getting a good user score.
It is worth mentioning that we also had to create the new features log_year, log_gross_wor, log_budget, and log_mins, which log-transform year, gross_wor, budget, and mins respectively. This is because year had some extremely low values while gross_wor, budget, and mins had some extremely high values, which may dilute the effect of other covariates. Applying a log transformation puts these variables on a common log scale so they do not overwhelm the other covariates. We did not apply this to score_meta since certain observations had score_meta = 0 (NOTE: log(0) is undefined). However, we believe this will not cause too many issues since score_meta has no values extreme enough to dilute the other covariates.
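A minimal sketch of this feature engineering is shown below, assuming the merged data sits in a data frame called `movies`.

```python
import numpy as np

# Binary response: 1 when the user score is at least 7 ("good"), else 0.
movies["score_user_good"] = (movies["score_user"] >= 7).astype(int)

# Put the skewed numeric inputs on a common log scale; score_meta is left
# untouched because it contains zeros and log(0) is undefined.
for col in ("year", "gross_wor", "budget", "mins"):
    movies["log_" + col] = np.log(movies[col])
```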
With that said, the plots above show the distributions of log_year, log_gross_wor, log_budget, log_mins, and score_meta given that a movie gets a good user score (marked in green) or a bad user score (marked in red). For the log_gross_wor and log_budget plots, there is a lot of overlap between the good and bad distributions, which suggests that log_gross_wor and log_budget may each be independent of score_user_good. For the log_year and log_mins plots, there is still a considerable amount of overlap; however, the probability of a movie getting a good user score appears slightly higher when the movie is recent and slightly lower when it is longer, which suggests that log_year and log_mins may have a slight effect on whether or not a movie gets a good user score. The score_meta plot shows the least overlap, and we can clearly see that the probability of a good user score is higher at higher meta scores, which suggests that score_meta may have a strong effect on whether or not a movie gets a good user score.
 | P(score_user_good=1 \| x=1) | P(score_user_good=1 \| x=0) |
---|---|---|
ip | 0.915254 | 0.425799 |
oscar_lead | 0.701031 | 0.428401 |
director_one | 0.452244 | 0.515152 |
oscar_director | 0.803922 | 0.436652 |
rating_pg | 0.407821 | 0.468254 |
rating_pg_13 | 0.334262 | 0.532986 |
rating_r | 0.589421 | 0.358736 |
genre_action_adv | 0.429204 | 0.465444 |
genre_animation | 0.568182 | 0.451178 |
genre_bio | 0.943396 | 0.427438 |
genre_comedy | 0.169492 | 0.498164 |
genre_comedy_drama | 0.603175 | 0.446101 |
genre_drama | 0.746575 | 0.403042 |
genre_horror | 0.198473 | 0.498756 |
genre_romance | 0.391304 | 0.465854 |
genre_fantasy_sci | 0.435897 | 0.457589 |
To gain some understanding of how our binary covariates may affect score_user_good, we calculated the sample probability of getting a good user score when each covariate is 1 and when it is 0. As the table above shows, several covariates have roughly equal probabilities in the two columns (we treated differences of at most 10% as roughly equal). This suggests that a movie getting a good user score may not depend on whether or not it is directed by one director, PG, action/adventure, romance, or sci-fi/fantasy. For the remaining covariates, the probability of getting a good user score when x = 1 differs noticeably from when x = 0, which suggests that a movie getting a good user score may depend on whether or not it is part of an established IP, PG-13, R, animation, biography, comedy, comedy-drama, drama, horror, has an Oscar-winning lead, or has an Oscar-winning director.
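The table above can be reproduced with a short groupby, sketched below under the same `movies` assumption.

```python
import pandas as pd

binary_covariates = ["ip", "oscar_lead", "director_one", "oscar_director",
                     "rating_pg", "rating_pg_13", "rating_r",
                     "genre_action_adv", "genre_animation", "genre_bio",
                     "genre_comedy", "genre_comedy_drama", "genre_drama",
                     "genre_horror", "genre_romance", "genre_fantasy_sci"]

rows = {}
for x in binary_covariates:
    # sample proportion of "good" user scores within each level of x
    p = movies.groupby(x)["score_user_good"].mean()
    rows[x] = {"P(good | x=1)": p.get(1), "P(good | x=0)": p.get(0)}

prob_table = pd.DataFrame(rows).T
```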
 | coef | std err | z | P>\|z\| |
---|---|---|---|---|
const | 1381.323500 | 291.223000 | 4.743000 | 0.000000 |
log_year | -269.034500 | 55.629000 | -4.836000 | 0.000000 |
log_budget | -1.873400 | 0.362000 | -5.174000 | 0.000000 |
log_gross_wor | 0.915600 | 0.216000 | 4.244000 | 0.000000 |
log_mins | 11.755400 | 2.550000 | 4.609000 | 0.000000 |
score_meta | 0.202000 | 0.024000 | 8.569000 | 0.000000 |
ip | 0.712800 | 1.065000 | 0.669000 | 0.503000 |
oscar_lead | 1.510200 | 1.072000 | 1.409000 | 0.159000 |
director_one | -0.439600 | 1.044000 | -0.421000 | 0.674000 |
oscar_director | 0.531600 | 1.470000 | 0.362000 | 0.718000 |
rating_pg | 459.395300 | 96.729000 | 4.749000 | 0.000000 |
rating_pg_13 | 460.527000 | 97.189000 | 4.738000 | 0.000000 |
rating_r | 461.401200 | 97.309000 | 4.742000 | 0.000000 |
genre_action_adv | 152.872000 | 32.308000 | 4.732000 | 0.000000 |
genre_animation | 157.433600 | 33.004000 | 4.770000 | 0.000000 |
genre_bio | 154.614600 | 32.416000 | 4.770000 | 0.000000 |
genre_comedy | 151.378200 | 32.095000 | 4.717000 | 0.000000 |
genre_comedy_drama | 153.490200 | 32.403000 | 4.737000 | 0.000000 |
genre_drama | 154.426000 | 32.464000 | 4.757000 | 0.000000 |
genre_horror | 150.792100 | 32.141000 | 4.692000 | 0.000000 |
genre_romance | 152.767500 | 32.247000 | 4.737000 | 0.000000 |
genre_fantasy_sci | 153.549300 | 32.273000 | 4.758000 | 0.000000 |
 | VIF |
---|---|
rating_pg_13 | inf |
rating_r | inf |
genre_romance | inf |
genre_horror | inf |
genre_drama | inf |
genre_comedy_drama | inf |
genre_comedy | inf |
genre_bio | inf |
genre_animation | inf |
genre_action_adv | inf |
genre_fantasy_sci | inf |
rating_pg | inf |
log_mins | 2.365346 |
log_budget | 2.336267 |
score_meta | 2.064101 |
log_gross_wor | 2.048218 |
log_year | 1.410934 |
ip | 1.308119 |
director_one | 1.160660 |
oscar_director | 1.090596 |
oscar_lead | 1.083236 |
 | coef | std err | z | P>\|z\| |
---|---|---|---|---|
const | -1.036500 | 0.376000 | -2.760000 | 0.006000 |
ip | 3.441800 | 0.503000 | 6.849000 | 0.000000 |
oscar_lead | 0.930200 | 0.281000 | 3.309000 | 0.001000 |
director_one | -0.559000 | 0.329000 | -1.698000 | 0.089000 |
oscar_director | 1.625100 | 0.415000 | 3.917000 | 0.000000 |
rating_pg | 0.661200 | 0.263000 | 2.511000 | 0.012000 |
rating_r | 1.661300 | 0.202000 | 8.244000 | 0.000000 |
genre_animation | 0.856600 | 0.424000 | 2.021000 | 0.043000 |
genre_bio | 3.425900 | 0.633000 | 5.414000 | 0.000000 |
genre_comedy | -1.097400 | 0.319000 | -3.439000 | 0.001000 |
genre_comedy_drama | 0.865200 | 0.330000 | 2.620000 | 0.009000 |
genre_drama | 1.366300 | 0.273000 | 4.997000 | 0.000000 |
genre_horror | -1.472600 | 0.315000 | -4.671000 | 0.000000 |
genre_romance | 0.191400 | 0.273000 | 0.701000 | 0.483000 |
genre_fantasy_sci | -0.009200 | 0.405000 | -0.023000 | 0.982000 |
 | VIF |
---|---|
director_one | 4.358739 |
rating_r | 2.284861 |
rating_pg | 1.929548 |
genre_drama | 1.732625 |
genre_horror | 1.619540 |
genre_comedy | 1.557111 |
genre_romance | 1.486488 |
genre_animation | 1.395103 |
genre_comedy_drama | 1.301074 |
genre_bio | 1.273952 |
oscar_lead | 1.167202 |
genre_fantasy_sci | 1.152787 |
ip | 1.132938 |
oscar_director | 1.088296 |
As of now, we suspect that variables such as log_year, log_mins, score_meta, ip, oscar_lead, oscar_director, rating_pg_13, rating_r, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, and genre_horror may be significant factors that impact whether or not a movie gets a good user score. However, we put this to the test by first fitting a full logistic model on score_user_good. Although we cannot confirm the independence of each observation, we feel that it is most appropriate to apply logistic regression because, as shown earlier, score_user behaves like a binary variable.
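A sketch of the full logistic fit with statsmodels is shown below; it assumes the `movies` data frame from the earlier sketches and lists the predictors that appear in the summary table above.

```python
import statsmodels.api as sm

predictors = ["log_year", "log_budget", "log_gross_wor", "log_mins",
              "score_meta", "ip", "oscar_lead", "director_one",
              "oscar_director", "rating_pg", "rating_pg_13", "rating_r",
              "genre_action_adv", "genre_animation", "genre_bio",
              "genre_comedy", "genre_comedy_drama", "genre_drama",
              "genre_horror", "genre_romance", "genre_fantasy_sci"]

X_full = sm.add_constant(movies[predictors])
full_logit = sm.Logit(movies["score_user_good"], X_full).fit()
print(full_logit.summary())
```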
With that said, the first pair of tables above shows the full model summary along with VIF measures for each covariate (NOTE: VIF measures how correlated a given covariate is with the other covariates, where a value greater than 5 indicates high correlation). As shown in the VIF table, the full model has covariates with extremely large VIF values, which indicates that multicollinearity is present. With multicollinearity, we are more uncertain about the true effect a particular covariate has on the response, which may explain why some of our standard errors are extremely large. We remedied this by removing the covariate with the highest VIF, one at a time, until all remaining covariates had VIF values below 5, which left us with the reduced model summarized above.
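The sketch below shows the idea of that procedure: compute a VIF per covariate, drop the worst offender, and repeat until everything is below 5. It reuses `movies` and `predictors` from the previous sketch and will not necessarily reproduce our exact reduced model.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(design):
    """VIF of every column in a design matrix, largest first."""
    vifs = [variance_inflation_factor(design.values, i)
            for i in range(design.shape[1])]
    return pd.Series(vifs, index=design.columns).sort_values(ascending=False)

# Drop the covariate with the largest VIF, one at a time, until all are below 5.
reduced_X = movies[predictors].copy()
while vif_table(reduced_X).iloc[0] >= 5:
    reduced_X = reduced_X.drop(columns=vif_table(reduced_X).index[0])

reduced_logit = sm.Logit(movies["score_user_good"],
                         sm.add_constant(reduced_X)).fit()
print(reduced_logit.summary())
```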
Based on our reduced model summary, ip, oscar_lead, oscar_director, rating_pg, rating_r, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, and genre_horror are the significant factors that impact whether or not a movie gets a good user score. Specifically, holding all other variables constant, the estimated difference in the log-odds of getting a good user score is: 3.44 between movies that are part of an established IP vs. those that are not; 0.93 between movies with an Oscar-winning lead vs. those without one; 1.63 between movies with an Oscar-winning director vs. those without one; 0.66 between PG vs. non-PG movies; 1.66 between R vs. non-R movies; 0.86 between animation vs. non-animation movies; 3.43 between biography vs. non-biography movies; -1.10 between comedy vs. non-comedy movies; 0.87 between comedy-drama vs. non-comedy-drama movies; 1.37 between drama vs. non-drama movies; and -1.47 between horror vs. non-horror movies.
In other words, it is more likely for a movie to get a good user score when it is part of an established IP, PG, R, animation, biography, comedy-drama, drama, Oscar-led, or Oscar-directed. On the other hand, it is less likely for a movie to get a good user score when it is comedy or horror.
However, it is worth noting that some of the coefficient p-values in our reduced model may be "exaggerated", that is, considerably smaller than they should be, in part because assumptions such as the independence of observations could not be fully verified. As such, the reported effects on the likelihood of getting a good user score may appear stronger than they truly are.
Although some of our reported effects may be "exaggerated", many of the general trends that we are seeing are quite sensible.
For instance, it is not peculiar to see IP play a significant role in increasing the likelihood of getting a good user score. One possible explanation is that movies part of an established IP like Spider-Man or Batman can bring back warm, nostalgic memories. This may leave casual viewers feeling so joyous and sentimental that they feel compelled to write a good review.
In addition to this, it is not surprising that Oscar-winning leads and directors also increase the chance of users giving good scores. This could be explained by the simple notion that these acclaimed individuals possess the acting and directing skills needed to deliver an exciting, enthralling movie experience. Such captivation is bound to leave audience members more than satisfied enough to give a good rating.
It is also not far-fetched to see genres like biography and drama boost a movie's user score. This is likely connected to how biography and drama are serious, grounded genres, which pushes writers to tell deep, compelling stories that engage movie-goers. This may result in viewers developing a profound connection with a movie and thus giving it a good score. A similar explanation could be offered for R-rated movies, but in terms of serious ratings rather than serious genres.
Animation also appears to boost user score. This could be because animated movies are generally geared towards a younger audience, striving to teach children meaningful life lessons. Many parents likely pick up on these messages and may feel obligated to post good reviews to let other parents know that a particular movie is worth watching with their children. A similar explanation could be offered for PG movies.
While these factors are shown to increase the chance of getting a good user score, comedy and horror movies are shown to decrease it. One possible explanation is that comedy and horror movies are generally geared towards teenagers, who often go to the movies to "turn off their brains". This likely leads many producers to spend money on "cool fight scenes" rather than on talented writers. Without people to write engaging stories, more mature viewers are bound to feel disappointed with their movie experience and leave a poor score.
Before applying any methodology, it is worth noting that the scatterplots show weak or no correlation between log_gross_wor and the other numeric variables, with one exception: the scatterplot of log_budget against log_gross_wor shows a clear positive correlation. In context, this suggests that movies with higher budgets tend to gross more worldwide, while factors like year, mins, and score_meta will not necessarily increase worldwide gross.
For the boxplots of score_user_good, ip, oscar_lead, director_one, and oscar_director, only the boxplots for oscar_lead and director_one have similar medians and boxes that overlap across the two groups, which indicates that having an Oscar-winning lead or having only one director might not make a difference to a movie's worldwide gross. For the other variables, the medians and box lengths differ between groups, which indicates that having a good user score, being an IP movie, or having an Oscar-winning director tends to make a noticeable difference to a movie's worldwide gross.
For the boxplots of different ratings, it is worth pointing out that all boxplots have roughly the same center and shape, which suggests that ratings might not influence a movie's worldwide gross.
For the boxplots of different genres, it is worth pointing out that all boxplots have roughly the same center and shape, which suggests that genre might not influence a movie's worldwide gross.
As of now, we suspect that variables such as log_budget, score_user_good, ip, and oscar_director may be significant factors that impact a movie's worldwide gross. However, we put this to the test by first fitting a full linear regression model on log_gross_wor and checking for homoskedasticity, normality, and influential points.
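A sketch of the full linear fit is below, using the formula interface of statsmodels and the same assumed `movies` data frame.

```python
import statsmodels.formula.api as smf

formula = ("log_gross_wor ~ log_year + log_budget + log_mins + score_meta"
           " + score_user_good + ip + oscar_lead + director_one"
           " + oscar_director + rating_pg + rating_pg_13 + rating_r"
           " + genre_action_adv + genre_animation + genre_bio + genre_comedy"
           " + genre_comedy_drama + genre_drama + genre_horror"
           " + genre_romance + genre_fantasy_sci")

full_ols = smf.ols(formula, data=movies).fit()
print(full_ols.summary())
```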
The fitted versus residuals plot shows that the spread of the residuals decreases as the fitted values change, producing a funnel shape. This tells us the variance is not constant, so we conclude that there is heteroskedasticity. The plot also has dashed lines at -3 and 3, with the regions below -3 and above 3 shaded in; any point in these shaded regions is an outlier. From this, we can see there are several outliers.
The second plot is a normal Q-Q plot, which tells us about the distribution of the residuals. If the errors were normally distributed, we would expect almost all residuals to align with the straight blue line. In this case, however, the distribution is not normal but slightly left-skewed.
In the third plot, the leverage versus residuals plot, there do not appear to be any influential points. Although there is a considerable number of outliers, none of them appear to have extremely large leverage.
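The three diagnostic plots described above can be produced roughly as follows, continuing from the `full_ols` fit in the previous sketch.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

influence = full_ols.get_influence()
std_resid = influence.resid_studentized_internal   # standardized residuals
leverage = influence.hat_matrix_diag

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. fitted values vs standardized residuals, with the +/-3 outlier bands
axes[0].scatter(full_ols.fittedvalues, std_resid, s=10)
axes[0].axhline(3, linestyle="--")
axes[0].axhline(-3, linestyle="--")
axes[0].set(xlabel="fitted values", ylabel="standardized residuals")

# 2. normal Q-Q plot of the residuals
sm.qqplot(std_resid, line="45", ax=axes[1])

# 3. leverage vs standardized residuals to spot influential points
axes[2].scatter(leverage, std_resid, s=10)
axes[2].set(xlabel="leverage", ylabel="standardized residuals")

plt.tight_layout()
plt.show()
```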
 | coef | std err | P>\|t\| |
---|---|---|---|
Intercept | -158.300400 | 36.108000 | 0.000000 |
log_year | 30.554700 | 6.866000 | 0.000000 |
log_budget | 0.632400 | 0.039000 | 0.000000 |
log_mins | 0.657700 | 0.282000 | 0.020000 |
score_meta | 0.006800 | 0.003000 | 0.018000 |
score_user_good | 0.922500 | 0.156000 | 0.000000 |
ip | 0.591900 | 0.163000 | 0.000000 |
oscar_lead | -0.189300 | 0.118000 | 0.110000 |
director_one | -0.309800 | 0.146000 | 0.034000 |
oscar_director | -0.020200 | 0.159000 | 0.899000 |
rating_pg | -52.443400 | 12.014000 | 0.000000 |
rating_pg_13 | -52.736600 | 12.047000 | 0.000000 |
rating_r | -53.120400 | 12.048000 | 0.000000 |
genre_action_adv | -17.615400 | 4.003000 | 0.000000 |
genre_animation | -17.731800 | 4.045000 | 0.000000 |
genre_bio | -17.716300 | 4.015000 | 0.000000 |
genre_comedy | -17.499800 | 4.006000 | 0.000000 |
genre_comedy_drama | -17.881700 | 4.019000 | 0.000000 |
genre_drama | -17.594300 | 4.011000 | 0.000000 |
genre_horror | -17.129400 | 4.011000 | 0.000000 |
genre_romance | -17.651800 | 4.011000 | 0.000000 |
genre_fantasy_sci | -17.479900 | 4.003000 | 0.000000 |
Observation | abs_res |
---|---|
67 | 7.182360 |
900 | 5.495361 |
517 | 4.384742 |
762 | 4.306547 |
691 | 3.986845 |
618 | 3.793118 |
882 | 3.767409 |
358 | 3.601483 |
860 | 3.559064 |
901 | 3.445995 |
738 | 3.066468 |
701 | 3.024348 |
 | stat | conclude |
---|---|---|
error_corr_dw | 1.865000 | NOT CORRELATED |
r_sq_adj | 0.520000 | MODEL EXPLAIN AT LEAST HALF VAR |
The third chart shows that the adjusted R-squared is 0.52, which means 52% of the variability observed in the target variable is explained by the regression model. We also applied the Durbin-Watson test for autocorrelation in the residuals. The Durbin-Watson statistic ranges from 0 to 4, with values near 2 indicating no autocorrelation; our statistic is 1.87, which indicates there is no autocorrelation in the errors.
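Both numbers in that table come directly from the fitted model; a minimal sketch, again continuing from `full_ols`:

```python
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(full_ols.resid)   # values near 2 indicate no autocorrelation
print(f"Durbin-Watson: {dw:.3f}, adjusted R^2: {full_ols.rsquared_adj:.3f}")
```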
After dropping outliers such as observations 67, 900, 517, 762, 691, 618, 882, 358, 860, 901, 701, 260, 659, 496, 738, 830, 718, and 838, the fitted versus residuals plot shows the spread of the residuals remaining fairly constant as the fitted values change. This suggests that homoscedasticity may now be upheld.
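One simple way to carry out this trimming is sketched below: flag rows whose standardized residual falls outside [-3, 3], drop them, and refit the same formula. Our actual selection also removed a few additional borderline points, so this sketch will not match our list exactly.

```python
import numpy as np
import statsmodels.formula.api as smf

# Standardized residuals from the full fit (see the diagnostics sketch above).
std_resid = full_ols.get_influence().resid_studentized_internal

# Keep only the observations whose residual lies inside [-3, 3], then refit.
movies_trimmed = movies.loc[np.abs(std_resid) <= 3]
trimmed_ols = smf.ols(formula, data=movies_trimmed).fit()
print(trimmed_ols.summary())
```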
After dropping the aforementioned observations, we can see almost all residuals align with the straight blue line, which indicates that the errors may be normally distributed.
After dropping the aforementioned observations, there still does not appear to be any influential points that may potentially sway our slope estimates.
 | coef | std err | P>\|t\| |
---|---|---|---|
Intercept | -154.583900 | 30.917000 | 0.000000 |
log_year | 29.827100 | 5.882000 | 0.000000 |
log_budget | 0.617800 | 0.034000 | 0.000000 |
log_mins | 0.753700 | 0.244000 | 0.002000 |
score_meta | 0.006400 | 0.002000 | 0.010000 |
score_user_good | 0.816700 | 0.136000 | 0.000000 |
ip | 0.628900 | 0.138000 | 0.000000 |
oscar_lead | -0.112100 | 0.101000 | 0.269000 |
director_one | -0.218700 | 0.124000 | 0.078000 |
oscar_director | -0.081900 | 0.135000 | 0.544000 |
rating_pg | -51.203600 | 10.286000 | 0.000000 |
rating_pg_13 | -51.559600 | 10.315000 | 0.000000 |
rating_r | -51.820600 | 10.316000 | 0.000000 |
genre_action_adv | -17.208600 | 3.428000 | 0.000000 |
genre_animation | -17.299200 | 3.463000 | 0.000000 |
genre_bio | -17.341300 | 3.437000 | 0.000000 |
genre_comedy | -17.169100 | 3.431000 | 0.000000 |
genre_comedy_drama | -17.391300 | 3.442000 | 0.000000 |
genre_drama | -17.188400 | 3.434000 | 0.000000 |
genre_horror | -16.717000 | 3.434000 | 0.000000 |
genre_romance | -17.129900 | 3.434000 | 0.000000 |
genre_fantasy_sci | -17.139000 | 3.428000 | 0.000000 |
(After dropping these observations, no rows with an absolute standardized residual above 3 remain.)
 | stat | conclude |
---|---|---|
error_corr_dw | 1.822000 | NOT CORRELATED |
r_sq_adj | 0.576000 | MODEL EXPLAIN AT LEAST HALF VAR |
The third chart shows that the adjusted R-squared is now 0.576, which means 57.6% of the variability observed in the target variable is explained by the regression model, 5.6 percentage points higher than before we dropped the outliers. The Durbin-Watson statistic is similar at 1.822, which is also close to 2 and indicates there is no autocorrelation in the errors.
 | coef | std err | t | P>\|t\| |
---|---|---|---|---|
Intercept | -154.583900 | 30.917000 | -5.000000 | 0.000000 |
log_year | 29.827100 | 5.882000 | 5.071000 | 0.000000 |
log_budget | 0.617800 | 0.034000 | 18.072000 | 0.000000 |
log_mins | 0.753700 | 0.244000 | 3.092000 | 0.002000 |
score_meta | 0.006400 | 0.002000 | 2.578000 | 0.010000 |
score_user_good | 0.816700 | 0.136000 | 6.016000 | 0.000000 |
ip | 0.628900 | 0.138000 | 4.552000 | 0.000000 |
oscar_lead | -0.112100 | 0.101000 | -1.106000 | 0.269000 |
director_one | -0.218700 | 0.124000 | -1.763000 | 0.078000 |
oscar_director | -0.081900 | 0.135000 | -0.607000 | 0.544000 |
rating_pg | -51.203600 | 10.286000 | -4.978000 | 0.000000 |
rating_pg_13 | -51.559600 | 10.315000 | -4.998000 | 0.000000 |
rating_r | -51.820600 | 10.316000 | -5.023000 | 0.000000 |
genre_action_adv | -17.208600 | 3.428000 | -5.020000 | 0.000000 |
genre_animation | -17.299200 | 3.463000 | -4.995000 | 0.000000 |
genre_bio | -17.341300 | 3.437000 | -5.045000 | 0.000000 |
genre_comedy | -17.169100 | 3.431000 | -5.004000 | 0.000000 |
genre_comedy_drama | -17.391300 | 3.442000 | -5.052000 | 0.000000 |
genre_drama | -17.188400 | 3.434000 | -5.005000 | 0.000000 |
genre_horror | -16.717000 | 3.434000 | -4.868000 | 0.000000 |
genre_romance | -17.129900 | 3.434000 | -4.988000 | 0.000000 |
genre_fantasy_sci | -17.139000 | 3.428000 | -5.000000 | 0.000000 |
 | VIF |
---|---|
rating_pg_13 | inf |
rating_r | inf |
genre_romance | inf |
genre_horror | inf |
genre_drama | inf |
genre_comedy_drama | inf |
genre_comedy | inf |
genre_bio | inf |
genre_animation | inf |
genre_action_adv | inf |
genre_fantasy_sci | inf |
rating_pg | inf |
score_user_good | 5.235725 |
score_meta | 4.299373 |
log_mins | 2.535535 |
log_budget | 1.924279 |
log_year | 1.420812 |
ip | 1.313645 |
director_one | 1.158891 |
oscar_director | 1.094042 |
oscar_lead | 1.081227 |
 | coef | std err | t | P>\|t\| |
---|---|---|---|---|
Intercept | 18.137500 | 0.182000 | 99.638000 | 0.000000 |
score_user_good | 0.860500 | 0.096000 | 9.002000 | 0.000000 |
ip | 1.332500 | 0.175000 | 7.617000 | 0.000000 |
oscar_lead | 0.108700 | 0.130000 | 0.834000 | 0.405000 |
director_one | -0.108200 | 0.160000 | -0.675000 | 0.500000 |
oscar_director | 0.360700 | 0.171000 | 2.104000 | 0.036000 |
rating_pg | -0.078800 | 0.123000 | -0.642000 | 0.521000 |
rating_r | -0.551200 | 0.094000 | -5.863000 | 0.000000 |
genre_animation | 0.604200 | 0.226000 | 2.675000 | 0.008000 |
genre_bio | -0.318600 | 0.190000 | -1.678000 | 0.094000 |
genre_comedy | -0.492900 | 0.142000 | -3.477000 | 0.001000 |
genre_comedy_drama | -0.686900 | 0.174000 | -3.943000 | 0.000000 |
genre_drama | -0.388400 | 0.135000 | -2.879000 | 0.004000 |
genre_horror | -0.352900 | 0.140000 | -2.519000 | 0.012000 |
genre_romance | -0.436500 | 0.140000 | -3.121000 | 0.002000 |
genre_fantasy_sci | -0.222200 | 0.205000 | -1.085000 | 0.278000 |
 | VIF |
---|---|
director_one | 4.402215 |
score_user_good | 2.798125 |
rating_r | 2.486503 |
rating_pg | 1.951952 |
genre_drama | 1.845546 |
genre_horror | 1.616790 |
genre_comedy | 1.574226 |
genre_romance | 1.480937 |
genre_animation | 1.435119 |
genre_bio | 1.392133 |
genre_comedy_drama | 1.327843 |
ip | 1.298266 |
oscar_lead | 1.184218 |
genre_fantasy_sci | 1.157565 |
oscar_director | 1.112706 |
Now, we initially fit a full model without dropping any predictor variables; however, even after removing the problematic observations, we see that rating_pg_13 has the highest VIF, meaning that rating_pg_13 is highly correlated with at least one other predictor in the model. This is an issue because multicollinearity makes it difficult to determine the true effect a particular covariate has on the response, which would explain why some of our standard errors are considerably large.
One way of remedying this issue is to drop the variables with the largest VIF values. In the end, our final reduced model includes score_user_good, ip, oscar_lead, director_one, oscar_director, rating_pg, rating_r, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, genre_horror, genre_romance, and genre_fantasy_sci.
We can say that, holding all other variables constant, the approximate difference in the log of world gross is: 0.86 between movies with a good user score vs. those with a bad user score; 1.33 between movies that are part of an established IP vs. those that are not; 0.36 between movies with an Oscar-winning director vs. those without one; -0.55 between R vs. non-R movies; 0.60 between animation vs. non-animation movies; -0.49 between comedy vs. non-comedy movies; -0.69 between comedy-drama vs. non-comedy-drama movies; -0.39 between drama vs. non-drama movies; -0.35 between horror vs. non-horror movies; and -0.44 between romance vs. non-romance movies.
Simply put, a movie's world gross tends to be larger when it has a good user score, is part of an established IP, has an Oscar-winning director, or falls under the animation genre. On the other hand, a movie's world gross tends to be smaller when it is rated R or falls under the comedy, comedy-drama, drama, horror, or romance genres.
However, it is important to note that some of the coefficient p-values in our reduced model may be "exaggerated" and smaller than they should be. As a result, some of the aforementioned effects on the log of world gross may appear stronger than they truly are.
Based on these results, it appears that movies with good user ratings tend to have a higher worldwide gross than those with poor ratings. This is likely connected to the simple notion that a higher rating generally indicates better quality, whether because of the plot, genre, or execution of the movie. Potential viewers look at user ratings to choose which movie to watch, and low user ratings often turn people away from a movie, which decreases its worldwide gross. High user ratings have the opposite effect, encouraging people to spend money on a movie.
Movies under a popular franchise IP tend to have a higher worldwide gross than those that aren’t. One possible explanation is that movies with brand recognition are well-known and well-received. These movies already have a large support base. In comparison, standalone movies have to start from the bottom. There is no pre-existing plot nor are there any dedicated fans. Therefore, it makes sense that movies under a popular franchise, such as Star Wars or James Bond, tend to earn more than those that aren’t.
It also appears that movies with an Oscar director earn more than those that don’t. This is likely due to the fact that directors who earn an Oscar know how to direct a captivating movie. Not everyone can earn an Oscar, so those that do are incredibly talented and skilled in their profession. Therefore, we can assume that Oscar directors generally create better movies, which leads to higher viewer traction and higher worldwide gross.
On the other hand, R rated movies seem to generate less worldwide gross, which may be because the rating restricts the audience. From the very start, fewer people can view the movie. Furthermore, even with people who can watch the movie, not everyone would like to spend more money in order to watch it again.
While R-rated movies generate less income because of the restricted audience, animated movies seem to generate more worldwide gross than the other genres. Movies in the animation genre are generally family-friendly, which means anyone can watch them. Furthermore, animation is often well received by most people, possibly because the style is more appealing than that of movies with real-life actors.
Meanwhile, movies that fall under the comedy, comedy-drama, drama, horror, and romance genres all generate less worldwide gross, likely because movies under these genres do not attract as many viewers. Genres like comedy, comedy-drama, and drama are similar to each other in that they all have a certain type of exaggeration involved in the plot, whether for conflict or laughs. In addition to this, horror is not for everyone, since it involves gore, jump scares, eerie music, and similar elements. Lastly, not everyone would like the overly sentimental plotlines in romance movies. These factors all decrease the number of potential viewers, and therefore the worldwide gross for these movies.
In the end, we found through logistic regression that there are 11 significant factors that impact whether or not a movie gets a good user score: ip, oscar_lead, oscar_director, rating_pg, rating_r, genre_animation, genre_bio, genre_comedy, genre_comedy_drama, genre_drama, and genre_horror. Nine of these tend to increase the chance of a movie getting a good user score: being part of an established IP, a PG rating, an R rating, animation, biography, comedy-drama, drama, an Oscar-winning lead, and an Oscar-winning director. The remaining two, comedy and horror, tend to decrease that chance.
Furthermore, we found through linear regression that movies that have good user scores, belong to a popular franchise, have an Oscar-winning director, or are animated have a higher worldwide gross than those that do not. We also found that R-rated movies and genres other than animation, such as comedy, comedy-drama, drama, horror, and romance, generally earn a lower worldwide gross.
With that said, it is worth noting there were several technical difficulties that we faced throughout our analysis.
Before fully cleaning the dataset, there was only one genre column, and most movies listed multiple genres. To make our analysis more concise, we went through each movie and its genres one by one. For example, we first looked at all the movies that listed horror among their genres and then created a new genre list that reduces each of those movies to horror alone; a movie listing horror, action, and thriller would be simplified down to just horror. This process was very subjective, since we had to decide whether each movie fit one genre more than the others, and we also decided what those genre groups were. Because of this inherent subjectivity, some movies may be assigned genres that other people would disagree with.
After standardization, we expected to be able to merge our IMDB and Numbers data perfectly, but we found that some movie names still differed. For example, there would be two adjacent rows for the same movie instead of one because of a mismatch in the title string, such as "10 000 B C (2008)" versus "10 000 BC (2008)", so we looked up each such row and made the titles consistent so the data could be merged on the correct movie title. Some movie years also differed, such as "TEETH (2008)" versus "TEETH (2007)", so we replaced them with the correct year manually.
On the Oscars website, most ceremonies were listed under a single year; however, some were listed under two years, such as 1934/1935. To remedy this, we selected the first year shown whenever a ceremony was listed under two years. This way, each ceremony corresponds to one unique year, which we believed was appropriate since the Oscars are only held once a year. Despite our best efforts, slight discrepancies may still be present in our final data, because other websites claim, for example, that the 2000 Oscars were held in 2001. Unfortunately, we could not determine why these inconsistencies occur among different sources. In the end, we decided it was best to use the years listed on the Oscars website, since the people managing the site belong to the same organization that holds the ceremonies.
We also extracted information from the Insider website, which listed the top 27 franchises according to movie critics. One inherent issue with this kind of information is determining what the source means by "top". As mentioned earlier, the list was composed by movie critics; however, the site does not make explicit what metrics the critics used. The site mentions that franchises such as the MCU and Star Wars have dominated the box office while others have earned more critical acclaim, but we could not determine whether variables such as world gross and movie score were ones the critics specifically considered.