All the Python scripts for this project can be found here
A Tableau storyboard with the full analysis is available here
An interactive dashboard is available here
Netflix is one of the world's most popular streaming platforms, with more than 8,000 titles, split between movies and TV shows, and over 200 million users worldwide.
The goal of the analysis is to help Netflix determine which factors contribute to the success of a movie or TV show.
The dataset for this project is open-source and can be downloaded here. It was gathered from JustWatch in March 2023 and covers content available in the United States.
The dataset had a lot of missing information, so handling the incomplete data was fairly challenging. I decided to drop a potentially valuable column (age certification) because it had too many missing values (2,743).
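The missing-value check described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual script: the sample rows, the column names, and the 40% threshold are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample rows; column names are assumptions, not the real schema.
SAMPLE = """title,type,age_certification,imdb_score
Title A,MOVIE,,7.1
Title B,SHOW,TV-MA,8.0
Title C,MOVIE,,
Title D,MOVIE,PG-13,6.4
"""


def count_missing(rows):
    """Return {column: number of empty cells} for a list of dict rows."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            is_empty = val is None or val.strip() == ""
            counts[col] = counts.get(col, 0) + (1 if is_empty else 0)
    return counts


def drop_sparse_columns(rows, max_missing_share):
    """Drop every column whose share of empty cells exceeds the threshold."""
    counts = count_missing(rows)
    keep = [c for c in rows[0] if counts[c] / len(rows) <= max_missing_share]
    return [{c: row[c] for c in keep} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing = count_missing(rows)
# The 0.4 threshold is an arbitrary choice for this example.
cleaned = drop_sparse_columns(rows, max_missing_share=0.4)

print(missing)
print(list(cleaned[0]))
```

Counting the gaps per column first makes the trade-off explicit: a column is dropped only when its share of missing cells is too high to fill in sensibly.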
First, I created a heatmap to better understand the correlations among all numerical variables. As we can see, most pairs show a very weak correlation, a negative correlation, or no correlation at all. The only positive correlation seems to be between the IMDb score and the TMDb score.
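The numbers behind such a heatmap are pairwise Pearson correlation coefficients. The sketch below shows how they could be computed with the standard library alone. The column names and values are made up for illustration and are not taken from the dataset, and Pearson correlation is assumed because the source does not state which coefficient was used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Invented values, for illustration only.
columns = {
    "imdb_score": [6.0, 7.0, 8.0, 5.0, 9.0],
    "tmdb_score": [6.2, 6.9, 7.8, 5.5, 8.6],
    "runtime": [90, 45, 120, 30, 60],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Each cell of a heatmap corresponds to one entry of this matrix: values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.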
One of the business questions to answer is: what are the most popular genres on Netflix?
According to the visualization above, the most popular Netflix genres are:
Other pertinent business questions are which countries produce the highest-rated Netflix titles and which produce the most content.
As we can see from the visualization above, the countries that produce the most popular titles are the United States, Mexico, Colombia, the UK, Norway, Italy, Poland, Russia, Korea, Japan, and New Zealand.
Instead, the most popular movies and TV shows are made in the United States and the United Kingdom.
Finally, the countries that produce the most content are:
I performed linear regression to test my hypotheses:
I therefore examined two relationships: TMDb popularity against TMDb score, and IMDb votes against IMDb score.
In both cases, the correlation value is almost zero, indicating a very weak relationship between the variables.
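To make "almost zero" concrete, here is a minimal sketch of a simple least-squares fit that also reports the coefficient of determination (R²). It is not the project's regression code. The two toy datasets are invented: one has a perfect linear relationship, and the other has none, which is the situation the results above resemble.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x, plus R^2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Invented data: the predictor could stand for votes, the target for a score.
predictor = [1, 2, 3, 4]
strong_target = [3, 5, 7, 9]  # lies exactly on a line
weak_target = [5, 7, 7, 5]    # no linear trend at all

strong = fit_line(predictor, strong_target)
weak = fit_line(predictor, weak_target)

print("strong:", strong)
print("weak:  ", weak)
```

When R² is close to zero, the fitted line explains almost none of the variation in the target, so the predictor on its own says little about the score.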
In all the graphs, the pink cluster (cluster 0) outperformed all the others. An interesting pattern emerges: the most popular titles are the most recent ones, released after 2010.
The profile for each cluster is shown below:
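As a rough illustration of what such a profile contains, the sketch below computes the size and the per-feature mean of each cluster. It does not reproduce the clusters from this analysis: the labels, feature names, and values are invented, and the clustering step itself is taken as given, since the algorithm used is not named here.

```python
from collections import defaultdict


def cluster_profile(rows, label_key, feature_keys):
    """Return {label: {"count": n, "mean_<feature>": value, ...}}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    profile = {}
    for label, members in sorted(groups.items()):
        summary = {"count": len(members)}
        for key in feature_keys:
            summary[f"mean_{key}"] = sum(m[key] for m in members) / len(members)
        profile[label] = summary
    return profile


# Invented rows that already carry a cluster label.
rows = [
    {"cluster": 0, "release_year": 2018, "imdb_score": 7.5},
    {"cluster": 0, "release_year": 2020, "imdb_score": 8.5},
    {"cluster": 1, "release_year": 1995, "imdb_score": 6.0},
    {"cluster": 1, "release_year": 2005, "imdb_score": 7.0},
]

profile = cluster_profile(rows, "cluster", ["release_year", "imdb_score"])
for label, summary in profile.items():
    print(label, summary)
```

Comparing the means across clusters is what supports a statement such as one cluster being both more recent and higher rated than the others.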
The linear regression revealed that the relationships between ratings, popularity, and the number of votes are weak, suggesting that the factors responsible for the success of a movie or TV show should be investigated further.
Because the dataset had many missing values, a more complete one would be worth finding. Other factors, such as age certification, should then be analyzed as well.
In conclusion, I would recommend that Netflix prioritize creating new content over promoting older titles, take into account the most popular genres, such as Sci-Fi, Fantasy, and Action, and support new productions from the countries that create the most popular titles.