Movie Duration Trend Analysis (using Netflix dataset)

Aron Akhmad
4 min read · Nov 3, 2022

Movies are a form of entertainment that many people enjoy in their spare time. Over the past few decades, movies have become more widely consumed as access to them has become much easier. As technology has improved, the quality of movies has grown as well. However, the production process has also become much more complex. Has the more demanding movie-making process affected movie durations over time? Let’s do some simple research on it!

First, load the Netflix CSV dataset and store it in a DataFrame called netflix_df. Print the first five rows of netflix_df using the head() function to get a glimpse of it.

# Import pandas under its usual alias
import pandas as pd
# Read in the CSV as a DataFrame
netflix_df = pd.read_csv("datasets/netflix_data.csv")
# Print the first five rows of the DataFrame
netflix_df.head()

As you can see, the dataset has a lot of columns. Since we are only interested in movies, we first keep the rows whose type is ‘Movie’, and then narrow the columns to those needed for the analysis: ‘title’, ‘country’, ‘genre’, ‘release_year’, and ‘duration’.

# Subset the DataFrame for type "Movie"
netflix_df_movies_only = netflix_df[netflix_df['type'] == 'Movie']
# Select only the columns of interest
netflix_movies_col_subset = netflix_df_movies_only[['title', 'country', 'genre', 'release_year', 'duration']]
# Print the first five rows of the new DataFrame
netflix_movies_col_subset.head()

To get a look at the spread of the data, we plot each movie’s duration against its release year. From the plot, we can see that more movies were released in recent decades and that their durations have become more varied.

# Import the plotting library under its usual alias
import matplotlib.pyplot as plt
# Create a figure and increase the figure size
fig = plt.figure(figsize=(12,8))
# Create a scatter plot of duration versus year
plt.scatter(netflix_movies_col_subset['release_year'], netflix_movies_col_subset['duration'])
# Create a title
plt.title("Movie Duration by Year of Release")
# Show the plot
plt.show()

From the previous plot, we can see that recent decades contain many more movies, including far more short ones. Short titles may therefore be overrepresented in recent years and skew the picture. For a fairer analysis, we need to dig deeper and see which genres the movies shorter than 60 minutes fall into.

# Filter for durations shorter than 60 minutes
short_movies = netflix_movies_col_subset[netflix_movies_col_subset['duration'] < 60]
# Print the first 20 rows of short_movies
short_movies.head(20)
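
As a rough complement to reading the first 20 rows, the genres of the short titles could also be tallied. The sketch below runs on a tiny made-up DataFrame (sample_movies, short_sample, and short_genre_counts are hypothetical names, and the rows are not taken from the Netflix dataset), so it only illustrates the idea.

```python
import pandas as pd

# Hypothetical stand-in for netflix_movies_col_subset (made-up rows, not real data)
sample_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F"],
    "genre": ["Children", "Stand-Up", "Dramas", "Documentaries", "Children", "Action"],
    "duration": [45, 58, 110, 40, 30, 125],
})

# Keep only titles shorter than 60 minutes
short_sample = sample_movies[sample_movies["duration"] < 60]

# Count how many short titles fall into each genre, most frequent first
short_genre_counts = short_sample["genre"].value_counts()
print(short_genre_counts)
```

On the real data, the same value_counts() call applied to short_movies['genre'] should give a quick summary of which genres dominate the short titles.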

From the table shown above, we can see that movies shorter than 60 minutes fall into genres such as “Children”, “Stand-Up”, and “Documentaries”. To set them apart from the rest, we mark them with different colors. In the code below, we color “Children” movies red, “Stand-Up” movies green, and “Documentaries” movies blue, while all other movies are black.

# Define an empty list
colors = []
# Iterate over rows of netflix_movies_col_subset
for lab, row in netflix_movies_col_subset.iterrows():
    if row['genre'] == 'Children':
        colors.append('red')
    elif row['genre'] == 'Documentaries':
        colors.append('blue')
    elif row['genre'] == 'Stand-Up':
        colors.append('green')
    else:
        colors.append('black')
# Inspect the first 10 values in your list
print(colors[:10])
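
The same coloring could probably be done without an explicit loop by looking each genre up in a dictionary. This is only a sketch of that alternative on a small made-up DataFrame; sample_movies, genre_colors, and sample_colors are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical stand-in for netflix_movies_col_subset (made-up rows)
sample_movies = pd.DataFrame({
    "genre": ["Children", "Dramas", "Stand-Up", "Documentaries", "Action"],
})

# Genre-to-color lookup matching the scheme described above
genre_colors = {"Children": "red", "Documentaries": "blue", "Stand-Up": "green"}

# map() yields NaN for genres missing from the lookup; fillna() turns those black
sample_colors = sample_movies["genre"].map(genre_colors).fillna("black").tolist()
print(sample_colors)  # ['red', 'black', 'green', 'blue', 'black']
```

The result is a plain list with one color per row, so it can be passed to the scatter plot in the same way as the list built by the loop.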

Since we’ve colored all the movies based on their genre, we can now re-plot our data and see what the new plot is like.

# Set the figure style and initialize a new figure
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(12,8))
# Create a scatter plot of duration versus release_year
plt.scatter(netflix_movies_col_subset['release_year'], netflix_movies_col_subset['duration'], color=colors)
# Create a title and axis labels
plt.title("Movie duration by year of release")
plt.xlabel("Release year")
plt.ylabel("Duration (min)")
# Show the plot
plt.show()

You can see from the previous plot that the short movies in recent decades mostly fall into the “Children”, “Stand-Up”, and “Documentaries” categories. So now we can exclude them and continue the analysis with more representative data to check whether movie durations have been decreasing in recent years.

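# Are we certain that movies are getting shorter?
# Attach the colors as a new column (assign returns a new DataFrame)
netflix_movies_col_subset = netflix_movies_col_subset.assign(colors=colors)
# Keep only the movies outside the three short-format genres
are_movies_getting_shorter = netflix_movies_col_subset[netflix_movies_col_subset['colors'] == 'black']
# Average duration per release year, rounded to two decimals
x = are_movies_getting_shorter.groupby('release_year')['duration'].mean().round(2)
x.plot()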

We have plotted the data, and from the plot we can see that the average movie duration trended upward from 1940 to the early 1960s and has fluctuated since then, including in recent years. Thus, based on this analysis, the hypothesis that movie durations have been decreasing in recent years is not supported: the average duration fluctuates from year to year rather than steadily declining.
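
Because yearly averages are noisy, one way to double-check a conclusion like this is to average durations per decade instead of per year. The sketch below uses made-up numbers (sample_movies, decade, and decade_means are hypothetical, not values from the Netflix dataset) and only shows the mechanics.

```python
import pandas as pd

# Hypothetical data standing in for the filtered movies (made-up values)
sample_movies = pd.DataFrame({
    "release_year": [1995, 1998, 2003, 2007, 2012, 2016, 2019],
    "duration": [100, 110, 95, 105, 120, 90, 100],
})

# Bucket each release year into its decade, e.g. 1998 -> 1990
decade = (sample_movies["release_year"] // 10) * 10

# Average duration per decade
decade_means = sample_movies.groupby(decade)["duration"].mean()
print(decade_means)
```

If the decade averages on the real data move up and down without a steady decline, that would be consistent with the reading of the yearly plot given above.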
