Into the Male Gaze

A Story of Gender Representation in Cinema

Framing our research

Art imitates life and life imitates art - cinema has the power to capture the zeitgeist of an era. In this project we study gender representation in cinema to better understand trends in societal sentiment. Our goal is to assess whether representations differ by gender and whether they have evolved through time. This project is motivated by the fact that the 20th century was a time of dynamic social upheaval and mobility, e.g., in Switzerland, women’s suffrage at the federal level was granted as late as 1971.

Culture and cinema share a dynamic, intertwined relationship. Cinema reflects the values and patterns of the culture that produced it while simultaneously shaping and influencing that culture. In this manner, cinema and culture form a feedback loop of influence. This project is predicated on the idea that character portrayal in cinema serves as a mirror for the society that gave rise to the character. Our character-focused analysis leans heavily on the Stanford NLP library to understand the lexical groups by which characters are represented.

Cinema’s Increasing Impact

Since its conception in the late 19th century, the cinema industry’s relevance and impact have continued to grow. In America alone, three in four people reported going to the cinema just last year… Cinema today has grown into a major global industry, with a market size in the hundreds of billions of USD. People flock to theatres, and now increasingly to streaming services, to consume the latest releases. Cinema has become increasingly common not only as amusement, but also as a means to promote social and political agendas. As the industry grows, the dynamic interplay between cinema and culture continues to reinforce itself, and is likely only to increase in the future.

Figure 1
Figure 2

Around the cinema in 50 countries

The dataset consists mainly of movies from the United States, India, Japan, and Western Europe. While it covers a diverse range of cultures and societies from these regions, it is not a comprehensive representation of the world. Our analysis therefore gives insight into the depiction of society in these specific regions, and its findings on social issues, cultural norms, and values may not carry over to other parts of the world.
Figure 3

How are main character roles distributed around the globe?

We defined main characters as those that appear with the highest frequency in the movie summary. A cursory glance at the number of main characters per gender and region reveals a lopsided distribution. The discrepancy is more pronounced in Eastern Europe (6 to 1, male to female) and Latin America (just over 3 to 1), while still holding in North America (1.7 to 1) and Western Europe (1.6 to 1). Overall, male actors play the main character in roughly 1800 movies and female actors in roughly 1000 (1.8 to 1).
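
The frequency-based definition above can be sketched as follows. This is a minimal illustration rather than the exact pipeline: the function name `main_character`, the whole-word name matching, and the toy summary are all assumptions made for the example, and a fuller version would also need to resolve pronouns and aliases.

```python
import re
from collections import Counter

def main_character(summary, character_names):
    """Return the character whose name is mentioned most often, or None."""
    counts = Counter()
    for name in character_names:
        # Whole-word, case-insensitive match on the character's name.
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, summary, flags=re.IGNORECASE))
    if not counts or max(counts.values()) == 0:
        return None  # no listed character is mentioned at all
    return counts.most_common(1)[0][0]

# Hypothetical toy summary: Alice is mentioned 3 times, Bob twice.
summary = "Alice meets Bob. Alice leaves town, and Bob follows Alice."
print(main_character(summary, ["Alice", "Bob"]))  # prints "Alice"
```

Counting the selected characters by actor gender and region then gives ratios of the kind quoted above.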

Figure 4

Who does what to whom, when and how?

The following questions do not comprise an exhaustive list, but they are fundamental to studying gender representation in movies. They serve to scope, inspire, and guide the analysis.

  • What is the prevalence of male and female characters?
  • Is there a discrepancy in age between male and female characters?
  • How can we differentiate between depictions of male and female characters?
    • How do they act? How do others act upon them?
    • How are they described?
  • Can we extract archetypes / stereotypes across genders through lexical analysis?
  • Do the aforementioned questions show an evolution over time? Do they show discernable differences across geographies?

Data Exploration

We will begin with a few preliminary explorations of our data. There has been a longstanding imbalance in the film industry, with men occupying a larger fraction of representation on screen, often in more distinguished roles. This leads to a simultaneous marginalization and perpetuation of stereotypes via the portrayal of women.

Older Gentlemen, Younger Ladies

In Figure 5 we display the average age at movie release for female and male actors, along with a linear regression of the average age over time. The shaded areas represent the 95% confidence interval of the regression. There is indeed a difference between the average ages of actors and actresses: actors are on average older than actresses.
There is also an apparent trend of increasing age for both actresses and actors. This is unlikely to reflect reality; rather, there is little age data for older actors at the beginning of the 20th century, likely because record keeping was less rigorous then than it is now. The younger actors for whom we do have data are disproportionately represented in these early years, i.e., they lived long and late enough in the century to enter modern databases.
The age gap reflects a common cinema trope in which women are portrayed by actresses far younger than their male counterparts, even when the characters are supposed to be roughly the same age. This trope can reinforce and perpetuate the notion that youth and beauty are qualities of paramount importance for women, and of less importance for men. This theme is consistent with the idea that women have a limited ‘shelf life’ and become less valuable as actresses as they age.
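
As a rough sketch of the trend line, one can average ages per release year and fit an ordinary least squares line through those averages. The helper names `mean_age_by_year` and `fit_line` and the toy records are assumptions for illustration; the 95% confidence band is omitted here, and a plotting library would normally supply it.

```python
from collections import defaultdict

def mean_age_by_year(records):
    """records: iterable of (release_year, age_at_release) pairs.
    Returns {year: mean age at release}."""
    sums = defaultdict(lambda: [0.0, 0])  # year -> [sum of ages, count]
    for year, age in records:
        sums[year][0] += age
        sums[year][1] += 1
    return {year: total / n for year, (total, n) in sums.items()}

def fit_line(points):
    """Ordinary least squares fit y = slope * x + intercept over {x: y}."""
    xs, ys = list(points.keys()), list(points.values())
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical toy data for one gender.
records = [(2000, 30), (2000, 40), (2001, 36), (2002, 37)]
slope, intercept = fit_line(mean_age_by_year(records))
```

Running the fit separately for actresses and actors gives the two lines to compare. Note that each year counts equally in this fit regardless of how many actors it contains, which is one reason sparse early years can distort the slope.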

Figure 5

Fewer female roles

In Figure 6 we depict the proportion of actors by gender through time. Once again we note a disparity in the data, namely that women have been less present in cinema than men over the last century or so. We also note irregularities in the early years (roughly the 1880s-1910s) as well as in the last decade. This is once again due to the tiny samples being drawn from, as seen in Figure 1. The small amounts of data available for these years give rise to wide, irregular outcomes. The underrepresentation of women in cinema significantly limits the types of characters they represent and the stories told. Now that we have confirmed that there are indeed surface-level imbalances in representation, we will explore further.
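
A minimal sketch of the yearly proportion, returned together with the sample size so that unreliable years are easy to spot. The function name `female_share_by_year`, the `"F"`/`"M"` coding, and the toy credits are assumptions for illustration.

```python
from collections import Counter

def female_share_by_year(credits):
    """credits: iterable of (release_year, gender) with gender 'F' or 'M'.
    Returns {year: (female_share, sample_size)}, ordered by year."""
    totals, females = Counter(), Counter()
    for year, gender in credits:
        totals[year] += 1
        if gender == "F":
            females[year] += 1
    return {year: (females[year] / n, n) for year, n in sorted(totals.items())}

# Hypothetical toy credits: a single early record yields an extreme share.
credits = [(1900, "F"), (1990, "M"), (1990, "M"), (1990, "F"), (1990, "M")]
print(female_share_by_year(credits))  # {1900: (1.0, 1), 1990: (0.25, 4)}
```

The 1900 entry shows how a year with one record produces a 100% share, which is the kind of irregularity visible at the edges of the figure.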

Figure 6

Lexical Field Analysis

We begin our analysis by extracting the attributes and patient verbs of each character from the plot summaries. Attributes are the descriptive words associated with a character in the movie plot summaries, while patient verbs are the actions conducted on the character. Each label on the y-axis refers to a group of associated words (example below). In each case the data is aggregated over male and female characters so that we can understand how plot summaries describe each gender and what types of actions are conducted on it. Only the top groups are shown for each gender.

Agent Verbs

In Figure 7 we filter out the groups shared between the top-frequency male and female agent verbs. We are left with gender-exclusive word groups by which we may better differentiate the genders. To reiterate, agent verbs are verbs conducted BY the character. Words have been reduced to their lemma to capture various conjugations.
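
The exclusive filtering can be sketched as a set difference between the two top-k lists. The function name `exclusive_top_groups`, the cutoff `k`, and the toy counts are assumptions for illustration; lemmatization is assumed to have happened upstream.

```python
from collections import Counter

def exclusive_top_groups(male_counts, female_counts, k=20):
    """Take the k most frequent groups per gender, then drop any group
    that appears in both top-k lists. Frequency order is preserved."""
    top_m = [g for g, _ in Counter(male_counts).most_common(k)]
    top_f = [g for g, _ in Counter(female_counts).most_common(k)]
    shared = set(top_m) & set(top_f)
    return ([g for g in top_m if g not in shared],
            [g for g in top_f if g not in shared])

# Hypothetical toy counts of lemmatized agent verbs.
male = {"kill": 10, "order": 8, "tell": 7, "kiss": 1}
female = {"tell": 9, "kiss": 6, "care": 5, "order": 1}
print(exclusive_top_groups(male, female, k=3))
# (['kill', 'order'], ['kiss', 'care'])
```

A group is only removed if it is in both top-k lists, so the choice of `k` controls how strict the notion of "exclusive" is.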

Key takeaways:
  • The male-exclusive list is composed largely of assertive, action-based words such as order, beat, rescue, proceed, push, intend, declare. This conjures the image of men being differentiated by more dominating and active actions.
  • The female-exclusive list slants more towards emotional and relational words such as shock, care, scream, share, urge, pregnant, and kiss. This in turn indicates women taking more passive roles.
Figure 7

Patient Verbs

In Figure 8 we see patient verbs - actions conducted on each gender. Many of the groups are generic and shared across both genders, so we focus on the differences, subtle though they may be. For example, women are the object of kidnapping words far more often than their male counterparts, while men are the object of arrest at a much higher frequency. It is interesting to note that the most frequent groups, for both genders, mostly concern family members or relationships between people. However, we can see that "father" is more often related to a woman than "girlfriend", "girl", "daughter" or even "mother" is to a man. This could indicate that a woman is more often defined by or related to masculine relatives than a man is to feminine relatives.
The same gender-exclusive filtering is done with patient verbs in Figure 9. To reiterate, patient verbs are verbs conducted ON the character. The results provide quite a stark contrast in the treatment of characters on screen. Words have been reduced to their lemma to capture various conjugations.

Key takeaways:
  • Women are the targets of sexual, physical, and emotional abuse, as seen in the groups rape, harass, impal, doom, stun, abuse, sadden.
  • Men are the targets of physical war and violence, as seen in the groups wound, defeat, blow, execute, ambush, and gun.
Figure 8

Figure 9


In Figure 10 we see the attribute groups associated with each gender. Once again, the most frequent groups are relational and shared between the genders. The discrepancies are of more interest: categories that appear in the women's common groups and not the men's include lover, promiscuous, feelings, beautiful, teacher, and body. It is important to note that each of these groups represents a set of words. For example, the ‘body’ group seen for women is composed of words such as thigh, shivering, skin, thumb, feel, touch, breast, foot, muscle, shoulder, bony, stiff, clothed, ankle, crotch, frame, collarbone. These begin to reveal which descriptors are employed for female characters in cinema relative to their male counterparts. Conversely, male groups include boss, angry, and police, once again giving clues about the typecasting that occurs by gender.
Though the results are less striking than the previous exclusive lists (see Figs. 7 and 9), we can still draw some interpretation from Figure 11.
  • Unsurprisingly, the gendered words are the top words for each respective gender. It is however interesting to note that women are much more often described by the gender-specific attributes ‘girl’, ‘women’, and ‘pregnant’ (the top 3, in fact!) than men, for whom we find just ‘boy’.
  • Once again the trope of women being categorically reduced to their procreative, physical, and sexual qualities is depicted through ‘pregnant’, ‘marriage’, ‘lover’, ‘affair’, ‘beautiful’ - all usually relative to some male counterpart.
  • With men we see the emergence of hierarchy, dominance, and work-related subjects in fire, boss, college, angry, assistant.
Figure 10

Figure 11

Differentiating Genders

PCA was conducted to determine whether the word categories are sufficient to differentiate between genders. Figure 12 graphically represents the separation of male and female characters along the first two principal components.
  • Principal component 1 separates male and female characters effectively. PC1 is highly positively correlated with groups such as family, wedding, attractive, domestic work, sexual, feminine, beauty, while being highly negatively correlated with groups including money, crime, masculine, prison, aggression, war, and fighting. These extremes separate our characters into two very distinct sets.
  • Principal component 2 is far less effective (there is more overlap along this dimension). However, it does subtly reveal that the range of word groups associated with men is more diverse than that of women. This is consistent with the theory that representation of women in media is rather limited and that women are cast in a flat set of roles.
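
The projection behind such a plot can be sketched with NumPy's singular value decomposition. The function name `pca_scores`, the four toy word groups, and the toy scores are assumptions for illustration; the actual feature matrix has one row per character and one column per word group.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project the rows of X onto the top principal components.
    Returns (scores, components, explained_variance_ratio)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)              # variance share per component
    scores = Xc @ Vt[:n_components].T            # coordinates in PC space
    return scores, Vt[:n_components], explained[:n_components]

# Hypothetical toy matrix; columns: family, beauty, crime, war.
X = [[0.9, 0.8, 0.1, 0.0],   # female character
     [0.8, 0.9, 0.0, 0.1],   # female character
     [0.1, 0.0, 0.9, 0.8],   # male character
     [0.0, 0.1, 0.8, 0.9]]   # male character
scores, components, explained = pca_scores(X)
```

In this toy case the first two rows land on one side of PC1 and the last two on the other, and the signs of the entries in `components[0]` show which word groups pull in which direction. The sign of a principal component is arbitrary, so only the contrast between the two ends is meaningful.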

Figure 12

How about differentiating through time?

To better understand how the representation of female and male characters has evolved, we divided the movies by release period (before 1960, 1960-2000, after 2000) and grouped characters by period and gender (see Fig. 13). We then used these groups as the samples for PCA. The first principal component, which explains the most variance in the data (>50%), discriminates between female and male characters, and the distance between female and male characters within the same period decreases as we advance in time. In other words, female representations are getting closer to male ones. Interestingly, along this component male characters have not evolved since the 1960s, while female representations are still changing.
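
The grouping step can be sketched as below: each character is assigned to a (period, gender) bucket and the bucket's mean feature vector becomes one PCA sample. The names `period_of` and `group_means` are hypothetical, and placing the year 2000 in the middle period is an assumption, since the stated boundaries leave it ambiguous.

```python
from collections import defaultdict

def period_of(year):
    """Map a release year to one of the three periods (2000 counted as 1960-2000)."""
    if year < 1960:
        return "before 1960"
    if year <= 2000:
        return "1960-2000"
    return "after 2000"

def group_means(characters):
    """characters: iterable of (release_year, gender, feature_vector).
    Returns {(period, gender): mean feature vector}."""
    buckets = defaultdict(list)
    for year, gender, vec in characters:
        buckets[(period_of(year), gender)].append(vec)
    return {key: [sum(col) / len(vecs) for col in zip(*vecs)]
            for key, vecs in buckets.items()}

# Hypothetical toy characters with two word-group features each.
chars = [(1950, "F", [1.0, 0.0]), (1955, "F", [0.0, 1.0]), (2005, "M", [0.2, 0.4])]
print(group_means(chars))
# {('before 1960', 'F'): [0.5, 0.5], ('after 2000', 'M'): [0.2, 0.4]}
```

With three periods and two genders this yields at most six samples, so the resulting components summarize differences between group averages rather than between individual characters.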

Figure 13

And by region?

After exploring the differences in gender representation in movies, both in general and over time, we explored the differences across world regions. The results show that a separation between male and female characters persists. For some regions, such as North America, Western Europe and Oceania, male representations (and likewise female ones) are similar. However, Africa and Western Asia lie very far from the rest. Because fewer movies were released in those regions, the lack of data may be affecting the results. For those regions, we also observe that male and female characters are very far from each other.

Figure 14

Sentiment Analysis

In the following visualization (Fig. 15), we compiled bags of words over the entire movie summaries, split by gender. All words describing women and men were separately tagged into the word groups shown on the vertical axis. The displayed results are normalized to account for the different numbers of male and female characters, and are sorted in descending frequency of the women's results.
  • The bag of words for women has much higher relative extremes than that of men. This indicates that women have a narrower range of associated words and could be interpreted as an indicator of the lack of depth in the descriptors and actions associated with female characters.
  • Conversely, the male distribution is closer to normal, indicating a much broader pool of associated lexical fields and therefore a broader set of roles, attributes, and actions done by and onto them.
  • Men outscore women in the following groups: death, leader, business, violence, kill, crime, driving, aggression, fight, stealing, prison, war, weapon… This corroborates the results of PC1, in which overlapping terms had the strongest explanatory coefficients.
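
The tagging and normalization can be sketched with a toy lexicon. Everything here is an assumption for illustration: `TOY_LEXICON` is a two-group stand-in for the real word groups, and dividing by the number of characters is one plausible reading of the normalization described above.

```python
from collections import Counter

# Toy lexicon for illustration only; the real word groups are far larger.
TOY_LEXICON = {
    "violence": {"kill", "fight", "shoot"},
    "family": {"mother", "father", "daughter"},
}

def group_frequencies(words, n_characters, lexicon=TOY_LEXICON):
    """Count how often each word group is hit by `words`, then divide by the
    number of characters of that gender so the genders are comparable."""
    hits = Counter()
    for word in words:
        for group, members in lexicon.items():
            if word.lower() in members:
                hits[group] += 1
    return {group: hits[group] / n_characters for group in lexicon}

# Hypothetical bag of words gathered for two characters of one gender.
print(group_frequencies(["kill", "mother", "fight", "walk"], n_characters=2))
# {'violence': 1.0, 'family': 0.5}
```

Computing this once per gender and comparing the two dictionaries group by group gives the kind of side-by-side bars shown in the figure.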

Figure 15

Alice and Sherlock

We then generated word clouds based on the same gender-separated bag of words. This time however we visualize the results without discarding low-score groups. This demonstrates the following:
  • The male bag of words (Fig. 17) displays a larger breadth and more uniform distribution relative to women.
  • Implicitly, women have a narrow and skewed distribution, which is evident from the few words and the relatively larger size of a few groups (Fig. 16).
Figure 16

Figure 17


We employ clustering on our word groups in an attempt to extract underlying archetypes naturally; each archetype would be characterized by a set of word groups. Clustering was conducted for varying numbers of clusters; we focus on the 5-cluster case. In the figure below we see the results for each cluster, where a score further from zero (more green) indicates that the feature is more important in defining its respective cluster row.
Figure 18


  • Cluster 0 is composed of high scores in negotiation, white collar job, occupation, law, wealthy, banking, economics, business, work, and government. We will refer to this as the ‘business’ cluster.
  • Cluster 1 has no high scores at all. We will refer to it as the ‘ambiguous’ cluster.
  • Cluster 2 is composed of high scores in children, home, family, royalty, wedding, domestic work, celebration, death. These are largely associated with a happy home, family, and important events. We will refer to it as the ‘Life & Community’ cluster.
  • Cluster 3 is composed of high scores in feminine, love, beauty, childish, appearance, sexual, attractive, affection, sadness, optimism, sympathy, confusion, nervousness, timidity. It is dominated by feminine, aesthetic, and emotional themes. We will refer to it as the ‘feminine’ cluster.
  • Cluster 4 is composed of high scores in weapon, fight, military, war, crime, sadness, prison, power, hate, violence, negative emotion, aggression, pain, politics, and rage. We see it is dominated by themes of aggression, combat, and negative emotions. We will refer to this as the ‘hostile’ cluster.
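
For readers who want to reproduce a cluster table like the one above, here is a minimal k-means sketch in NumPy. The text does not state which clustering algorithm was used, so k-means is an assumption, as are the function name `kmeans`, the farthest-point initialization, and the toy data. Each row of the returned centroids plays the role of one cluster row in the figure.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization.
    Returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)

    def assign(centroids):
        # Index of the nearest centroid for every row of X.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    chosen = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in chosen], axis=0)
        chosen.append(X[np.argmax(d)])
    centroids = np.array(chosen)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its members (keep it if empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return assign(centroids), centroids

# Hypothetical toy data: two well-separated groups of characters.
labels, centroids = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], k=2)
```

If the word-group features are standardized before clustering, centroid entries far from zero mark the groups that characterize a cluster, which matches how the figure is read.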

Clusters Through Time

Now that we have defined the clusters, we want to explore each one to better understand the characters it is composed of. The adjacent figure shows the 5-cluster case decomposed by gender of the actor and decade of film release. We see from the bottom right aggregate figure that roughly 40% of all characters are played by actresses and 60% by male actors. Relative to this baseline, male characters are disproportionately represented in the business, ambiguous, and hostile clusters. Conversely, women make up a larger percentage than the baseline in the feminine and life & community clusters. We hoped to see an evolution of the archetypes through time, but this is largely not the case. There are fluctuations over time but no discernible trends. An exception is once again noted in the early years, where female characters make up 100% of the hostile cluster; this is due to the tiny sample size behind those data points.
Figure 19


Ultimately, our exploration of the CMU Movie Summary Corpus provided mixed results. While we were certainly able to extract differences in the cinema industry's treatment of characters based on gender, they were quite limited in expression. The differences are exemplified by the distinct agent verbs, patient verbs, and attributes. Using these features, we employed a variety of lexical groupings, principal component analysis, and clustering to differentiate male and female characters. What fell somewhat flat was the analysis through time, as our limited definitions of archetype clusters did not behave as dynamically as we had supposed they would. This is not inherently a shortcoming; it simply suggests that cinema still has a way to go before descriptions and roles are distributed equitably between genders. We also observed that the way the clusters are formed affects how much the representation of archetypes varies through time; fine-tuning the clusters may have yielded more archetypes with more change across the decades. We would also have liked to explore lexical tools other than Empath, on which we relied heavily but from which we were unable to produce many distinct gendered clusters. Finally, the data was limited because movie summaries are not as rich as the script, directing, or cinematography, which we believe play a major role in differential treatment by gender.

About CMU Movie Summary Corpus

All data comes from the CMU Movie Summary Corpus Datasets. CMU Movie Summary is an open corpus containing 42,306 movie plot summaries extracted from Wikipedia, as well as metadata from Freebase including revenues, genres, release dates, runtimes, languages, character names, and actor information. It was compiled by the Language Technologies Institute and Machine Learning Department at Carnegie Mellon University. Movie data ranges from 1888 to 2016. The CMU Corpus is publicly available here.