# Blog Archives

## Exploratory data analysis and film form

Following on from my earlier posts on the editing structure of slasher films, this week I have a draft of a paper that combines my early observations (much re-written) with an analysis of the relationship between editing and the narrative structure of *Friday the Thirteenth* (1980).

**Exploratory data analysis and film form: The editing structure of slasher films**

We analyse the dynamic editing structure of four slasher films released between 1978 and 1983 with simple ordinal time series methods. We show the order structure matrix is a useful exploratory data analytical method for revealing the editing structure of motion pictures without requiring *a priori* assumptions about the objectives of a film. Comparing the order structure matrices of the four films, we find slasher films share a common editing pattern comprising multiple editing regimes, with change points between editing patterns occurring with large changes in mood, and localised clusters of shorter and longer takes associated with specific narrative events. The multiple editing regimes create different types of frightening experiences for the viewer, with slower edited passages creating a pervading sense of foreboding and rapid editing linked to the frenzied violence of body horror, while the interaction of these two modes of expression intensifies the emotional experience of watching a slasher film.

The paper can be accessed here: Nick Redfern – The Editing Structure of Slasher Films.

The shot length data for all four films can be accessed as a single Excel file: Nick Redfern – Slasher Films.

Analysing the editing structure of these slasher films is only part of this paper. Another goal was to outline exploratory data analysis as a data-driven approach to understanding film style that avoids a specific problem of existing ways of thinking about film style.

Existing methods of analysing film style make *a priori* assumptions about the functions of style and then provide examples to support this assertion. This runs the risk of begging the question and *circulus in probando*, in which the researcher’s original assumption is used as a basis for selecting the pertinent relations of film style which are then used to justify the basis for making assumptions about the functions of film style. We would like to avoid such logically flawed reasoning whilst also minimising the risk that we will miss pertinent relations that did not initially occur to us. By adopting a data-driven approach we can derive the functions of film style by studying the elements themselves without the need to make any such *a priori* assumptions. Exploratory data analysis (EDA) allows us to do this by forcing us to attend to the data on its own terms.

Although this is a method developed within statistics, EDA can be applied not just to numerical data but to any situation where we need to understand the phenomenon before us. For example, I had not noticed that the number of scenes between hallucinations in *Videodrome* reduces by a constant factor until I sat down and wrote out the narrative structure of the film (see here).

Two very useful references are:

**Behrens JT** 1997 Principles and practices of exploratory data analysis, *Psychological Methods* 2 (2): 131-160.

**Ellison AM** 1993 Exploratory data analysis and graphic display, in SM Scheiner and J Gurevitch (eds.) *Design and Analysis of Ecological Experiments*. New York: Chapman & Hall: 14-45.

In this paper I discuss some relations between editing and the emotional experience of watching slasher films, and below are listed some interesting references that follow on from last week’s collection of papers on neuroscience and the cinema:

**Bradley MM, Codispoti M, Cuthbert BN, and Lang PJ** 2001 Emotion and motivation I: defensive and appetitive reactions in picture processing, *Emotion* 1 (3): 276-298.

**Bradley MM, Lang PJ, and Cuthbert BN** 1993 Emotion, novelty, and the startle reflex: habituation in humans, *Behavioural Neuroscience* 107 (6): 970-980.

**Lang PJ, Bradley MM, and Cuthbert BN** 1998 Emotion, motivation, and anxiety: brain mechanisms and psychophysiology, *Biological Psychiatry* 44 (12): 1248-1263.

**Lang PJ, Davis M, and Öhman A** 2000 Fear and anxiety: animal models and human cognitive psychophysiology, *Journal of Affective Disorders* 61 (3): 137-159.

**Willems RM, Clevis K, and Hagoort P** 2011 Add a picture for suspense: neural correlates of the interaction between language and visual information in the perception of fear, *Social Cognitive and Affective Neuroscience* 6 (4): 404-416.

## The editing structure of The House on Sorority Row (1983)

Following on from earlier posts on the editing structure of *Halloween* (here) and *Slumber Party Massacre* (here), this week I look at the editing in *The House on Sorority Row* (1983). The shot length data can be accessed here: Nick Redfern – The House on Sorority Row. The shot length data has been corrected by a factor of 1.0416, and includes the opening credits since these are shown over footage of the characters and locations and are therefore relevant to the narrative.

As before I’m using the order structure matrix to visualise the time series of the data for this film, but to make clearer how the matrix relates to observed data values I’ve included two run charts in Figure 1 showing the shot lengths (bottom) and the ranks of the shot lengths (middle).
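As a rough illustration of how the ranks and the matrix relate to the raw shot lengths, the sketch below computes both for a short made-up series. The definition of the order structure matrix used here (entry (i, j) is 1 when shot i is longer than shot j, and 0 otherwise) is an assumption for illustration, since this post does not spell it out; the function names and the example values are likewise hypothetical.

```python
def ranks(shot_lengths):
    """Rank each shot length (1 = shortest); tied shots share their average rank."""
    ordered = sorted(shot_lengths)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1          # first 1-based position of this value
        count = ordered.count(value)              # how many shots are tied at this value
        rank_of[value] = first + (count - 1) / 2  # average of the tied positions
    return [rank_of[x] for x in shot_lengths]


def order_structure_matrix(shot_lengths):
    """Assumed definition: m[i][j] = 1 if shot i is longer than shot j, else 0."""
    return [[1 if xi > xj else 0 for xj in shot_lengths] for xi in shot_lengths]


# Hypothetical shot lengths in seconds, not data from the film.
example = [4.2, 1.1, 0.8, 6.5, 1.1]
example_ranks = ranks(example)
example_matrix = order_structure_matrix(example)

for row in example_matrix:
    print(row)
```

Under this assumed definition, a very short shot produces a column that is almost entirely ones and a very long take produces a column that is almost entirely zeros, which is why clusters of short or long takes show up as contrasting vertical bands when the matrix is drawn as an image.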

**Figure 1** Order structure matrix (top), ranks (middle), and shot length data (bottom) for *The House on Sorority Row* (1983)

With a median shot length of 3.0s and interquartile range of 3.7s *The House on Sorority Row* is edited more quickly than *Halloween* (median = 4.2s, IQR = 5.7s) but is similar to *Slumber Party Massacre* (median = 3.2s, IQR = 4.5s). There is no clear trend in shot lengths across the whole film and there are no clear distinctions between different narrative sections similar to the very abrupt shift we see in the final third of *Halloween*. Nonetheless, this film follows the general formal pattern set out in the earliest films of this sub-genre, with a number of clusters of longer and shorter takes associated with the same types of narrative events as in the other films. The replication of narrative events, character types, themes, and actions in the slasher film has been extensively analysed, and looking at their editing structure in detail it becomes very clear just how quickly a single style of editing became established in this type of film. There are only a few years between them, but the only major difference between *Halloween*, *Slumber Party Massacre*, and *The House on Sorority Row* is that the latter two films are cut more quickly.

The main feature in Figure 1 is the confrontation between Mrs. Slater and the girls that begins at shot 302 and runs until shot 440. This sequence is edited very quickly (Σ = 362.2s, median = 2.0s, IQR = 2.1s), but it is clear from Figure 1 that from shot 302 to shot 366 the shots actually get shorter as the scene reaches its peak: the girls force Mrs. Slater into the swimming pool at gun point and the moment of greatest tension – as one of the girls fires a shot into the pool – is the point at which editing is fastest. From shot 367 the sequence slows down using longer shots, and this can be clearly seen in the order structure matrix and the run chart of the ranks. Of course, longer is a relative term, and the ‘slowing down’ of the editing in the second part of this scene means a shift from shots less than 0.5 seconds to shots between 1.5 and 5 seconds (though there are few longer than 10 seconds). (The editing in this sequence is related to the cluster of short shots that can be seen as the white column at shots 89 – 102, and which features Vicki practising with the gun.) There is clearly a relationship between the way in which this scene is edited and the way in which the emotional impact of the scene is generated; and, while it is clear from watching the film that it is edited very quickly, it is easier to appreciate how this scene is structured by looking at the time series, given that the difference between shorter and longer shots may only be a couple of seconds.

The other clusters of shorter takes serve a different function but are also related to moments of intense emotion. The cluster beginning at shot 165 is part of a sequence of photographs of Mrs. Slater’s old sorority classes that begins quite slowly as the camera pans across the photos; but from shot 165 there is a change to rapid editing (accompanied by a change in the music and the use of whip pans) as Mrs. Slater tears up the pictures and burns them. Again, the change in editing style is associated with a change in the mood of the scene. The cluster of short shots from shot 855 to shot 874 is typical of the rapid editing in the latter stages of a slasher film, and is associated with the killing of Vicki and Liz as they dispose of a body. The intensity of the violence is reflected in the intensity of the editing.

This last cluster sits between two sequences edited much more slowly. The dark column in the matrix between shot 797 and shot 854 focuses on Katherine’s attempt to raise help by calling Dr. Beck, and his subsequent arrival and explanation of the night’s events. It also includes the scenes in the graveyard and the attempts to dispose of a body that we know results in disaster. This sequence is heavy on plot since it explains much of the background about Mrs. Slater and her son, Eric (i.e. the killer). The sequence that follows on from the deaths of Vicki and Liz in the graveyard (shots 875-897) shifts us back to Katherine and Dr. Beck, and is again lacking action while setting up the film’s finale.

The earlier clusters of longer takes slow down the pace of the film in order to create a pervasive sense of foreboding that de-accentuates the violence of the killings and seeks to put the viewer on edge. Shots 480-540 focus on the girls at the party and their anxiety that the body of Mrs. Slater might be discovered. This is framed as a series of long takes as Katherine meets Peter and resists his attempts to make her enjoy the party; and is notable for an elaborate tracking shot as the girls exchange glances across the dance floor. This cluster also includes the scene in which one of the girls makes the rookie mistake of going down to a darkened cellar by herself to check the fuse box, and again uses a slow editing pattern to build tension before she is finally dispatched. Similarly, shots 655-692 follow Katherine as she tries to find the girls who have gone missing from the party and explores the attic room of Mrs. Slater’s murderous son. These scenes are again important for establishing plot points and Katherine finds important symbolic objects (e.g. the jack-in-the-box), but their main purpose is to build up a state of nervous apprehension in the viewer. Interestingly, this is achieved by using slow panning shots from Katherine’s point-of-view whereas such shots in slasher films are typically used to represent the killer’s stalking of his victims. This sequence also includes the other members of the sorority trying to dispose of Mrs. Slater’s body only to run into a policeman. These sequences and the various narrative threads they present serve to create an emotionally tense atmosphere for the viewer, but unlike the aggressive tensity of the rapidly cut sections this mood is one of foreboding.

This use of two different editing patterns to create two different moods for the viewer is characteristic of the slasher film and can also be seen in the time series of *Halloween* and *Slumber Party Massacre*. We tend to speak of the style of a film in singular terms as though it definitely has one – and only one – mode of expression; but since the slasher film uses different editing patterns to create different effects it would make more sense to talk of the *styles* of these films. This can also be seen in the time series of RKO musicals (see here, here, and here).

The ‘final girl’ sequence begins at shot 985 (Σ = 434.4s, median = 2.7s, IQR = 2.1s). Here *The House on Sorority Row* does show some (minor) differences to *Halloween* and *Slumber Party Massacre*. In this film we have a progressive increase in the cutting rate, and the shift to shorter shots is particularly marked in the run chart of the shot ranks. The first part of this sequence is edited relatively slowly as Katherine makes her way through the sorority house to the attic, and this can be seen in the dark column at this point in the matrix in Figure 1. This is different to the other films, in which the corresponding sequence begins when the killer attacks the final girl (as can clearly be seen at shot 437 in the matrix for *Halloween*). In *The House on Sorority Row* the final girl goes looking for the killer. Once the struggle between Katherine and the killer begins (shot 1063) we see the same rapid editing observed in *Halloween* and *Slumber Party Massacre*, but we do not see the same fast-slow-fast pattern noted in the other films as the struggle between the killer and the final girl is temporarily suspended. This is due to the postponement of the killer’s return once we think he has been killed. The last shot of the film is a close-up of the eye as we discover Katherine has not defeated him and assume their struggle to the death will continue. *The House on Sorority Row* presents the same final girl sequence as the other slasher films I have looked at but cuts the narrative (and therefore the editing pattern) off before it reaches its ‘natural’ conclusion.

Like *Halloween*, *The House on Sorority Row* was remade in 2009 and a future post will look at the similarities and the differences between the original version of these films and their later reinvention.

## Using the ECDF to analyse film style

Last month I looked at using kernel densities to analyse film style, and to follow up, this week’s post will focus on another simple graphical method for understanding film style: the empirical cumulative distribution function (ECDF).

Although it has a grand sounding name this is a very simple method for getting a lot of information very quickly. Most statistical software packages will calculate the ECDF for you and draw you a graph, but it is very simple to create an EXCEL or CALC spreadsheet to do this since it does not require any special knowledge.

The ECDF gives a complete description of a data set, and is simply *the fraction of a data set less than or equal to some specified value*. Several plotting positions for the ECDF have been suggested, but here we use the simplest method:

$$F(X) = \frac{\text{number of shots with } x \le X}{N}$$

which means that you count the number of shots (*x*) less than or equal to some value (*X*), and then divide by the sample size (*N*). Do this for every value of *x* in your data set and you have the ECDF. We can interpret this fraction in several ways: we can think of it as the probability of randomly selecting an *x* less than or equal to *X* (*P*[*x* ≤ *X*]); or we can think of it as the proportion of values less than or equal to *X*; or, if we multiply by 100, the percentage of values in a data set less than or equal to *X*.

For example, using the data set for *Easy Virtue* (1928) from the Cinemetrics database available here we can calculate the ECDF as illustrated in Table 1.

**Table 1** Calculating the ECDF for *Easy Virtue* (1928) (*N* = 706)

To start, look at the value of *X* in the first column and then count the number of shots in the film with length less than or equal to that value. The first value is 0.9 but there are no shots this short in the film and so the frequency is zero. Divide this zero by the number of shots in the film (i.e. 706) and you have the ECDF when *X* = 0.9, which is 0 (because 0 divided by any non-zero number is always 0). Next, *X* = 1.0 seconds and there is 1 shot less than or equal to this value and so the ECDF at *X* = 1.0 is 1/706 = 0.0014. Turning to *X* = 1.1 we see there are three shots that are 1.1 seconds long AND there is one shot that is shorter in length (i.e. the one at 1.0s), and so the ECDF at *X* = 1.1 is 4/706 = 0.0057. This is equal to the frequency of 1.0 second long shots divided by *N* (0.0014) PLUS the frequency of shots that are 1.1 seconds long divided by *N* (3/706 = 0.0042) – and that is why it’s called the *cumulative* distribution function. From this point you keep going until you reach the end: the longest shot in the film is given as 66.6 seconds long and so all 706 shots must be less than or equal to 66.6 seconds and so at this value of *X* the ECDF = 706/706 = 1.0. The ECDF is 1.0 for any value of *X* greater than the maximum *x* in the data set.
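For anyone who would rather script this than build a spreadsheet, the counting procedure described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and the data set is a stand-in in which the first four values follow the worked example (one shot of 1.0s, three of 1.1s) while the other 702 values are arbitrary placeholders rather than the real shot lengths for *Easy Virtue*.

```python
from collections import Counter


def ecdf_table(shot_lengths):
    """Rows of (X, frequency, cumulative frequency, ECDF) for each distinct X."""
    n = len(shot_lengths)
    counts = Counter(shot_lengths)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]  # the running total is the "cumulative" part
        rows.append((value, counts[value], cumulative, cumulative / n))
    return rows


def ecdf_at(shot_lengths, x_value):
    """Fraction of shots with length less than or equal to x_value."""
    return sum(1 for x in shot_lengths if x <= x_value) / len(shot_lengths)


# Stand-in data: 1 shot of 1.0s and 3 shots of 1.1s as in the worked example;
# the 702 values of 5.0 are arbitrary placeholders, NOT the film's real data.
toy = [1.0] + [1.1] * 3 + [5.0] * 702

for x_value, freq, cum, f in ecdf_table(toy)[:2]:
    print(f"X = {x_value}: frequency = {freq}, cumulative = {cum}, ECDF = {f:.4f}")
```

With the real shot length data loaded into a list in place of `toy`, `ecdf_table` should give the same columns as the spreadsheet version, and `ecdf_at` answers one-off questions such as what fraction of shots are 2.0 seconds or shorter.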

It really is this easy. And you can get a simple graph of *F*(*x*) by plotting x on the *x*-axis and the ECDF on the *y*-axis. More usefully, you can plot the ECDFs of two or more films on the same graph so that you can compare their shot length distributions. Figure 1 shows the empirical cumulative distribution functions of *Easy Virtue* and *The Skin Game* (1931 – access the data here).

**Figure 1** The empirical cumulative distribution functions of *Easy Virtue* (1928) and *The Skin Game* (1931)

Now clearly there is a problem with this graph: because the shot length distribution of a film is positively skewed all the shots are bunched up on the left-hand side of the plot and you cannot see any detail. This can be resolved by redrawing the *x*-axis on a logarithmic scale, which stretches out the bottom end of the data, which has all the detail, and squashes the top end, which has only a few data points. This can be seen in Figure 2.

**Figure 2** The empirical cumulative distribution functions of *Easy Virtue* (1928) and *The Skin Game* (1931) on a log-10 scale

These two graphs present exactly the same information, but at least in Figure 2 we can find the information we want. In transforming the *x*-axis we have not assumed the shot length distribution of either film follows a lognormal distribution – which is just as well because this is obviously not true for either film.

Now what can we discover about the editing in these two films?

First, it is clear that these two films have the same median shot length because the probability of randomly selecting a shot less than or equal to 5.0 seconds is 0.5 in both films. The definition of the median shot length is the value that divides a data set in two so that half of the values are less than or equal to it and half are greater than or equal to it (i.e. *P*[*x* ≤ *X*] = 0.5). We might therefore conclude that they have the same style. However, these two films clearly have different shot length distributions and it is easier to appreciate this when we combine numerical descriptions with a plot of the actual distributions.

A basic rule for interpreting the plot of ECDFs for two films is that if the plot for film A lies to the right of the plot for film B then film A is edited more slowly. Obviously this is not so clear cut in Figure 2.

Below the median shot length, the ECDF of *The Skin Game* lies to the left of that of *Easy Virtue*, indicating that at those shot lengths it has a greater proportion of shots at the low end of the distribution: for example, 25% of the shots in *The Skin Game* are less than or equal to 2.0 seconds in length compared to just 6% of the shots in *Easy Virtue*. This would seem to indicate that *The Skin Game* is edited more *quickly* than *Easy Virtue*. At the same time we see that above the median shot length the ECDF of *The Skin Game* lies to the right of that of *Easy Virtue*, indicating that it has a greater proportion of long shots at the high end of the distribution: for example, 75% of the shots in *Easy Virtue* are less than or equal to 8.3 seconds compared to 66% of the shots in *The Skin Game*. This would appear to suggest that *The Skin Game* is edited more *slowly* than *Easy Virtue*. Clearly there is something more interesting going on than indicated by the equality of the medians, and the answer lies in how spread out the shot lengths of these two films are. The ECDF of *Easy Virtue* is very steep and covers only a limited range of values, whereas the ECDF of *The Skin Game* covers a much wider range of shot lengths. The interquartile range of *Easy Virtue* is 5.2 seconds (Q1 = 3.1s, Q3 = 8.3s), indicating the shot lengths of this film are not widely dispersed; while the IQR of *The Skin Game* is 12.7s (Q1 = 2.0s, Q3 = 14.7s).
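The quartiles quoted above can be read straight off the ECDF: the *p*-th quantile is the smallest *X* at which the ECDF reaches *p*. The sketch below uses that definition on two invented sets of shot lengths that share a median but differ in spread. The function names and the numbers are hypothetical, and since statistics packages use several different conventions for quantiles, the values they report for real data may differ slightly from this simple rule.

```python
import math


def ecdf_quantile(shot_lengths, p):
    """Smallest observed X whose ECDF is at least p (for 0 < p <= 1)."""
    ordered = sorted(shot_lengths)
    index = math.ceil(p * len(ordered)) - 1  # 0-based position where the ECDF first reaches p
    return ordered[max(index, 0)]


def summary(shot_lengths):
    """Median, lower quartile, upper quartile, and interquartile range."""
    q1 = ecdf_quantile(shot_lengths, 0.25)
    median = ecdf_quantile(shot_lengths, 0.50)
    q3 = ecdf_quantile(shot_lengths, 0.75)
    return {"median": median, "Q1": q1, "Q3": q3, "IQR": q3 - q1}


# Hypothetical shot lengths in seconds: same median, very different dispersion.
film_a = [3.0, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 8.0]
film_b = [1.0, 1.5, 2.0, 5.0, 9.0, 12.0, 15.0, 30.0]

print(summary(film_a))
print(summary(film_b))
```

Both toy films have a median of 5.0 seconds, but the interquartile range of the second is several times that of the first, which is the same situation as the two films discussed above.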

This example is an excellent demonstration of why it is important to always provide a measure of the dispersion of a data set when describing film style. It is not enough to only provide the average shot length since two films may have the same median shot length and completely different editing styles. See here for a discussion of appropriate measures of scale that can be used. It should be standard practice that an appropriate measure of dispersion is cited along with the median shot length for a film by any researcher who wants to do statistical analysis of film style, and journal editors and/or book publishers who receive work where this is not the case should send it back immediately with a note asking for a proper description of a film’s style. If you don’t include any description – either numerical or graphical – of the dispersion of shot lengths in a film then you haven’t described your data properly.

We can also use the ECDFs for two films to perform a statistical test of the null hypothesis that they have the same distribution. This is called the Kolmogorov-Smirnov (KS) test, and the test statistic is simply the maximum value of the absolute differences between the ECDF of one film (*F*(*x*)) and the ECDF of another film (*G*(*x*)) for every value of *x*. The ‘absolute difference’ means that you subtract one from the other and then take only the size of the answer and ignore the sign (i.e. ignore whether it is positive or negative):

$$D = \max_{x} \left| F(x) - G(x) \right|$$

Table 2 shows this process for the two films in Figures 1 and 2.

**Table 2** Calculating the Kolmogorov-Smirnov test statistic for the ECDFs of *Easy Virtue* (1928) and *The Skin Game* (1931)

In the first column in Table 2 we have the lengths of the shots from the smallest in the two films (0.6 seconds) to the longest (174.7 seconds), and then in columns two and three we have the ECDF of each film. Column four is the difference between the ECDFs of the two films, subtracting the ECDF of *The Skin Game* from the ECDF of *Easy Virtue* for every *x*: so when *x* = 0.6, we have 0-0.0037 = -0.0037. The final column is the absolute difference, which is just the size of the value in the fourth column and the sign is ignored: the absolute value of -0.0037 is 0.0037. Do this for every value of *x* and find the largest value in the final column.

In the case of these two films the maximum absolute difference occurs when *x* = 2.0 and is statistically significant (p < 0.01). Therefore we conclude these two films have different shot length distributions. (You may find that different statistics packages give slightly different answers to this depending on the plotting position used.)
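The column-by-column procedure for Table 2 can also be sketched in Python. The function names and the two tiny samples are invented for illustration. The post does not say how the p-value was obtained, so the critical value below uses the usual large-sample approximation for the two-sample test, which is an assumption on my part and will not match exact small-sample tables.

```python
import math


def ecdf_at(sample, x_value):
    """Fraction of the sample less than or equal to x_value."""
    return sum(1 for x in sample if x <= x_value) / len(sample)


def ks_statistic(sample_f, sample_g):
    """Return (D, x): the largest |F(x) - G(x)| and the first x at which it occurs."""
    best_d, best_x = 0.0, None
    for x in sorted(set(sample_f) | set(sample_g)):  # every shot length in either film
        d = abs(ecdf_at(sample_f, x) - ecdf_at(sample_g, x))
        if d > best_d:
            best_d, best_x = d, x
    return best_d, best_x


def ks_critical_value(n, m, alpha=0.01):
    """Large-sample approximation to the critical value of D (an assumption here)."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical shot lengths in seconds, far too few for a real test.
film_f = [1.0, 2.0, 3.0, 4.0]
film_g = [3.0, 4.0, 5.0, 6.0]

d, x_at_max = ks_statistic(film_f, film_g)
print(f"D = {d} at x = {x_at_max}")
```

With real data, the null hypothesis of identical distributions would be rejected at the chosen level when `d` exceeds `ks_critical_value(len(film_f), len(film_g))`.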

An online calculator for the KS-test that will also draw a plot of the ECDFs can be accessed here, and is accompanied by a very useful explanation. (NB: this only works for data sets up to N = 1024). Rescaling the *x*-axis of our plot of the two ECDFs does not affect the KS-test since the ECDFs are on the *y*-axis and the D column in Table 2 is the *vertical* difference between them.

(There is also a one-sample version of the KS-test for comparing a single distribution to a theoretical distribution to determine goodness-of-fit, but there are so many other methods that do exactly the same thing better that it’s not worth bothering with.)

The ECDF is very easy to calculate, the graph is very easy to produce and provides a lot of information about a data set for very little effort, and the KS-test is also a very simple way of comparing two data sets. There is no bewildering mathematics involved: just count, divide, add, subtract, and ignore. The statistical analysis of film style really is this easy.

## Robust time series analysis of ITV news bulletins

I have mentioned numerous times on this blog the importance of using robust statistics to describe film style. This week I continue in this vein, albeit in a different context – time series analysis. In a much publicised piece of work James Cutting, Jordan De Long, and Christine Nothelfer (2010) calculated partial autocorrelation functions and a modified autoregressive index for a sample of Hollywood films. While I have no problems with the basis of this research, I do think the results are dubious due to the use of non-robust methods to determine the autocovariance between shot lengths in these films. The paper attached below analyses the editing structure of the set of ITV news bulletins I discussed in a paper last year, comparing the results produced using classical and robust autocovariance functions.

**Robust time series analysis of ITV news bulletins**

In this paper we analyse the editing of ITV news bulletins using robust statistics to describe the distribution of shot lengths and its editing structure. Commonly cited statistics of film style such as the mean and variance do not accurately describe the style of a motion picture and reflect the influence of a small number of extreme values. Analysis based on such statistics will inevitably lead to flawed conclusions. The median and a robust estimator of scale are superior measures of location and dispersion for shot lengths since they are resistant to outliers and unaffected by the asymmetry of the data. The classical autocovariance and its related functions based on the mean and the variance are also non-robust in the presence of outliers, and lead to a substantially different interpretation of editing patterns when compared to robust time series statistics that are outlier resistant. In general, the classical methods underestimate the persistence in the time series of these bulletins, indicating a random editing process, whereas the robust time series statistics suggest an AR(1) or AR(2) model may be appropriate.

The pdf file is here: Nick Redfern – Robust Time Series Analysis of ITV News Bulletins

My original post on the time series analysis of ITV news bulletins can be accessed here, along with the datasets for each of the fifteen bulletins.

My new results indicate the conclusions of Cutting, De Long, and Nothelfer are flawed, and that it is very likely they have underestimated the autocovariance present in the editing of Hollywood films. The discrete and modified autoregressive indexes they present are likely to be too low, though there may be some instances when they are actually too high. This is not enough to reject their conclusion that Hollywood films have become increasingly clustered in packets of shots of similar length, and I have not yet applied this method to their sample of films. It is, however, enough to recognise there are some problems with the methodology and the results of this research.
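To give a feel for why outliers matter here, the sketch below compares the classical lag-1 autocorrelation with a rank-based version on a made-up series containing one extreme value. This post does not say which robust estimator the paper uses, so the rank-based autocorrelation is a simple stand-in chosen for illustration, not the paper's method; the function names and numbers are hypothetical.

```python
def autocorrelation(series, lag=1):
    """Classical sample autocorrelation at the given lag (mean and variance based)."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return numerator / denominator


def ranks(series):
    """Rank each value (1 = smallest); tied values share their average rank."""
    ordered = sorted(series)
    rank_of = {}
    for value in set(ordered):
        first = ordered.index(value) + 1
        rank_of[value] = first + (ordered.count(value) - 1) / 2
    return [rank_of[x] for x in series]


def rank_autocorrelation(series, lag=1):
    """Stand-in robust estimate: the classical formula applied to the ranks."""
    return autocorrelation(ranks(series), lag)


# Hypothetical series: steadily lengthening shots with one extreme outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
contaminated = [1, 2, 3, 4, 100, 6, 7, 8, 9, 10]

classical = autocorrelation(contaminated)
rank_based = rank_autocorrelation(contaminated)
print(round(autocorrelation(clean), 3), round(classical, 3), round(rank_based, 3))
```

In this toy case the single outlier pulls the classical estimate from clearly positive to slightly negative, while the rank-based estimate stays positive, which is the kind of underestimated persistence described above.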

### References

**Cutting JE, Delong JE, and Nothelfer CE** 2010 Attention and the evolution of Hollywood film, *Psychological Science* 21 (3): 432-439.

## Revealing narrative structure through aesthetic analysis

This week some papers relating to the discovery of narrative structure in motion pictures based on the patterns of aesthetic elements. But first, many of the papers on statistical analysis of film style in this post and on many others across this blog are co-authored by Svetha Venkatesh from Curtin University’s Computing department, and her home page – with links to much research relevant to film studies – can be accessed here.

**Adams B, Venkatesh S, Bui HH, and Dorai C** 2007 A probabilistic framework for extracting narrative act boundaries and semantics in motion pictures, *Multimedia Tools and Applications* 27: 195-213.

This work constitutes the first attempt to extract the important narrative structure, the 3-Act storytelling paradigm in film. Widely prevalent in the domain of film, it forms the foundation and framework in which a film can be made to function as an effective tool for storytelling, and its extraction is a vital step in automatic content management for film data. The identification of act boundaries allows for structuralizing film at a level far higher than existing segmentation frameworks, which include shot detection and scene identification, and provides a basis for inferences about the semantic content of dramatic events in film. A novel act boundary likelihood function for Act 1 and 2 is derived using a Bayesian formulation under guidance from film grammar, tested under many configurations and the results are reported for experiments involving 25 full-length movies. The result proves to be a useful tool in both the automatic and semi-interactive setting for semantic analysis of film, with potential application to analogues occurring in many other domains, including news, training video, sitcoms.

**Chen H-W, Kuo J-H, Chu W-T, Wu J-L** 2004 Action movies segmentation and summarization based on tempo analysis, 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, New York, NY, 10-16 October, 2004.

With the advances of digital video analysis and storage technologies, also the progress of entertainment industry, movie viewers hope to gain more control over what they see. Therefore, tools that enable movie content analysis are important for accessing, retrieving, and browsing information close to a human perceptive and semantic level. We proposed an action movie segmentation and summarization framework based on movie tempo, which represents as the delivery speed of important segments of a movie. In the tempo-based system, we combine techniques of the film domain related knowledge (film grammar), shot change detection, motion activity analysis, and semantic context detection based on audio features to grasp the concept of tempo for story unit extraction, and then build a system for action movies segmentation and summarization. We conduct some experiments on several different action movie sequences, and demonstrate an analysis and comparison according to the satisfactory experimental results.

**Hu W, Xie N, Li L, Zeng X, and Maybank S** 2011 A survey on visual content-based video indexing and retrieval, *IEEE Transactions On Systems, Man, and Cybernetics—Part C: Applications And Reviews*, 41 (6): 797-819.

Video indexing and retrieval have a wide spectrum of promising applications, motivating the interest of researchers worldwide. This paper offers a tutorial and an overview of the landscape of general strategies in visual content-based video indexing and retrieval, focusing on methods for video structure analysis, including shot boundary detection, key frame extraction and scene segmentation, extraction of features including static key frame features, object features and motion features, video data mining, video annotation, video retrieval including query interfaces, similarity measure and relevance feedback, and video browsing. Finally, we analyze future research directions.

**Moncrieff S and Venkatesh S** 2006 Narrative structure detection through audio pace, IEEE Multimedia Modeling 2006, Beijing, China, 4–6 Jan 2006

We use the concept of film pace, expressed through the audio, to analyse the broad level narrative structure of film. The narrative structure is divided into visual narration, action sections, and audio narration, plot development sections. We hypothesise that changes in the narrative structure signal a change in audio content, which is reflected by a change in audio pace. We test this hypothesis using a number of audio feature functions, that reflect the audio pace, to detect changes in narrative structure for 8 films of varying genres. The properties of the energy were then used to determine the audio pace feature corresponding to the narrative structure for each film analysed. The method was successful in determining the narrative structure for 7 of the films, achieving an overall precision of 76.4 % and recall of 80.3%. We map the properties of the speech and energy of film audio to the higher level semantic concept of audio pace. The audio pace was in turn applied to a higher level semantic analysis of the structure of film.

**Murtagh F, Ganz A, and McKie S** 2009 The structure of narrative: the case of film scripts, *Pattern Recognition* 42 (2): 302-312.

We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts from six episodes of CSI (Crime Scene Investigation). For analysis of style and structure, we quantify various central perspectives discussed in McKee’s book, Story: Substance, Structure, Style, and the Principles of Screenwriting. Film scripts offer a useful point of departure for exploration of the analysis of more general narratives. Our methodology, using Correspondence Analysis and hierarchical clustering, is innovative in a range of areas that we discuss. In particular this work is groundbreaking in taking the qualitative analysis of McKee and grounding this analysis in a quantitative and algorithmic framework.

**Phung DQ, Duong TV, Venkatesh S, and Bui HH** 2005 Topic transition detection using hierarchical hidden Markov and semi-Markov models, 13th Annual ACM International Conference on Multimedia, 6-11 November 2005, Singapore.

In this paper we introduce a probabilistic framework to exploit hierarchy, structure sharing and duration information for topic transition detection in videos. Our probabilistic detection framework is a combination of a shot classification step and a detection phase using hierarchical probabilistic models. We consider two models in this paper: the extended Hierarchical Hidden Markov Model (HHMM) and the Coxian Switching Hidden semi-Markov Model (S-HSMM) because they allow the natural decomposition of semantics in videos, including shared structures, to be modeled directly, and thus enable efficient inference and reduce the sample complexity in learning. Additionally, the S-HSMM allows the duration information to be incorporated, consequently the modeling of long-term dependencies in videos is enriched through both hierarchical and duration modeling. Furthermore, the use of Coxian distribution in the S-HSMM makes it tractable to deal with long sequences in video. Our experimentation of the proposed framework on twelve educational and training videos shows that both models outperform the baseline cases (flat HMM and HSMM) and performances reported in earlier work in topic detection. The superior performance of the S-HSMM over the HHMM verifies our belief that the duration information is an important factor in video content modelling.

**Pfeiffer S and Srinivasan U** 2002 Scene determination using auditive segmentation models of edited video, in C Dorai and S Venkatesh (eds.) *Computational Media Aesthetics*. Boston: Kluwer Academic Publishers: 105-130.

This chapter describes different approaches that use audio features for determination of scenes in edited video. It focuses on analysing the sound track of videos for extraction of higher-level video structure. We define a scene in a video as a temporal interval which is semantically coherent. The semantic coherence of a scene is often constructed during cinematic editing of a video. An example is the use of music for concatenation of several shots into a scene which describes a lengthy passage of time such as the journey of a character. Some semantic coherence is also inherent to the unedited video material such as the sound ambience at a specific setting, or the change pattern of speakers in a dialogue. Another kind of semantic coherence is constructed from the textual content of the sound track revealing for example the different stories contained in a news broadcast or documentary. This chapter explains the types of scenes that can be constructed via audio cues from a film art perspective. It continues on a discussion of the feasibility of automatic extraction of these scene types and finally presents existing approaches.

**Weng C-Y, Chu W-T, and Wu J-L** 2009 RoleNet: movie analysis from the perspective of social networks, *IEEE Transactions on Multimedia* 11(2): 256-271.

With the idea of social network analysis, we propose a novel way to analyze movie videos from the perspective of social relationships rather than audiovisual features. To appropriately describe role’s relationships in movies, we devise a method to quantify relations and construct role’s social networks, called RoleNet. Based on RoleNet, we are able to perform semantic analysis that goes beyond conventional feature-based approaches. In this work, social relations between roles are used to be the context information of video scenes, and leading roles and the corresponding communities can be automatically determined. The results of community identification provide new alternatives in media management and browsing. Moreover, by describing video scenes with role’s context, social-relation-based story segmentation method is developed to pave a new way for this widely-studied topic. Experimental results show the effectiveness of leading role determination and community identification. We also demonstrate that the social-based story segmentation approach works much better than the conventional tempo-based method. Finally, we give extensive discussions and state that the proposed ideas provide insights into context-based video analysis.

## Statistical literacy in film studies I

UPDATE (21 JULY 2013): A much-revised version of this post has now been published as Film studies and statistical literacy, Media Education Research Journal 4 (1) 2013: 58-71. This article can be accessed here: Nick Redfern – Film Studies and Statistical Literacy.

A theme we will return to over the course of this year’s posts is statistical literacy in film studies.

In the recent film policy review published by the DCMS (here) it was noted that there exists an artificial division between the humanities and the sciences in education in the UK and that this is unhealthy for the film industry in particular.

It was noted that some curricula already allow a wider range of subjects easily to be combined but that in general students were driven to either arts and humanities, or science courses. This was not in step with the kinds of skills and talents being sought by cutting edge, creative film companies or in the competitive arena of post-production and special effects.

The Panel recognises that it is vital to the success of the creative industries in the UK that pupils in secondary schools are made aware of the importance of studying arts and science in tandem rather than being pushed to choose between them. The Panel believes it is the synergy between these subjects that is crucial to the development of expertise in many of the creative sectors and especially in film. The Panel would like to see DfE building on proposals in *Next Gen*, the Review by Ian Livingstone and Alex Hope undertaken for the National Endowment for Science, Technology and the Arts (NESTA) at the request of the Minister for Culture, Communications and Creative Industries.

The NESTA report can be accessed here.

The concerns of the film policy review are focussed on the need to develop a skilled workforce that can continue to make the UK a hub for production and visual effects in the global film industry, but the negative aspects of this separation can be extended to intellectual inquiry in general.

The separation between film studies and statistics can also be viewed as antithetical to the needs of the film industry. The cinema is an industry and as such requires individuals who not only understand how that industry works (which has traditionally fallen within the scope of film studies) but also understand statistics as used in economics, management, and marketing in that industry (which is most definitely not encompassed by the film studies curriculum). Arts and sciences should be taught together, and one way to achieve this in film studies is by developing statistical literacy in film scholars.

### Statistics in film studies

The study of film is a diverse field comprising four distinct but related fields of inquiry: film industries, technologies, and film policy; textual analysis; ethnographic research on audiences; and the cognitive-psychological processes of perception and cognition (see here for more detail).

Statistics is relevant to each of these four areas and film students will encounter information presented in numerical, graphical, and tabular form in whatever aspect of the cinema they choose to study. Statistical summaries feature in many film studies texts, in newspaper and magazine articles on the cinema, and in official reports and statistical yearbooks. Indeed, the DCMS report itself uses many different statistical methods (including some really horrible doughnut graphs). Film scholars will also encounter more advanced methods in research from disciplines such as neuroscience or economics where scientific and/or statistical knowledge is commonplace.

To illustrate the use of statistics the following provides an example from each of the four areas identified above.

#### Film industries

**Simonton DK** 2005 Cinematic creativity and production budgets: does money make the movie?, *The Journal of Creative Behavior* 39(1): 1-15.

This paper examines the relationship between production budgets and box office success, awards, and critical acclaim, and uses statistical terms and methods including correlation, sample, variables, mean, standard deviation, range, Cronbach’s alpha coefficient, p-values, hypothesis tests, and tables.

#### Textual analysis

**Wang C-W, Cheng W-H, Chen J-C, Yang S-S, and Wu J-L** 2007 Film narrative exploration through the analysis of aesthetic elements, in T-J Cham, J Cai, C Dorai, D Rajan, and T-S Chua (Eds.) *Proceedings of the 13th International Conference on Multimedia Modeling – Volume I* . Berlin: Springer-Verlag: 606-615.

This paper uses statistical models to reveal the structure of narratives in films by analysing aesthetic features, and uses line charts, tables, flow charts, weighting functions, shape parameters, percentages, and sigma notation.

#### Audience research

**Hardie A** 2008 Rollercoasters and reality: a study of big screen documentary audiences 2002-2007, *Participations* 5 (1).

This paper presents the results of a questionnaire survey of audiences for documentary feature films, and uses a range of statistical methods, including percentages, bar charts, stacked pie charts, (horrible) pie charts, and tables.

#### Perception in the cinema

**Mital PK, Smith TJ, Hill R, and Henderson JM** 2011 Clustering of gaze during dynamic scene viewing is predicted by motion, *Cognitive Computation* 3 (1): 5-24.

This paper studies attention in viewing scenes in motion pictures and uses a range of statistical methods and terms (alongside other scientific terms), including range, mean, non-linear statistics, Receiver Operating Characteristic curves, *k*-means clustering, histograms, line charts, tables, covariance, Gaussian mixture models, time series charts, standard error, and Bayesian Information Criteria.

Clearly, understanding research on the cinema requires a relatively high level of statistical literacy, and yet I am not aware of any film studies programme that incorporates statistics as part of its tuition. Many reading the above papers will feel they have a grasp of what they were intended to achieve and the main results, but this is not the same as understanding why the methods used were chosen or being able to evaluate the design of a study. It is a serious failing in the instruction students receive on film studies degrees that they are expected to deal with numerical and graphical data on a regular basis without the proper training in statistical concepts and methods. For £9000 p.a. – or however much you are paying for your education – I would expect to get more than merely the gist of a piece of research.

### Statistical literacy, mathematics, and the liberal arts

Statistical literacy is to statistics as art appreciation is to art

Milo Schield and Cynthia Schuman Schield

The concept of ‘literacy’ has come to mean the ‘idea of being able to find one’s way around some kind of system, and to “know its language” well enough to make sense of it,’ and foregrounds the notion of being able to ‘make meaning’ as either a producer or consumer within that system (Lankshear & Knobel 2003: 15). Education has become focussed on developing a range of literacies, such as scientific literacy, computer literacy, media literacy, and statistical literacy.

Statistical literacy may be defined as

the ability to understand and critically evaluate statistical results that permeate our daily lives – coupled with the ability to appreciate the contributions that statistical thinking can make in public and private, professional and personal decisions (Wallman 1993: 1).

Statistical literacy is directly relevant to the humanities, though it rarely features:

the ability to read and interpret summary statistics in the everyday media: in graphs, tables, statements, surveys and studies. Statistical literacy is needed by data consumers – students in non-quantitative majors: majors with no quantitative requirement such as political science, history, English, primary education, communications, music, art and philosophy. About 40% of all US college students graduating in 2003 had non-quantitative majors (Schield 2010)

One of the problems with introducing statistics into a humanities curriculum is that most students on humanities courses will have limited mathematical skills and/or low confidence in the skills they do possess. Many students may in fact be put off by the fact that film courses have some statistical content because they view it as mathematics. This problem has been widely recognised in the literature on statistical literacy, and although numeracy is a pre-requisite for statistical literacy, advocates of statistical literacy stress that it is *not* the same as mathematics. For example, David S. Moore argues that statistical reasoning is one of the liberal arts because it is a flexible and broadly applicable mode of thinking, and prepares students.

Statistics is a *general* intellectual method that applies wherever data, variation, and chance appear. It is a *fundamental* method because data, variation, and chance are omnipresent in modern life. It is an *independent* discipline with its own core ideas rather than, for example, a branch of mathematics (1998: 1254, original emphasis).

From this perspective, the emphasis in early statistical education should be on statistical thinking rather than on statistical methods, prioritizing conceptual understanding rather than computational recipes. Though it may seem contrary to the goals of teaching statistics, a first course in statistics does *not* seek to develop statisticians. Rather it seeks to develop a set of skills and attitudes that allow scholars to be able to engage with the information presented to them. A list of goals for students in developing statistical literacy is provided by Gal and Garfield (1997: 3-5), and includes,

- understanding the principles and processes of scientific discovery,
- understanding the role of statistics in scientific discovery,
- understanding the logic of statistical reasoning,
- understanding statistical terms,
- the ability to interpret results presented in tabular, numerical, and graphical form, and to be aware of possible sources of variation and bias,
- the ability to communicate using statistical and probabilistic terminology properly,
- developing a critical stance towards research that purports to be based on data,
- developing the confidence and willingness to engage with quantitative research.

The purpose in obtaining these skills is to become a *statistical thinker* ‘able to critique and evaluate results of a problem solved or statistical study’ (Ben-Zvi & Garfield 2004: 7).

A similar approach is proposed by Milo Schield who argues that statistical thinking is a form of critical thinking:

statistical literacy, critical thinking about statistics as evidence, is an integral component of a liberal education since a key goal of statistical literacy is helping students understand that statistical associations in observational studies are contextual: their numeric value and meaning depends on what is taken into account. The need to deal with context and confounding is ubiquitous to all observational studies whether in business, the physical sciences (e.g., astrophysics), the social sciences, or the humanities (Schield 2004).

Introducing the topic in this way to students who are already (or should be) familiar with critical thinking should make it easier to encourage them to engage with data-based arguments. It is in this context that we understand the epigram that heads this section. Another perspective is to view statistical literacy as *quantitative rhetoric* (Schmit 2010), which again focuses on ‘critical thinking, analysis of argumentation and persuasion, and an ability to interpret statistics in context.’

A direct parallel may be drawn between statistical literacy and media literacy. ‘Media literacy’ refers to the ability of individuals to access, understand, and create communications in a variety of contexts. It is one of the justifications for film studies and similar fields that it produces media literate citizens. Similarly, courses in statistical literacy aim to produce statistically literate citizens who are able to interpret, evaluate, and use quantitative information when it is presented to them. Since this information often comes to us via the media, statistical literacy and media literacy cannot be separated.

The role of employability in higher education may be defined as ‘equipping individuals to secure their own economic success’ (Denholm et al. 2003: 12) and covers traditional academic skills, personal development skills, and enterprise or business skills (Purcell & Pitcher 1996). Statistical literacy clearly falls within this definition, and selling such courses to students (who are paying a lot of money) needs to stress this dimension. Presenting statistical literacy within film studies in these terms is a direct response to the observations of the DCMS policy review noted above.

Statistical literacy is different from *statistical competence*, in which individuals function as data producers and analysers in producing original empirical research rather than consumers presented with a completed study. Naturally, we want students to develop the necessary skills that will allow them to produce high quality original research, and it is clear that much research in film studies will require the ability to design studies, collect and manage data, perform statistical analyses, and communicate those results. This depends on statistical literacy – just as you cannot write without being able to read, you cannot become competent in statistical methods without first understanding the role of statistics in empirical research, the ability to communicate ideas in tables, numbers, or graphs, or the willingness to engage with quantitative methods. *Every* film student needs to be statistically literate, but only those who wish to engage in quantitative research requiring the use of statistical methods need to master procedural skills.

However, I do think that *every* film studies post-graduate should receive some training in statistical research methods.

### Statistical literacy resources

There is a very large body of literature on the subject of statistical literacy. Fortunately, there are some excellent resource pages that gather this information and some of these are listed here.

- Statlit.org: a good place to start.
- International Statistical Literacy Project
- Journal of Statistics Education: a special issue on statistical reasoning from 2002. The paper by Joan Garfield should be read by anyone interested in statistics in film studies.
- UK Parliament’s summary of statistical literacy
- Milo Schield’s papers on statistical literacy can be accessed here.

The following papers referred to above can also be accessed freely online (other references are given below):

**Gal I** 2002 Adults’ statistical literacy: meanings, components, responsibilities, *International Statistical Review* 70 (1): 1-51.

**Gal I and Garfield J** 1997 Curricular goals and assessment challenges in statistics education, in I Gal and JB Garfield (eds.) *The Assessment Challenges in Statistics Education*. Amsterdam: IOS Press: 1-13.

**Moore DS** 1998 Statistics among the liberal arts, *Journal of the American Statistical Association* 93 (444): 1253-1259.

**Schield M** 2004 Statistical literacy and liberal education at Augsburg College, *Peer Review* 6 (4): 16-18.

**Schield M** 2010 Assessing statistical literacy: take CARE, in P Bidgood, N Hunt, and F Joliffe (eds.) *Statistical Education: An International Perspective*. Chichester: John Wiley & Sons: 133-152. (Excerpts can be accessed here).

**Schmit J** 2010 Teaching statistical literacy as a quantitative rhetoric course.

### References

**Ben-Zvi D and Garfield J** 2004 Statistical literacy, reasoning, and thinking: goals, definitions, and challenges, in D Ben-Zvi and J Garfield (eds.) *The Challenge of Developing Statistical Literacy, Reasoning, and Thinking*. Dordrecht: Kluwer Academic Publishers: 3-15.

**Denholm J, McLeod D, Boyes L, and McCormick J** 2003 *Higher Education: Higher Ambitions? Graduate Employability in Scotland*. Edinburgh: Scottish Higher Education Funding Council.

**Lankshear C and Knobel M** 2003 *New Literacies: Changing Knowledge and Classroom Learning*. Buckinghamshire: Open University Press.

**Purcell K and Pitcher J** 1996 *Great Expectations: The New Diversity of Graduate Skills and Aspirations*. Warwick: Institute for Employment Research.

**Wallman KK** 1993 Enhancing statistical literacy: enriching our society, *Journal of the American Statistical Association* 88 (421): 1-8.

## Using kernel densities to analyse film style

### 1. Introduction

Since a film typically comprises several hundred (if not thousands of) shots, describing its style clearly and concisely can be challenging. This is further complicated by the fact that editing patterns change over the course of a film. Numerical summaries are useful but limited in the amount of information they can convey about the style of a film, and while two films may have the same median shot length or interquartile range they may have very different editing patterns. Numerical summaries are useful for describing the whole of a data set but are less effective when it comes to accounting for changes in style over time. These problems may be overcome by using graphical as well as numerical summaries to communicate large amounts of information quickly and simply. Graphs also fulfil an analytical role, providing insights into a data set and revealing its structure. A good graph not only allows the reader to see what is important about a data set the writer wishes to convey, but also enables the researcher to discover what is important in the first place.

It should be common practice in the statistical analysis of film style to include graphical summaries of film style (though this is rarely the case), and there are several different types of simple graphs that can be used. These include cumulative distribution functions, box-plots, vioplots, and time-ordered displays such as run charts and order structure matrices. In this post I describe two different uses of kernel density estimation as graphical methods for analysing film style. The next section introduces the basics of kernel density estimation. Section three discusses the use of kernel densities to describe and compare shot length distributions, while section four applies kernel densities to the point process of two RKO musicals to describe and compare how cutting rates change over time.

### 2. Kernel Density Estimation

The kernel density is a nonparametric estimate of the probability density function of a data set, and shows us the range of the data, the presence of any outliers, the symmetry of the distribution (or lack thereof), the shape of the peak, and the modality of the data (Silverman 1986; Sheather 2004). A kernel density thus performs the same functions as a histogram but is able to overcome some of the limitations of the latter. Since no assumptions are required about the functional form of the data, kernel densities are a useful graphical method for exploratory data analysis (Behrens & Yu 2003). The purpose of exploratory data analysis is to reveal interesting and potentially inexplicable patterns in data so that we can answer the general question ‘what is going on here?’ Kernel densities allow us to do this by describing the relative likelihood that a shot in a film will take on a particular value, or by allowing us to see how the density of shots in a film changes over time.

The kernel density is estimated by summing the kernel functions superimposed on the data at every value on the *x*-axis. This means that we fit a symmetrical function (the kernel) over each individual data point and then add together the values of the kernels so that the contribution of some data point *x*_{i} to the density at *x* depends on how far it lies from *x*. The kernel density estimator is

$$\hat{f}_{h}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_{i}}{h}\right),$$

where *n* is the sample size, *h* is a smoothing parameter called the *bandwidth*, and *K* is the kernel function. There are several choices for *K* (Gaussian, Epanechnikov, triangular, etc.) though the choice of kernel is relatively unimportant, and it is the choice of the bandwidth that determines the shape of the density since this value controls the width of the kernel. If the bandwidth is too narrow the estimate will contain lots of spikes and the noise of the data will obscure its structure. Conversely, if the bandwidth is too wide the estimate will be over-smoothed and this will again obscure the structure of the data. The kernel density estimate is an improvement on the use of histograms to represent the density of a data set since the estimate is smooth and does not depend on the end-points of the bins, although a shared limitation is the dependence on the choice of the bandwidth. Another advantage of the kernel density is that two or more densities can be overlaid on the same chart for ease of comparison whereas this is not possible with a histogram.
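The effect of the bandwidth can be sketched in a few lines of Python. This is a minimal illustration using only NumPy, not the software used to produce the figures in this post: the shot lengths are hypothetical, the function names `gaussian_kde` and `silverman_bandwidth` are my own, and the use of Silverman's rule of thumb as the reference bandwidth is an assumption, since nothing here depends on how any particular bandwidth was chosen.

```python
import numpy as np

def gaussian_kde(data, grid, h):
    """Evaluate a Gaussian kernel density estimate on `grid` with bandwidth `h`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance from every grid point to every data point: shape (len(grid), n)
    u = (grid[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    # Sum the kernels at each grid point and normalise by n * h
    return kernels.sum(axis=1) / (data.size * h)

def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    sd = data.std(ddof=1)
    q75, q25 = np.percentile(data, [75, 25])
    return 0.9 * min(sd, (q75 - q25) / 1.34) * data.size ** (-0.2)

# Hypothetical shot lengths in seconds (illustrative only, not data from any film)
shots = np.array([1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 3.5, 4.0,
                  4.8, 5.6, 6.1, 7.4, 9.0, 14.8, 15.5, 22.3])

# A wide grid so that even the over-smoothed estimate keeps its tails
grid = np.linspace(-60.0, 100.0, 4001)
step = grid[1] - grid[0]

h0 = silverman_bandwidth(shots)
results = {}
for label, h in [("narrow", h0 / 5), ("rule of thumb", h0), ("wide", h0 * 5)]:
    density = gaussian_kde(shots, grid, h)
    results[label] = density
    area = density.sum() * step  # numerical check that the estimate integrates to about 1
    print(f"{label:>13}: h = {h:.3f}, peak density = {density.max():.4f}, area = {area:.3f}")
```

Each estimate integrates to roughly one, but the height of the peak falls as the bandwidth grows: the narrow bandwidth produces tall spikes around individual shots while the wide bandwidth flattens the structure of the data.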

Figure 1 illustrates this process for *Deduce, You Say* (Chuck Jones, 1956), in which the density shows how the shot lengths of this film are distributed. Beneath the density we see a 1-D scatter plot in which each line indicates the length of a shot in this film (*x*_{i}), with several shots having identical values. The Gaussian kernels fitted over each data point are shown in red and the density at any point on the *x*-axis is equal to the sum of the kernel functions at that point. The closer the data points are to one another the more the individual kernels overlap and the greater the sum of the kernels – and therefore the greater the density – at that point.

All widely available statistical software packages produce kernel density estimates for a data set. An online module for calculating kernel densities can be found here.

### 3. Describing and comparing shot length distributions

A shot length distribution is a description of the data set created for a film by recording the length of each shot in seconds. Analysing the distribution of shot lengths in a motion picture allows us to answer questions such as ‘is this film edited quickly or slowly?’ and ‘does this film use a narrow or a broad range of different shot lengths?’ Comparing the shot length distributions of two or more films allows us to determine if they have similar styles: is film A edited more quickly than film B and does it exhibit more or less variation in its use of shot lengths? A kernel density estimate provides a simple method for answering these questions.

From the kernel density of *Deduce, You Say* in Figure 1 we see the distribution of shot lengths is asymmetrical with the majority of shots less than 10 seconds long. There is a small cluster of shots around 15 seconds in length, and there are three outliers greater than 20 seconds. From just a cursory glance at Figure 1 we can thus obtain a lot of information very quickly that can then guide our subsequent analysis. For example, we might ask what events are associated with the longer takes in this film.

**Figure 1** The kernel density estimate of shot lengths in *Deduce, You Say* (Chuck Jones, 1956) showing the kernel functions fitted to each data point (N = 58, Bandwidth = 1.356)

Suppose we wanted to compare the shot length distributions of two films. Figure 2 shows the kernel density estimates of the Laurel and Hardy shorts *Early to Bed* (1928) and *Perfect Day* (1929). It is immediately clear that, though both distributions are positively skewed, the shot length distributions of these two films are very different. The density of shot lengths for *Early to Bed* covers a narrow range of shot lengths while that for *Perfect Day* is spread out over a wide range of shot lengths. The high density at ~2 seconds for *Early to Bed* shows that the majority of shots in this film are concentrated at the lower end of the distribution with few shots longer than 10 seconds, while the lower peak for *Perfect Day* shows there is no similar concentration of shots of shorter duration and the shot lengths are spread out across a wide range (from 20 to 50.2 seconds) in the upper tail of the distribution. We can conclude that *Early to Bed* is edited more quickly than *Perfect Day* and that its shot lengths exhibit less variation; and though we could have come to these same conclusions using numerical summaries alone the comparison is clearer and more intuitive when represented visually.

**Figure 2** Kernel density estimates of shot lengths in *Early to Bed* (1928) and *Perfect Day* (1929)
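A comparison of this kind can be sketched by evaluating two densities on a common grid. The example below is only an illustration under stated assumptions: the two samples are simulated from log-normal distributions rather than taken from the Laurel and Hardy shorts, the names `film_a` and `film_b` are hypothetical, and using a single fixed bandwidth for both films is a choice made for simplicity rather than a recommendation.

```python
import numpy as np

def gaussian_kde(data, grid, h):
    """Gaussian kernel density estimate of `data` evaluated on `grid`."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)

# Simulated shot lengths in seconds (assumption: log-normal, purely illustrative)
film_a = rng.lognormal(mean=0.8, sigma=0.5, size=300)  # quickly edited, narrow range
film_b = rng.lognormal(mean=2.0, sigma=0.9, size=80)   # slowly edited, wide range
films = {"A": film_a, "B": film_b}

# Evaluate both densities on the same grid with the same bandwidth
grid = np.linspace(0.0, 60.0, 1201)
h = 1.0
densities = {name: gaussian_kde(x, grid, h) for name, x in films.items()}

summary = {}
for name, x in films.items():
    d = densities[name]
    q25, q50, q75 = np.percentile(x, [25, 50, 75])
    summary[name] = {"mode": grid[d.argmax()], "peak": d.max(),
                     "median": q50, "iqr": q75 - q25}
    print(f"film {name}: mode at {summary[name]['mode']:.1f}s, "
          f"peak density {summary[name]['peak']:.3f}, "
          f"median {q50:.1f}s, IQR {q75 - q25:.1f}s")
```

Because both curves share the same grid they can be drawn on one chart (for example with a plotting library of your choice), and the taller, narrower peak identifies the more quickly edited film in the same way the numerical summaries do.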

### 4. Time series analysis using kernel densities

Film form evolves over time and we can use kernel density estimation to describe the *cutting rate* of a film. Rather than focussing on the length of a shot (*L*) as the time elapsed between two cuts, we are interested in the timing of the cuts (*C*) themselves. There is a one-to-one correspondence between cuts and shot lengths, and the time at which the *j*th cut occurs is equal to the sum of the lengths of the prior shots:

$$C_{j} = \sum_{i=1}^{j} L_{i}.$$

Figure 3 shows the one-to-one nature of this relationship clearly.
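The correspondence between shot lengths and cut times is easy to check numerically. In this short sketch the shot lengths are hypothetical and the variable names are my own; the running total gives the cut times, and differencing the cut times recovers the shot lengths, assuming the film starts at time zero.

```python
import numpy as np

# Hypothetical shot lengths in seconds
shot_lengths = np.array([4.2, 1.5, 1.1, 8.0, 2.7])

# Time of the j-th cut: running total of the shot lengths up to and including shot j
cut_times = np.cumsum(shot_lengths)

# Going the other way: difference the cut times, treating the start of the film as time 0
recovered = np.diff(cut_times, prepend=0.0)

print("cut times:       ", cut_times)
print("recovered shots: ", recovered)
```

Note that under this convention the final value in `cut_times` is the total running time, since the last "cut" marks the end of the final shot.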

**Figure 3** The one-to-one relationship between shot lengths (*L*_{i}) and the timing of a cut (*C*_{j})

Analysis of the cutting rate requires us to think of the editing of a film as a *simple point process* (Jacobsen 2006). A point process is a stochastic process whose realizations comprise a set of point events in time, which for a motion picture is simply the set of times at which the cuts occur. We apply the same method used above to the point process to produce a density estimate of the time series. Just as the density in the above examples is greatest where the data points lie close together, the density of the point process is greatest where one cut quickly follows another and, therefore, where the shot lengths are shortest. Conversely, low densities indicate shots of longer duration as consecutive cuts will be distant from one another on the *x*-axis. This is similar to the use of peri-stimulus time histograms and kernel methods in neurophysiology to visualize the firing rate and timing of neuronal spike discharges (see Shimazaki & Shinomoto 2010).

Using kernel density estimation to understand the cutting rate of a film as a point process is advantageous since it requires no assumptions about the nature of the process. Salt (1974) suggested using Poisson distributions as a model of editing as a point process described by the rate parameter λ, but this method is unrealistic since homogeneous Poisson point processes are useful only for applications involving temporal uniformity (Streit 2010: 1). For a motion picture the probability distribution of a cut occurring at any point in time is not independent of previous cuts, and the time series will often be non-stationary over the course of a film while also demonstrating acceleration and deceleration of the cutting rate because different types of sequences are characterised by different editing regimes. We expect to see clusters of long and short takes in a motion picture and so the assumption of a Poisson process will not be appropriate, while the presence of any trends will mean that the process does not satisfy stationarity. Modelling the cutting rate as an inhomogeneous Poisson point process by allowing λ to vary as a function of time may solve some – though not necessarily all – of these problems.
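One way to see why clustering is at odds with a homogeneous Poisson process is a small simulation. In a homogeneous Poisson process the times between events are independent and exponentially distributed, so the coefficient of variation of the shot lengths should be close to 1 and consecutive shot lengths should be uncorrelated. The sketch below compares such a process with a hypothetical film that alternates between fast and slow editing regimes; the regime means, block sizes, and function names are all assumptions chosen for illustration, and these two diagnostics are informal checks rather than a formal test.

```python
import numpy as np

def coefficient_of_variation(x):
    """Standard deviation divided by the mean."""
    return float(x.std(ddof=1) / x.mean())

def lag1_autocorrelation(x):
    """Correlation between each shot length and the one that follows it."""
    centred = x - x.mean()
    return float((centred[:-1] * centred[1:]).sum() / (centred * centred).sum())

rng = np.random.default_rng(1)

# Homogeneous Poisson process: independent exponential shot lengths with mean 5 s
homogeneous = rng.exponential(scale=5.0, size=2000)

# Hypothetical film with editing regimes: blocks of 50 shots whose mean length
# alternates between 2 s and 12 s (40 blocks, 2000 shots in total)
regime_means = np.repeat(np.tile([2.0, 12.0], 20), 50)
clustered = rng.exponential(scale=regime_means)

for label, x in [("homogeneous", homogeneous), ("clustered", clustered)]:
    print(f"{label:>11}: CV = {coefficient_of_variation(x):.2f}, "
          f"lag-1 autocorrelation = {lag1_autocorrelation(x):.2f}")
```

The simulated film with regimes shows both more variation than the exponential model allows and a positive correlation between neighbouring shots, which is the signature of long takes clustering with long takes and short takes with short takes.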

To illustrate the use of kernel densities in time series analysis we compare the editing of two films that feature Fred Astaire and Ginger Rogers: *Top Hat* (1935) and *Shall We Dance* (1937). In order to make a direct comparison between the evolution of the cutting rates, the running time of each film was normalised to a unit length by dividing each shot length by the total running time. In this case we treat slow transitions (e.g. fades, dissolves, etc.) as cuts, with the cut between two shots marked at the approximate midpoint of the transition. Figure 4 shows the resulting densities.
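The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shot lengths are invented, the Gaussian kernel and the bandwidth of 0.03 are arbitrary choices rather than the settings used for Figure 4, and the function names `cut_times` and `gaussian_kde` are hypothetical.

```python
import math

def cut_times(shot_lengths):
    """Return the cut times as fractions of the total running time.

    The cumulative sum of the shot lengths gives the time of each cut;
    the end of the final shot is the end of the film, not a cut, so it is dropped.
    """
    total = sum(shot_lengths)
    times, elapsed = [], 0.0
    for length in shot_lengths[:-1]:
        elapsed += length
        times.append(elapsed / total)
    return times

def gaussian_kde(points, bandwidth, grid):
    """Evaluate a Gaussian kernel density estimate of `points` at each grid value."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
        for x in grid
    ]

# Invented data (seconds): rapid cutting, a run of long takes, rapid cutting again.
shots = [2.0] * 20 + [30.0] * 4 + [2.0] * 20
cuts = cut_times(shots)
grid = [i / 100 for i in range(101)]
density = gaussian_kde(cuts, bandwidth=0.03, grid=grid)

# The density is high where cuts cluster and low during the long takes.
print(round(density[5], 2), round(density[50], 2), round(density[80], 2))
```

The choice of bandwidth controls how much local detail survives in the curve, so in practice it would need to be selected with more care than it is here.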

From the plot in Figure 4 for *Top Hat* we see the density for this film comprises a series of peaks and troughs, but that there is no overall trend. The low densities in this graph are associated with the musical numbers, while the high densities occur with scenes based around the rapid dialogue between Astaire and Rogers. (See here for alternative time series analyses of *Top Hat* that use different methods but arrive at the same conclusions as those below).

The first musical number is ‘No Strings (I’m Fancy Free)’, which begins at ~0.07. Astaire is then interrupted when Rogers storms upstairs to complain about the racket, and we have a scene between the two in which both the dialogue and the editing are rapid. This occurs at the peak at ~0.11 to ~0.13, and is then followed by a reprise of ‘No Strings,’ which is again shot in long takes. The next section of the film follows on the next day as Astaire takes on the role of a London cabby and drives Rogers across town, and as before this dialogue scene is quickly edited, resulting in a high density of shots at ~0.19. This sequence finishes with ‘Isn’t This a Lovely Day (to be Caught in the Rain),’ which accounts for the low density of shots at ~0.21 to ~0.27 since this number again comprises long takes. The rapid cutting rate during dialogue scenes is repeated when Rogers mistakes Astaire for a married man at the hotel, and is again followed by the low density of a slow cutting rate for the scenes between Astaire and Edward Everett Horton at the theatre and the number ‘Top Hat, White Tie and Tails’ at ~0.4. After this number the action moves to Italy and there is much less variation in the density of shots in the first part of these scenes, which are focussed on dialogue and narrative. There is no big musical number until ‘Cheek to Cheek’ and this sequence accounts for the low density seen at ~0.66, being made up of just 13 shots that run to 435.7 seconds. The density increases again as we move back to narrative and dialogue until we get to the sequence in which Horton explains the mix-up over who is married and who is not to the policeman and ‘The Piccolino,’ which begins at ~0.89 and runs until ~0.96.

The density plot of the point process for *Shall We Dance* differs from that of *Top Hat* showing a trend over the running time of the film from higher to lower densities of shots, indicating the cutting rate in this film slows over the course of the film. Nonetheless we see the same pattern of troughs and peaks, and as in *Top Hat* these are associated with musicals and comedy scenes, respectively.

This film features numerous short dancing excerpts in its early scenes, but there is no large scale musical number until well into the picture. In fact, these early scenes are mostly about stopping Astaire dancing (e.g. when Horton keeps turning off the record), and the dialogue scenes that establish the confusion over Astaire’s married status as the ship departs France. These scenes are based around a similar narrative device to that used in *Top Hat* and are again edited quickly. The first big number in the film is ‘Slap that Bass’ and coincides with the low density section of the film beginning at ~0.17, indicating that this part of the film is edited more slowly than the first section. The cutting rate slowly increases until ~0.37, and this section includes the ‘Walking the Dog’ and ‘I’ve Got Beginner’s Luck’ numbers but is mostly made up of dialogue scenes between Astaire and Rogers. After this point the film exhibits a trend from higher to lower densities and there are a number of smaller cycles present between 0.37 and 0.64. This section includes the number ‘They All Laughed (at Christopher Columbus)’ and the subsequent dance routine, which begins at ~0.48 and includes the trough at ~0.54.
The low density section beginning at 0.64 is the scene between Astaire and Rogers in which they try to avoid reporters in the park, and comprises a number of lengthy dialogue shots and the film’s most famous number, ‘Let’s Call the Whole Thing Off.’ The editing then picks up during the dialogue scenes until we reach the next drop in the density at ~0.74, which coincides with the scenes on the ferry to Manhattan as Astaire sings ‘They Can’t Take That Away From Me.’ The next low density section begins at ~0.9, and is the big production at the end of the film, with the distant framing and static camera complementing the long takes in showing off the ‘Hoctor’s Ballet’ sequence. This then gives way to a more rapidly cut section featuring numerous cutaways from the dancers to Rogers arriving at the theatre with the court order for Astaire, only to discover him on stage with dancers wearing masks of her face. The cutting rate then slows once more as Rogers insinuates herself into the ‘Shall We Dance’ routine and the film reaches its finale.

**Figure 4** Kernel density estimates of the point processes for two RKO musicals with normalised running times

Comparing the two plots we note some of the low density periods coincide with one another. This is most clearly the case at around 0.2 and 0.64 in both films. The major numbers that end the films also occur at similar points in the narratives. This indicates that a musical number occurs at approximately the same points in both films even though the two films have different running times (*Top Hat*: 5819.9s, *Shall We Dance*: 6371.4s). This raises some interesting questions regarding the structure of other musicals featuring Astaire and Rogers. Is there always a musical number about a fifth of the way into an RKO musical featuring this pair? Is there always a major number about two-thirds of the way through the picture? And does the finale always occupy the last 10 per cent of the picture? Answers to these questions will have to wait until I finish transcribing all the films Astaire and Rogers made for RKO in the 1930s.

### 5. Conclusion

Kernel density estimation is a simple method for analysing the style of motion pictures, and the wide availability of statistical packages makes the use of kernel densities easy to incorporate into empirical research. Since it requires no prior assumptions about the distribution of the data this method is appropriate for exploratory data analysis. In this paper we demonstrated how this method may be used to describe and compare the shot length distributions of motion pictures and for the time series analysis of film style.

### References

**Behrens JT and Yu C-H** 2003 Exploratory data analysis, in JA Schinka and WF Velicer (eds.) *Handbook of Psychology: Volume 2 – Research methods in Psychology*. Hoboken, NJ: John Wiley & Sons: 33-64.

**Jacobsen M** 2006 *Point Process Theory and Applications: Marked Point and Piecewise Deterministic Processes*. New York: Birkhauser.

**Salt B** 1974 Statistical style analysis of motion pictures, *Film Quarterly* 28 (1): 13-22.

**Sheather SJ** 2004 Density estimation, *Statistical Science* 19 (4): 588-597.

**Shimazaki H and Shinamoto S** 2010 Kernel bandwidth optimization in spike train estimation, *Journal of Computational Neuroscience* 29 (1-2): 171-182.

**Silverman B** 1986 *Density Estimation for Statistics and Data Analysis*. London: Chapman & Hall.

**Streit RL** 2010 *Poisson Point Processes: Imaging, Tracking, and Sensing*. Dordrecht: Springer.

## The editing structure of Follow the Fleet (1936)

This week I look at the editing structure of the Fred Astaire-Ginger Rogers musical *Follow the Fleet* (1936). I looked at the structure of *Top Hat* in an earlier post, which you can find here. Figure 1 presents the order structure matrix of *Follow the Fleet*, in which white columns indicate shorter shots and darker patches represent clusters of longer takes. A spreadsheet with the raw data (from a PAL DVD and corrected by 1.4016) can be accessed here: Nick Redfern – Follow the Fleet. The opening and closing credits have not been included.

**Figure 1** Order structure matrix of *Follow the Fleet* (1936)
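For readers who want to see how a matrix of this kind might be built, here is a minimal sketch. It assumes one common definition of the order structure matrix, in which the cell in row *i* and column *j* is 1 when shot *j* is longer than shot *i* and 0 otherwise. That definition, the function name `order_structure_matrix`, and the shot lengths are all assumptions for illustration and may differ from how Figure 1 was produced.

```python
def order_structure_matrix(shot_lengths):
    """Binary matrix of pairwise order relations between shots.

    Cell [row][col] is 1 if the shot in position `col` is longer than the
    shot in position `row`, and 0 otherwise (assumed definition).
    """
    n = len(shot_lengths)
    return [
        [1 if shot_lengths[col] > shot_lengths[row] else 0 for col in range(n)]
        for row in range(n)
    ]

# Invented shot lengths in seconds: two long takes among shorter shots.
shots = [3.0, 2.5, 40.0, 35.0, 4.0]
matrix = order_structure_matrix(shots)

# A column with many 1s (dark when plotted) marks a shot longer than most others;
# a column of mostly 0s (white) marks a short shot.
column_sums = [sum(row[col] for row in matrix) for col in range(len(shots))]
print(column_sums)
```

Under this convention a run of adjacent dark columns corresponds to a cluster of longer takes, which is how the patches in the figure are read in the discussion that follows.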

The editing of this film doesn’t show the same clear pattern of alternating between quicker and slower cut segments we see in *Top Hat*. *Follow the Fleet* is certainly cut much more slowly, with a median shot length of 7.5 seconds and an interquartile range of 10.4 seconds compared to *Top Hat*’s median of 5.5s and IQR of 7.2s. In the earlier film the different editing patterns were associated with musical numbers and comedy sequences, but *Follow the Fleet* lacks the comedy element. Randolph Scott is, I’m afraid to say, terribly dull in this film (and calling his character ‘Bilge’ doesn’t help). The spark between Astaire and Rogers that drives *Top Hat*, especially in the first section set in London, is missing here too, and at nearly two hours long this film doesn’t hold the same interest. It somehow achieves the stunning feat of being both lacking in plot and predictable. There does not appear to be any particular trend over time in the editing structure, and this may be due to the high variability of shot lengths. The IQR noted above is much greater than appears to be typical for Hollywood films of the 1930s (or indeed any period), and so the time series in the order structure matrix looks relatively featureless.

Those features that do stand out in the matrix are those sequences comprising several longer takes and these are typically associated with the musical numbers. However, not all musical numbers are associated with such clusters. For example, ‘We saw the sea’ (shots 1-8) and Harriet Hilliard singing ‘Get thee behind me Satan’ (shots 124-128) do not immediately jump out at you, while the dark column between shots 270 and 286, which is ‘I’d rather lead a band’ running to 351.1 seconds with its extended dance sequence on-board ship, is instantly recognisable.

‘Let yourself go’ appears several times throughout the film, making its bow with Rogers singing between shots 59 and 67, with the comic dance competition to this tune running from shots 132-150. These numbers are not associated with the sort of clusters of longer shots we see in the second half of the matrix, though they are generally slower than other sequences in the first 35 minutes of the film. Rogers’ solo tap dance audition is shot 317, and is followed by a cluster of short shots (319-325) when Astaire overhears how successful she is and decides to sabotage her singing audition. The subsequent disastrous reprise of ‘Let yourself go’ after Rogers’ drink has been spiked occurs at shots 333 to 338. Hilliard singing ‘But where are you?’ begins and ends at shots 356 and 359, respectively, but this does not show up in the matrix as distinguishable from the shots around it.

The musical sequence featuring ‘I’m putting all my eggs in one basket’ begins at shot 416, with Astaire playing piano, and the number itself starts at shot 421 and runs until shot 428 for a total of 334.2 seconds. The most famous sequence from this film accounts for the cluster of long shots from 506 to 534, and includes ‘Let’s face the music and dance.’ The number itself only accounts for the last 2 shots running to 286.0 seconds.

Both *Top Hat* and *Follow the Fleet* were directed by Mark Sandrich, and David Abel was the cinematographer for both films. *Top Hat* was edited by William Hamilton, whereas *Follow the Fleet* was edited by Henry Berman. We do not know enough about RKO’s mode of production to determine how the working relationship between these and other filmmakers was structured, and so we will have to wait and see what the editing structure of other musicals in the Fred Astaire and Ginger Rogers series for the studio will tell us about the authorship of these films (if, indeed, there is any such person).

## Statistical illiteracy in film studies

UPDATE: The paper at the end of this post is now available for advance access at *Literary and Linguistic Computing*, and can be cited as: The log-normal distribution is not an appropriate parametric model for shot length distributions of Hollywood films, *Literary and Linguistic Computing*, Advance Access published December 13, 2012, doi:10.1093/llc/fqs066. I will put up the paginated reference when the print version is released.

This week’s post combines two very different approaches to film studies: on the one hand we have outright anger, and then we have proper research. Both are equally important.

## Are you *au fait* with this?

I wrote a second version of my paper examining the impact of sound technology on shot length distributions of Hollywood films using a larger sample of films. I also expanded on the methodology used (Mann-Whitney U, probability of superiority, etc.) since this has been highlighted as a problem before. (The original version is here). Having finished the article I sent it to *The New Soundtrack* at Edinburgh University Press. The article was turned down 24 hours later, and the reason given for rejecting the article was that, in the editors’ opinion,

our readership might not be quite *au fait* with the methodology you describe in the piece.

Nothing about the quality of the piece; just the lack of confidence the editors have in their readership.

What sort of intellectual cowardice is this? Are film scholars afraid of learning new things? Or is it that journal editors have such a low opinion of their readership that they need to protect them from anything that might be new or unusual? Does the readership of *The New Soundtrack* really not know what a ‘median’ is? Is there no sense of intellectual discovery?

If I was part of the readership of *The New Soundtrack* I would be very unhappy with this. Presumably, if I am a subscriber to an academic journal I am (or at least consider myself to be) a reasonably intelligent person capable of thinking and learning for myself. (Perhaps I am part of the sophisticated readership of *Screen* as well). Do I really need someone to decide for me what I might or might not be *au fait* with? Now I’m wondering what other research I’ve missed out on because an editor has decided what I might or might not be comfortable with.

Have you ever heard anything so pathetic? I have, and this is now the third time I have had a journal reject one of my articles because of the use of statistical methods (see here).

## Statistical literacy

Statistical literacy is defined as

the ability to understand and critically evaluate statistical results that permeate our daily lives – coupled with the ability to appreciate the contributions that statistical thinking can make in public and private, professional and personal decisions (Wallman 1993: 1).

This is relevant to film studies because we encounter statistical information in diverse contexts. Statistics is relevant in film and television studies in the study of film style, in researching the economics of the film industry, in audience studies, and in scientific research on cognition and perception in the cinema. Understanding a great deal of research in film studies assumes that you have at least some degree of statistical literacy.

Gal (2002) argues that statistical literacy comprises two elements:

- a *knowledge* component, in which individuals have the ability to ‘interpret and critically evaluate statistical information, data-related arguments, or stochastic phenomena which they may encounter in diverse contexts’ (2); and
- a *dispositional* component, in which individuals develop a questioning attitude to research that purports to be based on data, a positive view of themselves as ‘individuals capable of statistical and probabilistic reasoning as well as a willingness and interest to “think statistically” in relevant situations’ and ‘a belief in the legitimacy of critical action even if they have not learnt much formal statistics or mathematics’ (19).

Arguably the dispositional component is the most important since the willingness to think statistically is a pre-requisite for learning statistical concepts.

It is clear that the editors of *The New Soundtrack* have concerns about the statistical literacy of their readership. The editors apparently assume their readership will not have the required statistical knowledge to understand research presenting statistical analysis of data, and – much more damaging – they do not believe their readership has the capability or willingness to think statistically.

Altman (2002) notes readers assume that articles published in peer-reviewed journals are scientifically sound. But in order to make an informed interpretation of the material that appears in peer-reviewed sources we need to be able to intelligently interpret it. This means that statistical literacy is a must for film studies, and it is a topic we will return to repeatedly over the rest of the year. In the next section I demonstrate how knowledge of statistical concepts and processes and a questioning attitude are essential in judging the importance of research in film studies.

## The lognormal dragon is slain (again)

An example of the importance of developing statistical literacy in film studies comes in the form of a new book to be published this year featuring a chapter by Jordan De Long, Kaitlin L. Brunick, and James E. Cutting. The link to an online version of this paper is below Figure 1. I won’t explain the statistical concepts in detail, but I have provided links for statistical terms and concepts. I will assume you are an intelligent reader capable of and willing to learn for yourself.

In their chapter on film style, the authors make the following statement about the average shot length (ASL) as a statistic of film style:

Despite being the popular metric, ASL may be inappropriate because the distribution of shot lengths isn’t a normal bell curve, but rather a highly skewed, lognormal distribution. This means that while most shots are short, a small number of remarkably long shots inflate the mean. This means that the large majority of shots in a film are actually below average, leading to systematic over-estimation of individual film’s shot length. A better estimate is a film’s Median Shot Length, a metric that … provides a better estimate of shot length.

In support of this statement they include a graph that purports to show how the shot length distribution of one film is lognormal. This is the only piece of evidence they provide.

Figure 1 Histogram of shot lengths in *A Night at the Opera* with a fitted lognormal distribution from **De Long J, Brunick KL, and Cutting JE** 2012 Film through the human visual system: finding patterns and limits, in JC Kaufman and DK Simonton (eds.) *The Social Science of Cinema*. New York: Oxford University Press: in press. This graph was downloaded from the online version of this paper available at http://people.psych.cornell.edu/~jec7/pubs/socialsciencecinema.pdf.

Clearly, the authors have assumed their readership has a fairly sophisticated level of statistical literacy. They present their argument assuming you will be able to understand it or be capable of learning the relevant concepts. An entirely reasonable way in which to present an argument in a research output, and presumably an attitude that comes from their scientific (rather than film studies) background.

It’s just a shame it’s not true.

The key fact to bear in mind is that a variable (such as shot length) is said to be lognormally distributed if its logarithm is normally distributed, as this allows us to apply a logarithmic transformation and then to try to determine if it is normally distributed.

Figure 2 presents an exploratory data analysis of the data for this film, which can be accessed here.

**Figure 2** Exploratory data analysis of shot lengths in *A Night at the Opera*

In the top left panel we see the histogram of the log-transformed shot lengths and it is immediately obvious this data set is not normally distributed. If De Long, Brunick, and Cutting are right, then this chart should be symmetrical about the mean. The histogram remains skewed even after the transformation is applied. The same pattern can be seen from the kernel density estimate (top right), which is clearly not symmetrical.

The normal probability plot (bottom left) shows the same pattern. If the data came from a lognormal distribution then the points in this plot would lie along a straight line, specifically the red line shown in the plot. It is obvious that this is not the case and that the data points show clear evidence of a skewed data set. In the lower tail the fitted lognormal distribution underestimates the number of shorter takes, while in the upper tail it overestimates the number of longer takes. Definitively NOT lognormal.

Finally, the box plot (bottom right) clearly shows the distribution is asymmetrical with outliers in the upper tail of the distribution. This is a good example of the fact that log-transforming does not always remove the skew from a data set or deal with the problem of outliers.

The marks below the histogram, kernel density, and box plot (called a *rug*) indicate the actual values of the log-transformed shot lengths.

We can also apply formal statistical tests of the hypothesis that the shot length distribution is lognormal. Because a variable is lognormally distributed if its logarithm is normally distributed, then all we have to do is to apply normality tests to the transformed data.

The Shapiro-Francia test is based on the squared correlation of the theoretical and sample quantiles in the probability plot in Figure 2. For this film, the test statistic is 0.9585 and p < 0.01, so it is extremely unlikely that this data comes from a lognormal distribution and we have sufficient evidence to reject this hypothesis.
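As a rough illustration of the statistic itself, the sketch below computes the squared correlation between the ordered sample and approximate expected normal order statistics. It assumes Blom's plotting positions, (i − 3/8)/(n + 1/4), which is a common but not the only choice; the function name `shapiro_francia_w` and the data are invented, and no p-value is computed because that requires a further approximation not shown here.

```python
import math
from statistics import NormalDist, mean

def shapiro_francia_w(values):
    """Squared correlation between the sorted sample and normal scores.

    Normal scores use Blom's approximation to the expected order statistics
    (an assumption; other plotting positions exist).
    """
    x = sorted(values)
    n = len(x)
    scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    x_bar, s_bar = mean(x), mean(scores)
    sxy = sum((a - x_bar) * (b - s_bar) for a, b in zip(x, scores))
    sxx = sum((a - x_bar) ** 2 for a in x)
    sss = sum((b - s_bar) ** 2 for b in scores)
    return sxy ** 2 / (sxx * sss)

# Invented "shot lengths" built to be exactly lognormal in their quantiles.
n = 50
shots = [math.exp(NormalDist().inv_cdf((i - 0.375) / (n + 0.25))) for i in range(1, n + 1)]

w_logged = shapiro_francia_w([math.log(s) for s in shots])  # test of lognormality
w_raw = shapiro_francia_w(shots)                            # test of normality on raw data
print(round(w_logged, 4), round(w_raw, 4))
```

Applying the function to the log-transformed shot lengths tests lognormality, because a lognormal variable is by definition normal after taking logarithms. Values close to 1 are consistent with normality and smaller values indicate departures from it.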

The Jarque-Bera test does the same thing in a different way. This test looks at the skewness (the symmetry) and the kurtosis (the shape of the peak) of the data. For *A Night at the Opera*, the result of this test is 62.48 (p < 0.01) and again we have sufficient evidence to reject the hypothesis that this data comes from a lognormal distribution.
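The Jarque-Bera statistic is simple enough to compute by hand. The sketch below uses the standard form JB = (n/6)(S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis, and the asymptotic chi-square distribution with two degrees of freedom for the p-value. The function name `jarque_bera` and the small sample are illustrative only, and the asymptotic p-value can be unreliable for small samples.

```python
import math

def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# A tiny symmetric sample, purely to show the call.
jb_small, p_small = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])

# Asymptotic p-value implied by the statistic reported in the text (62.48).
reported_p = math.exp(-62.48 / 2.0)
print(jb_small, p_small, reported_p)
```

As with the Shapiro-Francia test, the function would be applied to the log-transformed shot lengths in order to test lognormality.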

In summary, De Long, Brunick, and Cutting present a single piece of evidence in support of the assertion that shot length distributions are lognormal and it’s wrong. In fact, if you wanted to write a book about how shot length distributions are not lognormal and wanted to put an example of this on the cover then *A Night at the Opera* would be the film you would use.

Clearly, there is a problem with the histogram in Figure 1 that shows the shot length data on an untransformed scale. The reason for applying a logarithmic transformation is to make it easier to see the structure of the data, so why not view it on a logarithmic scale? When we view the data on a logarithmic scale we come to the opposite conclusion to that presented by the authors. It requires statistical literacy on the part of the reader to question if this is an appropriate way of presenting data and to question the interpretation presented by the authors.

Obviously we cannot say that just because we can show that one film is not lognormally distributed this is true for all films. In order to properly assess the validity of De Long, Brunick, and Cutting’s assertion we need to test a sample of films representing a defined population, and this is precisely what I have done. The following paper demonstrates that the lognormal distribution is not an appropriate parametric model for the shot length distributions of Hollywood films:

Nick Redfern – The lognormal distribution and Hollywood cinema

**Abstract** We examine the assertion that the two-parameter lognormal distribution is an appropriate parametric model for the shot length distributions of Hollywood films. A review of the claims made in favour of assuming lognormality for shot length distributions finds them to be lacking in methodological detail and statistical rigour. We find there is no supporting evidence to justify the assumption of lognormality in general for shot length distributions. In order to test this assumption we examined a total of 134 Hollywood films from 1935 to 2005, inclusive, to determine goodness-of-fit of a normal distribution to log-transformed shot lengths of these films using four separate measures: the ratio of the geometric mean to the median; the ratio of the shape factor σ to the estimator σ\* = √(2 ln(*m*/*M*)); the Shapiro-Francia test; and the Jarque-Bera test. Normal probability plots were also used for visual inspection of the data. The results show that, while a small number of films are well modelled by a lognormal distribution, this is not the case for the overwhelming majority of films tested (125 out of 134). Therefore, we conclude there is no justification for claiming the lognormal distribution is an adequate parametric model of shot length data for Hollywood films, and recommend the use of robust statistics that do not require underlying parametric models for the analysis of film style.
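The first two measures named in the abstract can be sketched as follows. The sketch assumes that *m* is the arithmetic mean and *M* the median of the shot lengths, and that the shape factor σ is estimated by the sample standard deviation of the log-transformed shot lengths; these readings, the function name `lognormal_ratios`, and the data are assumptions for illustration rather than details taken from the paper. For a lognormal distribution the geometric mean equals the median and ln(mean/median) = σ²/2, so both ratios should be close to 1.

```python
import math
from statistics import mean, median, stdev

def lognormal_ratios(shot_lengths):
    """Return (geometric mean / median, sigma / sigma_star).

    sigma is the sample standard deviation of the logged data (assumed),
    sigma_star = sqrt(2 * ln(mean / median)), which requires mean > median.
    """
    logs = [math.log(s) for s in shot_lengths]
    geometric_mean = math.exp(mean(logs))
    med = median(shot_lengths)
    sigma = stdev(logs)
    sigma_star = math.sqrt(2.0 * math.log(mean(shot_lengths) / med))
    return geometric_mean / med, sigma / sigma_star

# Invented right-skewed shot lengths in seconds.
shots = [1.2, 2.0, 2.5, 3.1, 4.0, 5.5, 7.0, 9.8, 15.0, 42.0]
gm_ratio, sigma_ratio = lognormal_ratios(shots)
print(round(gm_ratio, 3), round(sigma_ratio, 3))
```

Ratios that depart substantially from 1 suggest the lognormal model is a poor description of the data, although how large a departure matters is a judgement that the sketch does not attempt to make.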

Placing this paper alongside my earlier posts testing the lognormality of shot length distributions for Hollywood films prior to 1935 (see here), we can now finally conclude there is no evidence to justify assuming this model for Hollywood films in general.

## References

**Altman DG** 2002 Poor-quality medical research: what can journals do?, *Journal of the American Medical Association* 287 (21): 2765-2767.

**De Long J, Brunick KL, and Cutting JE** 2012 Film through the human visual system: finding patterns and limits, in JC Kaufman and DK Simonton (eds.) *The Social Science of Cinema*. New York: Oxford University Press: in press.

**Gal I** 2002 Adults’ statistical literacy: meanings, components, responsibilities, *International Statistical Review* 70 (1): 1-51.

**Wallman KK** 1993 Enhancing statistical literacy: enriching our society, *Journal of the American Statistical Association* 88 (421): 1-8.

## Correspondence analysis of genre preferences in UK film audiences

UPDATE: this piece has now been published as Correspondence Analysis of Genre Preferences in UK Film Audiences, Participations 9 (2) 2012: 45-55. The article can be downloaded here.

UPDATE: I’ve now done a similar analysis for genre preferences in UK television audiences using data from the same BFI study, which you can find on this blog post.

Genre provides viewers with a first reference point for a film, and functions as a ‘quasi-search’ characteristic through which audiences assess product traits without having seen a particular film (Hennig-Thurau *et al*. 2001). In a market place comprising a large number of unique cultural products with no unambiguous reference brand, audiences form experience-based norms at the aggregate level of genre rather than the specific level of individual films (Desai & Basuroy 2005). Consequently, genre is the means by which the film industry alerts viewers that pleasures similar to those previously enjoyed are available without compromising the need for novel products; and empirical research has shown that genre is an important factor – if not the most important – in audiences’ decision making about which film to see (Litman 1983, Da Silva 1998).

Understanding audience preferences for certain types of films is therefore a priority for film producers and distributors as this will be a factor in deciding which films to produce and how to market them effectively. In this short paper we analyze the genre preferences of UK film audiences, applying correspondence analysis to data produced by the British Film Institute’s research into the cultural contribution of film in the UK. Specifically, we focus on how genre preferences vary with gender and age when treated as a single composite variable.

### The BFI dataset

In July 2011, the British Film Institute (BFI) published a report, *Opening Our Eyes* (Northern Alliance/Ipsos Media CT 2011), examining the cultural contribution of film in the UK [1]. This report analysed how audiences consume films and attitudes to the impact of film based on a series of qualitative ‘paired depth’ interviews and an online survey of 2036 UK adults aged between 15 and 74.

Question C.1 in the questionnaire invited respondents to express preferences for their favourite genres/types of films from a list comprising action/adventure, animation, art house/films with particular artistic value, comedy, comic book movie, classic films, documentary, drama, family film, fantasy, foreign language film, horror, musicals, romance, romantic comedy, science fiction, suspense/thriller, other, none, and don’t know. Respondents were able to select as many genres as they wished, and the data represents the number of respondents expressing a preference for that genre. Figure 7 in the final report presents the breakdown of genre preferences by gender, concluding that male audience members exhibit stronger preferences for science fiction, action/adventure, and horror films while women preferred romantic comedies, family films, romances, and musicals [2]. In an additional detailed summary made available online, genre preferences were broken down by age group. These results showed younger respondents were more likely to select comedy, horror, animation, and comic book as their favourite genres, whereas older audience members were more likely to select dramas, documentaries, and classic films.

The report did not present any findings regarding genre preferences based on the combination of the gender and the age of the subjects, and it is this interaction that is analysed here. In addition to publishing the final report the BFI has made the full set of result tables from the quantitative survey freely available to researchers online. Table 416 of this output contains the data on gender, age, and genre preferences, and is the basis for our correspondence analysis. We use nineteen of the categories listed above, with ‘don’t knows’ excluded from the analysis. Table 416 lists the additional genre categories of westerns, historical, war, and gangster films, and these have been included in the category ‘other.’

### Correspondence analysis

Correspondence analysis (CA) is a multivariate technique for exploring and describing frequency data defined by two or more categorical variables in a contingency table. By calculating chi-square distances between the row and column profiles in a table, CA determines the (dis)similarity of the reported frequencies. CA aims to reveal the structure inherent in the data, and does not assume an underlying probability distribution. Consequently, CA requires that all of the relevant variables are included in the analysis and that the entries in the data matrix are nonnegative, but makes no other assumptions. CA does not support hypothesis testing, and cannot be used to determine the statistical significance of relationships between variables. Here we describe the outputs of the correspondence analysis and their interpretation, and the reader can find introductions to the theory and mathematics of CA in Clausen (1998), Beh (2004), and Greenacre (2007).

The first output of the correspondence analysis is a table describing the variation in the contingency table, referred to as the *inertia*. The total inertia in the table is equal to the chi-square statistic divided by the total sample size: Φ² = χ²/*N*. This variation is decomposed into the principal inertias of a set of dimensions, each accounting for a percentage of the total inertia. For an *r* × *c* table, the maximum number of dimensions is min(*r*-1, *c*-1). The number of dimensions retained for analysis may be based on the first *k* dimensions to cumulatively exceed a threshold (typically 80 or 90 per cent of the total inertia); on all those individual dimensions accounting for more than 100/(min[*r*, *c*] – 1) per cent of the total inertia (that is, more than the average share per dimension); or on a scree plot of the inertias, used to determine where the percentage accounted for by a dimension begins to drop away less rapidly. It is also dependent on our ability to give a meaningful interpretation to the dimensions selected. In selecting only a subset of the available dimensions we lose some of the information contained in the original table, but in discarding some dimensions we are able to see the structure of the data more clearly for as little cost as possible.
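The relationship between the chi-square statistic and the total inertia can be checked directly. The sketch below uses a small invented contingency table, not the BFI data, and the function name `total_inertia` is hypothetical.

```python
def total_inertia(table):
    """Total inertia of a contingency table: chi-square statistic divided by N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    return chi_square / n

# Invented 3 x 2 table of counts.
table = [[30, 10], [10, 30], [20, 20]]
inertia = total_inertia(table)

# Maximum number of dimensions for an r x c table.
max_dims = min(len(table) - 1, len(table[0]) - 1)
# Average share of the inertia per dimension, in per cent.
average_share = 100.0 / max_dims
print(inertia, max_dims, average_share)
```

For this table every expected count is 20, the chi-square statistic is 20, and the total inertia is therefore 20/120.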

As a form of geometric data analysis, correspondence analysis enables the information in a contingency table to be represented as clouds of points in low-dimensional graphical displays (see Le Roux & Rouanet 2005, Greenacre 2010: 79-88). The origin of the graph represents the average row (column) profile, and by assessing the distance of points from the centroid of the clouds we describe the variation within the table and the similarity of its categories. Row (column) points that lie close to the origin have profiles similar to the average row (column) profile. Data points that lie far from the origin indicate categories for which the observed counts differ from the expected values under independence and account for a larger portion of the inertia. Points from the same data set lying close together represent rows (columns) that have similar profiles, and data points that are distant from one another represent rows (columns) with dissimilar profiles. The distance between a row point and a column point cannot be interpreted directly, as it does not represent a defined quantity. Instead, the angle (θ) subtended at the origin defines the association between row and column points: when the angle is acute (θ < 90°) points are interpreted as positively correlated, points are negatively correlated if the angle between them is obtuse (θ > 90°), and points that subtend a right angle (θ = 90°) are not associated (Pusha *et al*. 2009).
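The angle rule can be illustrated with a short Python sketch. The function names, the tolerance, and the coordinates used below are assumptions made for the example. They are not values read from Figure 1.

```python
import math

def angle_at_origin(p, q):
    """Angle in degrees between the vectors from the origin to points p and q.
    Neither point may sit exactly at the origin."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_q)))
    return math.degrees(math.acos(cosine))

def association(p, q, tolerance=1e-9):
    """Classify the association between a row point and a column point."""
    theta = angle_at_origin(p, q)
    if abs(theta - 90.0) <= tolerance:
        return "not associated"
    return "positive" if theta < 90.0 else "negative"

# Hypothetical map coordinates for one row point and three column points.
row_point = (1.0, 0.0)
acute_case = association(row_point, (1.0, 1.0))
obtuse_case = association(row_point, (-1.0, 0.5))
right_case = association(row_point, (0.0, 1.0))
```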

In addition to the graphical displays, a detailed numerical summary of the correspondence analysis is produced. The *mass* of a row (column) indicates the proportion accounted for by that category with respect to all the rows (columns), and is simply the row (column) total divided by the total sample size; while the *inertia* of a data point is its contribution to the overall inertia. The *squared correlation* describes that part of the variation of a data point explained by a particular dimension. The *quality* of a data point measures how well it is represented by the graph, and is equal to the sum of the squared correlations of the dimensions retained for the analysis. The higher the quality of a data point the better the extracted dimensions represent it, with values ranging from 0 (completely unrepresentative) to 1 (perfectly represented). The *absolute contribution* of a data point describes the proportion of the inertia of each dimension it explains, and is determined by both the mass of the data point and its distance from the centroid.
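As a rough sketch of how these summary statistics relate to one another, the Python below computes them from a set of masses and principal coordinates. It assumes the standard definitions: the squared correlation is the squared coordinate on a dimension divided by the squared distance of the point from the centroid, and the absolute contribution is the mass times the squared coordinate divided by the principal inertia of that dimension. The function name and all numbers are hypothetical, and the **ca** package reports these quantities in its own scaled format.

```python
def point_summaries(masses, coords, retained):
    """Summary statistics for each point.

    masses   -- mass of each point (proportions summing to 1)
    coords   -- principal coordinates of each point on ALL dimensions
    retained -- number of leading dimensions kept for the map
    """
    n_dims = len(coords[0])
    # Principal inertia of a dimension: mass-weighted sum of squared coordinates.
    principal = [sum(m * f[k] ** 2 for m, f in zip(masses, coords))
                 for k in range(n_dims)]
    summaries = []
    for m, f in zip(masses, coords):
        d2 = sum(x ** 2 for x in f)  # squared distance from the centroid
        sqcorr = [x ** 2 / d2 for x in f]
        summaries.append({
            "mass": m,
            "inertia": m * d2,
            "squared_correlation": sqcorr,
            "quality": sum(sqcorr[:retained]),
            "contribution": [m * f[k] ** 2 / principal[k] for k in range(n_dims)],
        })
    return summaries

# Hypothetical masses and principal coordinates for three points on three dimensions.
toy_masses = [0.5, 0.3, 0.2]
toy_coords = [
    [0.20, 0.10, 0.00],
    [-0.30, 0.05, 0.10],
    [-0.05, -0.325, -0.15],
]
two_dims = point_summaries(toy_masses, toy_coords, retained=2)
all_dims = point_summaries(toy_masses, toy_coords, retained=3)
```

When every dimension is retained the quality of each point is 1, and the contributions of all points to any one dimension sum to 1.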

### Gender, age, and genre preferences

Table 416 of the BFI’s results output presents counts of genre preferences sorted by gender, by age, and by gender and age. As our interest lies in the variation of genre preferences (19 categories) among UK audiences based on both gender and age we use only this last part of the table, treating ‘gender-age’ as an interactively coded variable with 10 categories combining all the levels of the variables gender (2 categories) and age (5 categories) (Greenacre 2007: 121-128). We apply correspondence analysis to this table using the **ca** package (version 0.33; see Nenadić & Greenacre 2007) in **R** (version 2.13.0).
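Interactive coding simply means creating one category for every combination of the levels of the two variables. A minimal Python sketch of the idea follows. The age bands are those used in the text, while the label format, the function name, and the response records are hypothetical and are not drawn from the BFI data.

```python
from collections import Counter
from itertools import product

GENDERS = ["male", "female"]
AGE_GROUPS = ["15-24", "25-34", "35-44", "45-54", "55+"]

# Interactive coding: one category per combination of gender and age group.
GENDER_AGE = [f"{gender} {age}" for gender, age in product(GENDERS, AGE_GROUPS)]

def cross_tabulate(records):
    """Count (gender-age, genre) pairs from (gender, age_group, genre) records."""
    counts = Counter()
    for gender, age_group, genre in records:
        counts[(f"{gender} {age_group}", genre)] += 1
    return counts

# Hypothetical responses, for illustration only.
records = [
    ("female", "55+", "musical"),
    ("male", "15-24", "horror"),
    ("female", "55+", "musical"),
    ("male", "15-24", "comedy"),
]
table = cross_tabulate(records)
```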

Table 1 presents the 10 × 19 cross-tabulation of ‘gender-age’ with genre. The chi-square statistic for this table is 1312.28 (*N* = 13086, df = 162, *p* < 0.01), and we therefore conclude that there is a statistically significant association between gender-age and genre preferences for UK film audiences. However, there is only a weak correlation between ‘gender-age’ and genre preference, with just 10% of the variation in Table 1 due to dependence: Φ² = χ²/*N* = 1312.28/13086 = 0.1003.
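The degrees of freedom and the total inertia quoted here can be checked directly from the reported figures. The short Python check below uses only the values given in the text, and the variable names are illustrative.

```python
# Figures reported in the text for Table 1.
chi_square = 1312.28
n = 13086
n_rows, n_cols = 10, 19

# Degrees of freedom for an r x c table: (r - 1)(c - 1).
degrees_of_freedom = (n_rows - 1) * (n_cols - 1)

# Total inertia: chi-square divided by the sample size.
phi_squared = chi_square / n

print(degrees_of_freedom, round(phi_squared, 4))  # 162 0.1003
```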

**Table 1** Cross-tabulation of interactively-coded gender-age variable with genre. Cell counts represent the number of respondents in each group expressing a preference for a genre. Source: BFI/Northern Alliance/Ipsos Media CT.

Table 2 shows the principal inertias, percentages, and cumulative percentage of each dimension, with a scree plot of the inertias. The first two dimensions account for 90.6 per cent of the inertia and the scree plot flattens out after the second dimension. Consequently, these dimensions were retained for analysis and the remainder were discarded.
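As a quick check on this decision, the retention rules described earlier can be applied to the figures reported in the text. The sketch below uses the shares for the first two dimensions given in the discussion of Figure 1 (64.3 and 26.3 per cent). The remaining dimensions are not reproduced here, so only the cumulative and average-share rules are checked, and the variable names are illustrative.

```python
n_rows, n_cols = 10, 19

# Maximum number of dimensions for the 10 x 19 table.
max_dimensions = min(n_rows - 1, n_cols - 1)

# Average share of inertia per dimension, in per cent.
average_share = 100.0 / max_dimensions

# Percentages of inertia reported in the text for dimensions 1 and 2.
reported_shares = [64.3, 26.3]
cumulative = round(sum(reported_shares), 1)

# Both rules point to keeping the first two dimensions.
exceeds_average = [share > average_share for share in reported_shares]
meets_threshold = cumulative >= 90.0
```

With nine possible dimensions the average share is about 11.1 per cent, so both retained dimensions sit well above it.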

**Table 2** Principal inertias of the correspondence analysis applied to Table 1 explained by dimensions with scree plot

Figure 1 is the resulting symmetric map based on these two dimensions. Tables 3a and 3b present the detailed numerical summary of the results for the rows (gender-age categories) and columns (genre categories), respectively.

**Figure 1** Symmetric correspondence analysis map of interactively coded ‘gender-age’ cross-tabulated with genre for UK film audiences

**Table 3a** Detailed numerical summary of correspondence analysis by gender-age.

**Table 3b** Detailed numerical summary of correspondence analysis by genre.

From Table 3a and Figure 1 we see a clear horizontal separation between the male and female respondents, with points arranged vertically by age group from youngest to oldest within each gender category. Consequently, we interpret the principal axes in terms of the rows of Table 1, with the first dimension understood as gender and the second dimension as age. As gender accounts for 64.3 per cent of the total inertia compared to 26.3 per cent for age, this factor is dominant and explains the major part of the variation in Table 1. The quality for the gender-age groups is high (see Table 3a), and these factors are well represented in two dimensions. The points for all gender-age groups are distant from the origin, indicating that no group is close to the average profile in either dimension and that all the groups contribute to the overall inertia.

From Figure 1 we see that the distance between the points representing male audience members grows as the age of the respondents increases. The points for males aged 15-24 and 25-34 are very close, indicating they have similar row profiles and, therefore, similar genre preferences. The two middle-aged groups are distant from both the youngest and the oldest, while also being remote from one another. Males over the age of 55 are remote from the other age groups, indicating that their genre preferences are substantially different from those of younger male audience members. The points representing female respondents show a similar pattern, with the middle-aged groups distant from both youngest and oldest and with over 55s remote from younger female audience members in their preferences. The greatest contrasts in genre preferences are observed when taking gender and age together: females over 55 are most different from males aged 15-24, and males aged 55+ are most different from young women.

A key difference between audience groups is how the importance of the factors of gender and age varies in explaining their genre preferences. Age becomes increasingly important in the representation of the points for male audience categories. The squared correlations for the three youngest male groups are greatest for dimension 1, indicating that their gender is more important in explaining their preferences than age; for males aged 45-54 gender is still the dominant component, albeit to a lesser extent than for younger cohorts, and the influence of age becomes more apparent in the raised squared correlation for dimension 2; while for males aged 55+ age is the dominant factor. This pattern is not evident for female respondents, and looking at the squared correlations in Table 3a we see the opposite pattern to male audience members. The squared correlations for women aged 35-44, 45-54, and 55+ are dominated by the dimension of gender, whereas age is the main factor for the two youngest groups. However, it should be noted that for females aged 15-24, gender does contribute substantially to the representation of this point.

Although the correlation between gender-age and genre preference is low, it is clear from these results that the variation within Table 1 is highly structured in terms of the gender and age of the respondents. Describing the preferences of UK cinemagoers therefore requires taking *both* these factors into account and failure to do so leads to much useful information being obscured. The headline percentages reported by the BFI give only a partial picture of the genre preference of UK film audiences that fails to adequately capture that structure.

Turning to the genre categories themselves we see that the quality of these points is high (see Table 3b), indicating they are well represented in two dimensions and that gender and age are good predictors of the genre preferences of UK audiences. However, we note the quality of the representation for foreign (0.41) and art-house (0.14) films by these two dimensions is very low. This indicates gender and age do not explain variation in audience preferences for these types of films, and that some other factor should be considered. Based on other data available in the BFI’s results output, level of educational attainment is a better predictor of audience preference for these types of films: Table 20 of the results output cross-tabulates level of education and type of film most often watched, and 68 per cent of the respondents who selected foreign language films were educated to degree level. These two categories are typically applied to films to distinguish them from mainstream cinema (i.e. Hollywood films), and may not function as genre labels in the same way as terms such as ‘comedy,’ ‘drama,’ etc.

The quality of the categories ‘other’ and ‘none’ is also much lower than that of the mainstream genres, but as these points represent indistinct categories we do not discuss them further.

Gender is the most important factor in determining genre preference, with the cloud of points representing genres orientated along the first principal axis. Family films, romance, and romantic comedies are all associated with female audiences. In fact, 83 per cent of respondents expressing a preference for romance films were female, and the corresponding figures are also high for family films (64%) and romantic comedies (72%). Musicals are also strongly associated with female audiences (71%), but this category is dominated by over 55s: over a quarter of respondents expressing a preference for this genre are in this age group. Drama also lies along the same direction as females over 55, indicating that this group is associated with this genre, but the distance from the origin is smaller, reflecting a smaller effect. The proportion of males over 55 selecting drama films as a preferred genre is also greater than that of younger male viewers, but not to the same extent as their female counterparts. In fact, female viewers in each age group expressed a stronger preference for drama films than male viewers of the same age.

Genres associated with male audiences tend to be action-based and technology-driven. Of respondents expressing a preference for science fiction films, 65 per cent were male and there is little variation between age groups within this gender category. Consequently, this genre is very well represented by the first principal axis and age is not a significant factor. This is also the case for action/adventure films (58%), albeit to a lesser degree as this point lies nearer the origin. Comic book, fantasy, and horror films are strongly correlated with male audiences, and lie along the same direction as males aged 15-24 and 25-34, indicating that age is also a key factor here. The squared correlations for gender are the largest for these genres, but age also contributes a substantial part of these points’ representation.

It is interesting that genres we associate with male audiences appear to have broader appeal than genres we associate with female audiences. Dividing the cells by the column totals to give the proportion of respondents in each gender-age group expressing a preference for a genre, we see that no male age group accounts for more than 4 per cent of the total for romance films, compared to the very large proportion for female audiences noted above. Although associated with female audiences, family films do not show the same extreme divide as romance films, romantic comedies, and musicals. For science fiction films, female respondents account for a total of 35 per cent of the expressed preferences for this genre, with each age group within this gender category contributing between 5 and 8 per cent of the total. This is also the case for comic book and action/adventure films. We conclude that so-called ‘female genres’ hold very little appeal to male audiences; and that while similar patterns are certainly evident for ‘male genres’ the effect is much smaller.
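The column proportions used in this paragraph are obtained by dividing each cell by its column total. A minimal Python sketch follows, with a hypothetical function name and a toy table standing in for Table 1.

```python
def column_profiles(table):
    """Divide each cell by its column total, so every column sums to 1."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / total for cell, total in zip(row, col_totals)]
            for row in table]

# Hypothetical counts: two audience groups (rows) by two genres (columns).
toy_table = [
    [30, 5],
    [10, 15],
]
profiles = column_profiles(toy_table)
# The first group accounts for 30/40 of the first genre and 5/20 of the second.
```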

Three genres show high squared correlations with age. In all three cases the contribution of the first principal axis is small, and we conclude that gender is relatively unimportant in explaining audience preferences for these films. Animation is associated with under 35s, though female viewers aged 35-44 account for 13 per cent of the column total in Table 1, possibly due to selecting these films for family viewing. Documentaries and classic films are associated with over 55s. Of those expressing a preference for documentaries, 18 per cent were males over 55 and 17 per cent were females in the same age group. There is no specific trend among the other age groups, which show roughly equal levels of interest in these films. It is noticeable that the proportion selecting classic films increases with age, though this may reflect the aging of the audience rather than a clear genre preference as the new films of one’s youth become classics with time.

Two genres, comedy and suspense/thriller, lie near the origin. These points also have the lowest quality of the mainstream genres, though both are still well represented in Figure 1. Both dimensions contribute to the representation of these points, indicating that gender and age are relevant factors. Gender makes a larger contribution to comedy than age, with males under 35 slightly more likely to express a preference for this genre than males over 35 or female viewers; while for suspense/thrillers over 55s of both genders account for a slightly greater proportion of the preferences expressed for this category. However, it is their closeness to the average profile that is most informative about these points, indicating that all gender-age groups enjoy these types of films. This does not mean that they are watching the same films *within* these genres: it is very unlikely that males aged 15-24 are watching the same comedy films, for example, as women over 55; but the BFI’s data cannot help us to explore this aspect.

### Conclusion

This study analysed the genre preferences of British film audiences. We have replicated the results originally presented by the BFI, and have extended them to reveal additional patterns in the data. Correspondence analysis enables us to obtain an overview of how different sections of the audience for films in the UK relate to one another, and to assess the relative importance of different factors in explaining the variation among audiences and their genre preferences. The study showed that gender is the dominant factor in determining audience preferences, with age an important but secondary factor. Most genres can be identified as either ‘male’ or ‘female’ with clear age profiles evident within gender categories, though preferences for animated films, classic movies, and documentaries are determined by age alone. These factors do not adequately explain variation among audiences when applied to categories of films that lie outside mainstream cinema.

### Notes

1. The report, the research questionnaire, the detailed summary, and the full set of result tables are available at http://www.bfi.org.uk/publications/openingoureyes/, accessed 21 November 2011.

2. The report also presents results based on respondents’ ethnic minority status but these will not be discussed here.

### References

**Beh EJ** 2004 Simple correspondence analysis: a bibliographic review, *International Statistical Review* 72 (2): 257-284.

**Clausen S-E** 1998 *Applied Correspondence Analysis: An Introduction*. Thousand Oaks, CA: Sage.

**Da Silva I** 1998 Consumer selection of motion pictures, in BR Litman (ed.) *The Motion Picture Mega-industry*. Boston: Allyn and Bacon: 144-171.

**Desai KK and Basuroy S** 2005 Interactive influence of genre familiarity, star power, and critics’ reviews in the cultural goods industry: the case of motion pictures, *Psychology and Marketing* 22 (3): 203-223.

**Greenacre M** 2007 *Correspondence Analysis in Practice*, second edition. Boca Raton, FL: Chapman & Hall/CRC.

**Greenacre M** 2010 *Biplots in Practice*. Bilbao: Fundación BBVA.

**Hennig-Thurau T, Walsh G, and Wruck O** 2001 An investigation into the factors determining the success of service innovations: the case of motion pictures, *Academy of Marketing Science Review* 6: http://www.amsreview.org/articles/henning06-2001/pdf, accessed 24 May 2011.

**Le Roux B and Rouanet H** 2005 *Geometric Data Analysis: From Correspondence Analysis to Structural Data Analysis*. Dordrecht: Kluwer Academic Publishers.

**Litman BR** 1983 Predicting success of theatrical movies: an empirical study, *Journal of Popular Culture* 16 (4): 159-175.

**Nenadić O and Greenacre M** 2007 Correspondence analysis in R, with two- and three-dimensional graphics: the ca package, *Journal of Statistical Software* 20 (3), http://www.jstatsoft.org/v20/i03/paper, accessed 6 September 2011.

**Northern Alliance/Ipsos Media CT** 2011 *Opening Our Eyes: How Film Contributes to the Culture of the UK*, July 2011.

**Pusha S, Gudi R, and Noronha S** 2009 Polar classification with correspondence analysis for fault isolation, *Journal of Process Control* 19 (4): 656-663.