# Some notes on cinemetrics IV

In the 1970s, Barry Salt proposed that the mean shot length could be used to describe and compare the style of motion pictures. Many other scholars have followed him, and average shot lengths are now commonly cited in film studies texts. Unfortunately, a worse choice of a statistic of film style could not have been made – shot lengths are not normally distributed and the mean does not accurately locate the middle of the data. This means that a large part of film studies research is utterly useless because it is based on an elementary methodological mistake that could have been avoided with only a middle school maths education. Quite simply, the mean is not an appropriate measure of location for a skewed dataset with a number of outliers. It never has been; it never will be; and quoting it as a statistic of film style leads to fundamentally flawed inferences about film style, as can be seen here.

This does not mean that Salt has decided to give up on the mean shot length. He has subsequently asserted – but not proven – that shot lengths are lognormally distributed, and that the mean shot length should be retained because the ratio of the mean shot length to the median shot length can be used to derive the shape factor of a lognormal distribution that adequately describes the distribution of shot lengths in a motion picture. (Actually Salt refers to the median-to-mean ratio, but this is just a different way of writing the same information – each ratio is the reciprocal of the other. For convenience in later calculations I refer only to the mean-to-median ratio.) The ratio of the mean to the median is a measure of the skew of a dataset – symmetrical distributions have a ratio of approximately 1 – and is used widely in economics to represent imbalances in income. *If a distribution is lognormal*, there is a relation between the mean-to-median ratio and the shape factor of a lognormal distribution. As I have shown elsewhere on this blog, the assumption of lognormality is not justified – applying a normality test to the log-transformed data, I have found that the null hypothesis of lognormality is rejected in between 50% and 80% of cases. The proportion of silent films for which this null hypothesis is rejected appears to be greater than the proportion of sound films.
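For reference, the relation in question follows from the standard moments of the lognormal distribution. The sketch below writes E[X] for the arithmetic mean shot length, to keep it distinct from μ, the mean of the log-transformed data (this notation is introduced here for clarity), and it assumes that LN[X] ~ N(μ, σ^{2}) holds exactly:

```latex
\begin{aligned}
\mathrm{E}[X] &= \exp\!\left(\mu + \tfrac{\sigma^{2}}{2}\right),
\qquad
\operatorname{med}(X) = \exp(\mu), \\[4pt]
\frac{\mathrm{E}[X]}{\operatorname{med}(X)} &= \exp\!\left(\tfrac{\sigma^{2}}{2}\right)
\quad\Longrightarrow\quad
\sigma = \sqrt{2\,\ln\!\left(\frac{\mathrm{E}[X]}{\operatorname{med}(X)}\right)}.
\end{aligned}
```

Every step depends on the distributional assumption: if the data are not lognormal, there is no reason for the final identity to hold.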

Undeterred, Salt persists with the assertion that shot lengths are lognormally distributed and has cooked up a new scheme to justify this assertion by arguing that titles should be removed from the shot length data of silent films and then analysed as being lognormal. No suggestion is made regarding the seemingly large proportion of sound films that also do not appear to be lognormally distributed. As is typical in Salt’s work, this argument is simply asserted as being true without any methodological justification and – as we shall see – some dubious evidence.

What is the methodological justification for removing the titles from the shot length data? A possible reason for removing this data is that the titles are not original and have been updated so that they no longer accurately reflect the original structure of the film. However, the fact that the titles may not be original does not automatically mean that the titles are inaccurate or that their time on screen is not an accurate reflection of the original tempo of the film. It may be that a conservator has meticulously restored the film and respected the way the film was originally put together. We should certainly feel free to include the titles in the data if they are known to be properly restored, are based directly on the original film, or are reasonable approximations based on documentary evidence for the film’s production, historical context, etc. Salt’s suggestion appears to be a blanket ban on all titles in shot length data for silent films, but this would rule out much otherwise useful data. A further complication appears in the memoir of the projectionist Louis J Mannix (whom I discussed in an earlier post), who noted that it was a practice of projectionists to slow the film when a title came onto the screen for the ease of reading by the audience – there is nothing we can do as statisticians to control for this type of situation-specific variability, but it is very interesting as film history. The use of titles is certainly a methodological concern for analysts of film style, and it does need to be discussed as part of the methodology of the statistical analysis of film style. This would, however, mean going beyond mere assertion.

Salt’s method involves linking two shots that were previously separated by a title into a single shot, but again there is no methodological justification for this. The decision to put a title in the middle of a shot is itself an aesthetic decision by the filmmakers for the purposes of narrative communication, and should be respected as such. If we combine the shots in the manner Salt suggests, can the data be said to reflect the film as it was made? The tempo of the film is changed, and we can no longer make any direct comparison between silent films, or between silent films and sound films. Salt also states that the resulting analysis will provide very different results if the shots are not combined in this way, but he does not say why we should prefer his method over the alternative of not combining the shots.

Separating titles from the rest of the shot length data for a film is not in itself a bad idea – it would allow us to look more closely at how a film was put together, and to make inferences about how audiences understand silent films or text on screen in general. However, Salt appears to want to remove this data to make it fit a lognormal distribution, and that is a bad idea. It is back to front: the transformation of the data is suggested to make it fit a preconceived theoretical distribution, even though there is no evidence that this assumption is justified in general. If the method of combining shots is to be preferred to not combining them for the purpose of generating a better lognormal fit, then this is clearly problematic. In the absence of a proper methodological basis, this smacks of both desperation and data manipulation. Nonetheless, Salt has stated that this approach can be termed ‘experimental film analysis’ similar to experimental archaeology. The whole thing can be read here.

*Little Annie Rooney* has been held up as an example of how the fit to a lognormal distribution is improved after removing the titles. The data for this film (without titles) is here. However, closer examination of the data reveals that the mean-to-median ratio leads to a poor estimate of the shape factor and provides a substantially poorer fit than the maximum likelihood estimates (MLE). Recalling that a random variable X (such as the length of a shot) is lognormally distributed if its logarithm is normally distributed, Figure 1 presents the histogram of the shot length data transformed using the natural logarithm and three density estimates.

**Figure 1** Density estimation of shot lengths for *Little Annie Rooney* (minus titles)

The red curve is the kernel density estimate, using an Epanechnikov kernel and a bandwidth of 0.5, and is a nonparametric density estimate that makes no assumption about the shape of the distribution and depends on the data alone. This is the empirical distribution of the log-transformed data, and is used as a part of exploratory data analysis. From the histogram and the kernel density estimate we can see that even after the data has been log-transformed there is still some skewness and a heavy upper tail. We should therefore be sceptical about the assertion that this data is lognormally distributed. (For a kernel density calculator see here).
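The kernel density estimate described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the calculator used to produce Figure 1; the function names and the example shot lengths are invented for the sketch, and only the Epanechnikov kernel and the bandwidth of 0.5 come from the text.

```python
import math

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde_log_shot_lengths(shot_lengths, grid, bandwidth=0.5):
    """Kernel density estimate of the log-transformed shot lengths.

    shot_lengths: positive shot lengths in seconds
    grid: points (on the log scale) at which to evaluate the density
    """
    logs = [math.log(x) for x in shot_lengths]
    n = len(logs)
    return [
        sum(epanechnikov((g - v) / bandwidth) for v in logs) / (n * bandwidth)
        for g in grid
    ]

# Illustrative shot lengths in seconds, invented for this sketch
# (they are NOT taken from any film discussed here).
example = [0.8, 1.5, 2.1, 2.9, 3.4, 4.0, 5.2, 7.7, 12.3, 30.1]
grid = [i / 10 for i in range(-10, 41)]  # log scale from -1.0 to 4.0
for g, d in zip(grid[::10], kde_log_shot_lengths(example, grid)[::10]):
    print(f"log length {g:4.1f}: density {d:.4f}")
```

Because the estimate is built from the data alone, it can be laid over any fitted normal curve to see where a parametric model departs from the log-transformed shot lengths.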

The black curve is the normal distribution specified by the maximum likelihood estimators of the log-transformed shot lengths – i.e. the mean (μ) and standard deviation (σ) of the logarithms of the shot lengths. (Note that μ is the arithmetic mean of the log-transformed data and the geometric mean of the data in its original scale). For this data, μ = 1.2078 and σ = 0.7304. The probability plot correlation coefficient (PPCC) using a Blom plotting position is 0.9776 and the null hypothesis that the data (n = 1066) is lognormally distributed is rejected for α = 0.05. Figure 2 is the normal probability plot for this data with the parameters of the black curve. (Recall that if the lognormal distribution is a good fit, the data will lie along the red line).

**Figure 2** Normal probability plot for *Little Annie Rooney* (minus titles): LN[X]~N(1.2078, 0.7304)
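For readers who want to repeat the calculation, the two steps described above (maximum likelihood estimates of the log-transformed data, then the probability plot correlation coefficient with a Blom plotting position) can be sketched as follows. This is a minimal standard-library sketch, not the software used for the figures; the function names and the example data are invented, and the critical value needed to complete the test has to be taken from published tables or simulated separately.

```python
import math
from statistics import NormalDist

def log_mle(shot_lengths):
    """Maximum likelihood estimates (mu, sigma) of the log-transformed data.

    The MLE of sigma divides by n, not n - 1.
    """
    logs = [math.log(x) for x in shot_lengths]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def blom_ppcc(shot_lengths):
    """Correlation between the ordered log data and standard normal
    quantiles at Blom plotting positions p_i = (i - 0.375) / (n + 0.25)."""
    logs = sorted(math.log(x) for x in shot_lengths)
    n = len(logs)
    std_normal = NormalDist()
    quantiles = [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mean_x = sum(logs) / n
    mean_q = sum(quantiles) / n
    sxy = sum((x - mean_x) * (q - mean_q) for x, q in zip(logs, quantiles))
    sxx = sum((x - mean_x) ** 2 for x in logs)
    sqq = sum((q - mean_q) ** 2 for q in quantiles)
    return sxy / math.sqrt(sxx * sqq)

# Illustrative shot lengths in seconds, invented for this sketch
# (they are NOT the Little Annie Rooney data).
example = [0.8, 1.5, 2.1, 2.9, 3.4, 4.0, 5.2, 7.7, 12.3, 30.1]
mu, sigma = log_mle(example)
print(f"mu = {mu:.4f}, sigma = {sigma:.4f}, PPCC = {blom_ppcc(example):.4f}")
```

A coefficient close to 1 means the ordered log data lie close to a straight line on the normal probability plot; the null hypothesis is rejected when the coefficient falls below the critical value for the sample size and α in use.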

The green curve in Figure 1 is the normal distribution defined if we take the median shot length and the estimate of σ derived from the mean-to-median ratio, as Salt recommends. According to Salt, the mean-to-median ratio for shot length data is equal to the exponential of half the variance of the log-transformed data (μ/med = exp (σ^{2}/2), with μ here standing for the mean shot length rather than the mean of the logarithms), and from this we can estimate σ. As we know the value of σ is 0.7304, this can be tested for *Little Annie Rooney*. The mean-to-median ratio for this film is 4.6/2.9 = 1.5862 and exp (0.7304^{2}/2) = 1.3057. The mean-to-median ratio overestimates the true value by 21.5%. Inevitably, this leads to a poor estimate of σ: if μ/med = exp (σ*^{2}/2) then σ* = √ (2 × LN (μ/med)), and for *Little Annie Rooney* (minus titles) this produces an estimate of σ* = 0.9606. (It is perhaps not clear from the font used here, but √ is ‘square root’). The estimated value of the shape factor is greater than its MLE value by 31.5%. Looking at the density of LN[X]~N(1.0647, 0.9606) in Figure 1, where 1.0647 is the logarithm of the median shot length of 2.9, we can see that it provides a better fit to the upper tail of the data and is very close to the kernel density estimate. At the same time, it provides a very poor fit below the median, and is actually worse than the MLE parameters. This can be seen more clearly by looking at Figure 3, which is the normal probability plot assuming LN[X]~N(1.0647, 0.9606). (This already poor fit can be made worse by substituting μ for the median).

**Figure 3** Normal probability plot for *Little Annie Rooney* (minus titles) LN[X]~N(1.0647, 0.9606)
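The arithmetic behind these figures is easy to check. The sketch below reproduces it from the summary values quoted in the text (mean 4.6, median 2.9, MLE σ = 0.7304); the variable and function names are invented for the illustration.

```python
import math

def sigma_from_ratio(mean, median):
    """Shape factor implied by the mean-to-median ratio, valid only
    IF the data are lognormal: sigma* = sqrt(2 * ln(mean / median))."""
    return math.sqrt(2.0 * math.log(mean / median))

# Summary values for Little Annie Rooney (minus titles) as quoted in the text.
mean_sl, median_sl = 4.6, 2.9
sigma_mle = 0.7304

ratio = mean_sl / median_sl                # observed mean-to-median ratio
implied = math.exp(sigma_mle ** 2 / 2.0)   # ratio a lognormal with the MLE sigma would have
sigma_star = sigma_from_ratio(mean_sl, median_sl)

print(f"mean/median         = {ratio:.4f}")                      # 1.5862
print(f"exp(sigma^2 / 2)    = {implied:.4f}")                    # 1.3057
print(f"overestimate        = {ratio / implied - 1:.1%}")        # 21.5%
print(f"sigma*              = {sigma_star:.4f}")                 # 0.9606
print(f"sigma* vs MLE sigma = {sigma_star / sigma_mle - 1:.1%}") # 31.5%
print(f"LN(median)          = {math.log(median_sl):.4f}")        # 1.0647
```

The same few lines can be pointed at the mean, median, and σ of any other film to see how far the ratio-based estimate drifts from the maximum likelihood value.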

From this we can conclude that (1) the shot length data for *Little Annie Rooney* (minus titles) is not lognormally distributed; (2) the mean-to-median ratio does not equal exp (σ^{2}/2); and (3) using the mean-to-median ratio to derive σ* provides a very poor estimate of the shape factor. (Conclusion 1 should also lead us to question the method by which Salt claims to measure goodness of fit).

This same process cannot be applied to the shot length data available on the Cinemetrics website for *Little Annie Rooney* with titles, as this data includes a shot length (presumably rounded down) of 0.0 seconds. (The logarithm of X ≤ 0 does not exist). This shot length does not appear in the data after the titles have been removed, and I find it hard to believe that this film had a title card that was present on screen for less than 0.05 of a second. The accuracy of this data with or without titles is questionable.

If we examine the shot length distributions of the silent short films of Laurel and Hardy (both with and without titles) we again find that (1) the assumption of lognormality is not justified, (2) the mean-to-median ratio does not provide reasonable estimates of exp (σ^{2}/2), and (3) σ* does not provide reasonable estimates of σ.

Calculating the probability plot correlation coefficient for these films with titles using a Blom plotting position and α = 0.05, the null hypothesis that the data is lognormally distributed is rejected for 10 of the 12 films. Repeating this process with the titles removed, the null hypothesis is rejected for 11 films. (Recall that a statistical hypothesis test is a test of plausibility of the null hypothesis for a given set of data – failure to reject the null hypothesis indicates only that there is insufficient evidence to reject [and does *not* prove] H_{0}). These results are presented in Table 1. The assumption of lognormality is not justified and removing the titles from the data does not affect this conclusion.

**Table 1** Probability plot correlation coefficient for the silent films of Laurel and Hardy with and without titles
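To complete a test of this kind, the observed coefficient has to be compared with a critical value for the sample size in question. The text does not say how the critical values were obtained, so the following is only one possible approach: a Monte Carlo sketch that approximates the lower-tail critical value by simulating normal samples. The function names, the number of replications, and the seed are all choices made for this illustration.

```python
import math
import random
from statistics import NormalDist

def blom_quantiles(n):
    """Standard normal quantiles at Blom plotting positions (i - 0.375) / (n + 0.25)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulated_critical_value(n, alpha=0.05, replications=2000, seed=1):
    """Approximate the lower alpha quantile of the PPCC under normality.

    The coefficient is unchanged by shifting or rescaling the data, so
    standard normal samples are enough whatever mu and sigma may be.
    """
    rng = random.Random(seed)
    quantiles = blom_quantiles(n)
    coefficients = sorted(
        pearson(sorted(rng.gauss(0.0, 1.0) for _ in range(n)), quantiles)
        for _ in range(replications)
    )
    return coefficients[int(alpha * replications)]

# Example: approximate 5% critical value for a sample of 50 shots.
print(f"approximate critical value: {simulated_critical_value(50, replications=500):.4f}")
```

The null hypothesis of lognormality is rejected when the coefficient calculated from the log-transformed shot lengths falls below this value. More replications give a steadier estimate at the cost of run time.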

Table 2 includes the mean, median, and the standard deviation of the log-transformed data (σ). Using this information, we can test Salt’s other claims regarding the mean-to-median ratio. (Actually this is all rather redundant as we already know that lognormality is not a plausible model for this data). *Early to Bed* was excluded from this part of the study as the log-transformed data exhibits bimodality.

**Table 2** Mean, median, and σ for the silent films of Laurel and Hardy with and without titles

First, let us ask if the mean-to-median ratio is equal to exp (σ^{2}/2) for these films. The results are presented in Table 3, and it is immediately clear that for only two films – *The Second Hundred Years* and *Angora Love* – does μ/med provide a reasonable estimate of exp (σ^{2}/2) when we include the titles in the data, and the PPCC test failed to reject the null hypothesis of lognormality for both these films. For every other film, μ/med overestimates the true value by ~10% or more. Once the titles are removed, we do not get the improvement Salt claims will be evident by censoring the data in this way. Generally, the change in the estimate once the titles are removed is small, although both *The Second Hundred Years* and *Angora Love* show much larger errors after the data has been censored due to an increase in the skew of the data.

**Table 3** Mean-to-median ratio and exp (σ^{2}/2) for the silent films of Laurel and Hardy with and without titles

Table 4 presents the maximum likelihood estimate of σ and the estimate derived by using σ* = √ (2 × LN (μ/med)) for the Laurel and Hardy films, both with and without titles. For the shot length data including titles σ* provides a poor estimate for those films that rejected the null hypothesis of lognormality in the PPCC test, and consistently overestimates σ by at least 12%. Again this is not surprising, as μ/med = exp (σ^{2}/2) is only valid if the data are lognormal, which is not the case here. Turning to the shot length data after the titles have been excluded, we see that σ* is a poor estimate of σ for all the films in the sample.

**Table 4** σ and σ* for the silent films of Laurel and Hardy with and without titles
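The point that μ/med = exp (σ^{2}/2) holds only under lognormality can be illustrated by simulation. The sketch below compares σ* with the maximum likelihood σ for two synthetic samples: one drawn from a lognormal distribution, and one drawn from a mixture of two lognormals that gives the log data a heavy upper tail. Both samples, their parameters, and the function names are invented for this illustration and are not a model of any film discussed here.

```python
import math
import random
import statistics

def sigma_star(xs):
    """Shape factor implied by the mean-to-median ratio.
    Requires mean > median, otherwise the logarithm is not positive."""
    return math.sqrt(2.0 * math.log(statistics.fmean(xs) / statistics.median(xs)))

def sigma_mle(xs):
    """Maximum likelihood estimate: population standard deviation of the logs."""
    return statistics.pstdev([math.log(x) for x in xs])

rng = random.Random(2010)
n = 50_000

# Sample 1: lognormal by construction, so sigma* should be close to sigma.
lognormal = [rng.lognormvariate(1.2, 0.7) for _ in range(n)]

# Sample 2: synthetic mixture (90% short shots, 10% much longer shots),
# which is NOT lognormal.
mixture = [
    rng.lognormvariate(1.2, 0.5) if rng.random() < 0.9 else rng.lognormvariate(3.0, 0.5)
    for _ in range(n)
]

for name, sample in (("lognormal", lognormal), ("mixture", mixture)):
    s, s_star = sigma_mle(sample), sigma_star(sample)
    print(f"{name:9s}: sigma = {s:.3f}, sigma* = {s_star:.3f}, error = {s_star / s - 1:+.1%}")
```

When the data are lognormal the two estimates agree closely; once the upper tail is heavier than a lognormal allows, the mean is pulled upwards while the median barely moves, and σ* overestimates σ.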

From these results we can conclude that:

- The methodological justification for removing titles from the shot length data of silent films is obscure, and lacks a theoretical basis.
- There is no evidence to justify the assumption that shot length data is lognormally distributed.
- There is no evidence that removing the titles from silent films will improve the fit to a lognormal distribution, and may in fact produce a poorer fit.
- The mean-to-median ratio does not provide a good estimate of exp (σ^{2}/2).
- Using the mean-to-median ratio to estimate the shape factor does not provide reliable results.

In other words, the approach suggested by Salt is wrong in every possible way.

Do not take my word for it. Do not blindly accept what someone tells you with scientific sounding words no matter how confident they sound. Learn to do it for yourself – it really is not that difficult to pick up enough statistics to be able to properly evaluate a research paper. Get some data and do your own testing. If you still get stuck then ask a statistician.

If you want to repeat the Laurel and Hardy tests performed above, I have added a spreadsheet to the Laurel and Hardy post (here) that includes the data with titles indicated.

Posted on November 25, 2010, in Cinemetrics, Film Analysis, Film Studies, Film Style, Laurel and Hardy, Silent cinema, Statistics.

Great post, thanks ! Am not at all up to speed on the statistics (any hints for a quick intro?) but what you write makes me even more wary of all statistical analysis of silent films. My own research into the audience experience of those films in 1920s exhibition has given me so many examples of ways that silent films were changed, censored, slowed down, accelerated, re-edited, interrupted, and so on, when exhibited, that I’ve become a complete non-believer of the notion of a “text” that would be stable enough for researchers to analyze statistically — at least if audience experience is the goal. You may have written about this already somewhere else on your blog (just stumbled upon it today :-)), but I was just wondering, on a more general note, what you were making of this textual instability in your own statistical research? Any hope of salvaging the statistical method at all?
