# Some notes on Cinemetrics III

This post was originally intended to be part of a thread on the discussion board of the Cinemetrics website, but for some reason it did not upload properly. This post presents my piece in full, but readers should refer to the original thread to get the preceding parts of the discussion.

The difference between the median/mean ratios for two films or two groups of films (e.g. silent films and sound films) can be explained by the presence of outliers in the data and the influence they have on the mean shot length. This can be demonstrated by looking at the shot length distributions for the two versions of Blackmail, The Lights of New York, and Scarlett Empress. (The data for these films can be found in the Cinemetrics database.)

Imagine you have two data sets that are identical except for a single value. For example,

A:  1, 2, 3, 4, 5, 6, 7, 8, 9, 10

B:  1, 2, 3, 4, 5, 6, 7, 8, 9, 20

For data set A, the median is 5.5, the mean is 5.5, and so the median/mean ratio is 5.5/5.5 = 1.0. For data set B, the median is 5.5, the mean is 6.5, and the median/mean ratio is 5.5/6.5 = 0.85. The changes in the mean and the median/mean ratio are due to the influence of a single outlying data point, and do not reflect the fact that the two data sets are otherwise identical.
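The arithmetic for the two toy data sets can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the function name `median_mean_ratio` is my own label, not something taken from Cinemetrics.

```python
from statistics import mean, median


def median_mean_ratio(values):
    """Return the median divided by the mean of a sequence of numbers."""
    return median(values) / mean(values)


# The two toy data sets from the text: identical except for the last value.
data_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]

for name, data in (("A", data_a), ("B", data_b)):
    print(name, median(data), mean(data), round(median_mean_ratio(data), 2))
```

The median is unchanged between the two sets, so the whole shift in the ratio comes from the mean.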

This is precisely what we see when we look at the two versions of Blackmail. In the table below, we have the mean shot length, the median shot length, and the ratio of the median to the mean.

| | Blackmail (silent) | Blackmail (sound) |
|---|---|---|
| Median shot length (s) | 5.6 | 5.1 |
| Mean shot length (s) | 8.1 | 10.4 |
| Median/mean | 0.69 | 0.49 |

Looking at the mean, we might think that the impact of sound technology was to lead to a change in style, with an increase in shot lengths (the difference in the means is 2.3 seconds). However, we know that shot length distributions are positively skewed with outlying data points, and that the mean is, therefore, problematic. The difference in the medians is small (only 0.5 seconds), indicating that no such change occurred. This conclusion is supported by a medians test, which shows no significant difference: p = 0.135. A more complete picture may be obtained by looking at the five number summary for each film.

| | Blackmail (silent) | Blackmail (sound) |
|---|---|---|
| Minimum shot length (s) | 0.1 | 0.1 |
| Lower quartile (s) | 2.9 | 2.5 |
| Median shot length (s) | 5.6 | 5.1 |
| Upper quartile (s) | 10.1 | 11.5 |
| Maximum shot length (s) | 104.3 | 144.6 |

Looking at this data, we would conclude that the difference between the styles of these two films occurs above the upper quartile – the difference is in the length of the outlying data points away from the mass of the data. The lower quartiles in the silent and sound versions are similar – each film has approximately the same proportion of shots less than or equal to 2.9s (25% and 29%, respectively). This is also the case for the medians: half of the shots in the silent version are less than or equal to 5.6 seconds, while half the shots in the sound version are less than or equal to 5.1 seconds. The difference between the upper quartiles is greater (1.4 seconds) but is still less than the difference between the means – where 75% of the shots in the sound version are less than or equal to 11.5 seconds, this proportion in the silent version is 79% for the same value. In fact by looking at the empirical cumulative distribution functions for both versions of Blackmail (see Figure 1) it is clear that they have almost identical distributions; and a 2-sample Kolmogorov-Smirnov test shows that there is no statistically significant difference for any shot length in the two versions of this film (D = 0.0666, p = 0.1881). (Note that the distribution functions in the graph below are empirical – i.e. they are the actual probability distributions of the shot length data from the Cinemetrics database, and they are not theoretical distributions).

Figure 1 The empirical cumulative distribution functions for the two versions of Blackmail (1929)
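For readers who want to see what the empirical cumulative distribution function and the Kolmogorov-Smirnov statistic D actually measure, here is a rough sketch in plain Python. The two short lists are made-up toy values, not the Cinemetrics shot lengths, and the function names are my own. The sketch computes D only; the p-value needs the sampling distribution of D, which a statistics package would normally supply.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the share of values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cumulative(x):
        return bisect_right(ordered, x) / n

    return cumulative


def ks_statistic(sample_1, sample_2):
    """Largest vertical gap between the two empirical distribution functions."""
    f1, f2 = ecdf(sample_1), ecdf(sample_2)
    # The gap can only change at an observed value, so checking those is enough.
    points = set(sample_1) | set(sample_2)
    return max(abs(f1(x) - f2(x)) for x in points)


# Hypothetical toy shot lengths (seconds), for illustration only.
toy_silent = [1.2, 2.9, 5.6, 10.1, 30.0]
toy_sound = [1.0, 2.5, 5.1, 11.5, 60.0]

print(ks_statistic(toy_silent, toy_sound))
```

Note that stretching the longest shot in `toy_sound` even further would not change D at all, which is why this comparison is not driven by the outliers.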

The only explanation for the difference in the means, and for the difference in the median/mean ratios, is the influence of the outliers on the mean. Using the mean – or any statistic based on the mean – will lead to incorrect conclusions. Using the mean, we might conclude that the shot lengths in the two versions of Blackmail show a statistically significant increase with the use of sound technology – but this would be wrong. As in the example data sets above, the difference we find in the median/mean ratios reflects the influence of these outlying data points, and does not accurately reflect the distribution of shots in the two versions of Blackmail. When we use measures of dispersion that are robust against outliers we do not see the large difference in the dispersion of shot lengths we would expect from the median/mean ratio. The median absolute deviation for the silent version is 3.3 seconds and for the sound version is 3.1 seconds; while the interquartile ranges are 7.2 and 9.0 seconds, respectively.
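To illustrate why these robust measures behave differently from the standard deviation, the sketch below applies them to the toy data sets A and B from earlier. Two assumptions are mine: the median absolute deviation is left unscaled (no normal-consistency constant), and the quartiles use the "inclusive" interpolation method from Python's `statistics` module. Other quartile conventions give slightly different values.

```python
from statistics import median, quantiles, stdev


def median_absolute_deviation(values):
    """Unscaled MAD: the median of the absolute deviations from the median."""
    centre = median(values)
    return median(abs(x - centre) for x in values)


def interquartile_range(values):
    """Upper quartile minus lower quartile (inclusive interpolation)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


data_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]

for name, data in (("A", data_a), ("B", data_b)):
    print(name, median_absolute_deviation(data), interquartile_range(data), stdev(data))
```

The single outlier in B inflates the standard deviation but leaves both robust measures where they were.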

Let us now look at the shape factors of the two versions of Blackmail. The standard deviation of the logarithms of the shot lengths for the silent version is 0.91; and the equivalent value for the sound version is 1.05. This suggests that the two distributions have different shapes – the sound version being more widely dispersed than the silent version. However, from the five number summary, the empirical cumulative distribution functions, and the robust measures of dispersion we can see that this is not the case. Therefore, we have two versions of the same film with shot length distributions that show no statistically significant difference, but with a large difference in the median/mean ratio and a corresponding difference in the lognormal shape factors. These differences are due to the influence of outlying data points, and do not accurately reflect the nature of the relationship between these two distributions.

A second example can be used to illustrate what happens when we have two films with the same mean but different median shot lengths. The mean shot length, the median shot length, and the ratio of the median to the mean for Lights of New York and Scarlett Empress are presented in the table below.

| | Lights of New York | Scarlett Empress |
|---|---|---|
| Median shot length (s) | 5.1 | 6.5 |
| Mean shot length (s) | 9.9 | 9.9 |
| Median/mean | 0.52 | 0.66 |

Looking at the mean shot lengths, we can see that they are identical and we might conclude that these films are cut equally quickly; but looking at the medians we can immediately see that there is a difference of 1.5 seconds, which alerts us to the possibility that Lights of New York is cut quicker than Scarlett Empress. A medians test tells us that there is, in fact, a statistically significant difference in the medians of these two films: p = 0.0007.
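One common form of medians test is Mood's median test, sketched below in plain Python. I am assuming this form purely for illustration; the sketch omits the continuity correction that some packages apply by default, so it will not necessarily reproduce the p-values quoted here. The function name and the example data are hypothetical.

```python
from math import erfc, sqrt
from statistics import median


def moods_median_test(sample_1, sample_2):
    """Chi-square test on counts above / not above the pooled median.

    Returns (chi2, p) with one degree of freedom and no continuity correction.
    Assumes both groups have values on each side of the pooled median overall.
    """
    grand_median = median(list(sample_1) + list(sample_2))
    above = [sum(x > grand_median for x in s) for s in (sample_1, sample_2)]
    sizes = [len(sample_1), len(sample_2)]
    not_above = [n - a for n, a in zip(sizes, above)]
    total = sum(sizes)

    chi2 = 0.0
    for row in (above, not_above):
        row_total = sum(row)
        for observed, column_total in zip(row, sizes):
            expected = row_total * column_total / total
            chi2 += (observed - expected) ** 2 / expected

    # Survival function of a chi-square variable with one degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical example: two groups with clearly different medians.
print(moods_median_test(list(range(1, 21)), list(range(21, 41))))
```

Because the test only counts how many shots fall above the pooled median, a handful of very long takes cannot move the result the way they move the mean.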

Now, the median/mean ratio is a crude measure of the dispersion of a skewed data set, and the smaller the value of this ratio the more dispersed the data (i.e. the greater the distance between the median and the mean). For symmetrical distributions the median and the mean are equal and the ratio is one; but as shot length distributions are positively skewed the mean will always be greater than the median and the ratio will always, therefore, be less than one. (The ratio is typically given in economics textbooks as the mean divided by the median [it is used as a measure of income inequality], but this is just the reciprocal of the median/mean ratio.) Clearly, the use of the mean as a measure of central tendency will lead us to an incorrect conclusion about the difference in style of Lights of New York and Scarlett Empress; but does it fare any better as an indicator of which film has shot lengths that are more dispersed?

According to the table above, Lights of New York has a smaller median/mean ratio (0.52) and so we would expect the shot lengths for this film to be more dispersed than those of Scarlett Empress (0.66). The standard deviation for Lights of New York is 14.5 seconds, and for Scarlett Empress it is 9.6 seconds – again indicating that the former is more dispersed than the latter. (The lognormal shape factors for these two films are 0.93 and 0.88, respectively). However, when we look at the median absolute deviation and the interquartile ranges we get a different picture.  For both statistics, it is evident that Scarlett Empress is, in fact, more dispersed.

| | Lights of New York | Scarlett Empress |
|---|---|---|
| Median absolute deviation (s) | 2.6 | 3.5 |
| Interquartile range (s) | 7.2 | 9.3 |

This can be easily seen when looking at the box plots of these two films (Figure 2). In the box plots, note that the interquartile range (the box) for Lights of New York is narrower than that for Scarlett Empress; and that the distance between the minimum shot length (the end of the error bar to the left) and the upper inner fence (the end of the error bar to the right of the box, at Q3 + (IQR × 1.5)) is less for the former (0.9 – 20.9 seconds) than it is for the latter (0.3 – 26.9 seconds). Anything above the upper inner fence (i.e. 20.9s and 26.9s, respectively) is classed as an outlier, while ‘very extreme’ values are those above Q3 + (IQR × 3).

Figure 2 Box plots for shot lengths in Lights of New York (1928) and Scarlett Empress (1934)
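The fence rule used in these box plots is simple to write down. The sketch below is a minimal illustration with my own function names. As the quartiles for Lights of New York and Scarlett Empress are not listed in this post, the example uses the quartiles given earlier for the silent version of Blackmail (2.9s and 10.1s). Only the upper fences are computed, since that is where the outliers lie in positively skewed shot length data.

```python
def upper_fences(q1, q3):
    """Return (inner, outer) upper fences: Q3 + 1.5*IQR and Q3 + 3*IQR."""
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr


def classify(shot_length, q1, q3):
    """Label a shot length relative to the upper fences."""
    inner, outer = upper_fences(q1, q3)
    if shot_length > outer:
        return "very extreme"
    if shot_length > inner:
        return "outlier"
    return "within fences"


# Quartiles of the silent Blackmail from the five number summary above.
q1, q3 = 2.9, 10.1
inner, outer = upper_fences(q1, q3)
print(round(inner, 1), round(outer, 1))
print(classify(104.3, q1, q3))  # the maximum shot length of the silent version
```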

It is evident, therefore, that (1) shot lengths in Scarlett Empress are more dispersed than those of Lights of New York; and (2) the mean shot length, the median/mean ratio, and the standard deviation give misleading results because of the influence of the outlying data points in Lights of New York (which account for only 9.76% of this film’s data).

Is the median/mean ratio still useful for estimating the median for these four films? For the silent version of Blackmail the estimated median based on the shape factor given above is 5.3 seconds; and for the sound version it is 5.9 seconds. Thus, the first estimate is good with an error of 0.3 seconds (or 4.7%); while the second estimate is less good and is out by 0.8 seconds (16.1%). For Lights of New York, the estimate of the median is 6.4 seconds – an error of 1.3 seconds or 25.1%! The estimate for Scarlett Empress is much better at 6.7 seconds, and is out by only 0.2 seconds (3.7%). We can see, therefore, that estimating the median as mean/exp(0.5 × σ^2), where σ is the lognormal shape factor, may produce very good estimates but may also produce very bad ones.
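These estimates can be recomputed from the rounded means and shape factors quoted in this post. The sketch below does so. Because its inputs are rounded to one or two decimal places, the results may differ slightly in the last digit from the figures above, which were presumably worked out from the unrounded data. The function name is my own.

```python
from math import exp


def lognormal_median_estimate(mean_shot_length, shape_factor):
    """Estimate the median as mean / exp(0.5 * sigma^2) under a lognormal model."""
    return mean_shot_length / exp(0.5 * shape_factor ** 2)


# (mean, shape factor, observed median) as quoted in the text, all rounded.
films = {
    "Blackmail (silent)": (8.1, 0.91, 5.6),
    "Blackmail (sound)": (10.4, 1.05, 5.1),
    "Lights of New York": (9.9, 0.93, 5.1),
    "Scarlett Empress": (9.9, 0.88, 6.5),
}

for title, (mean_sl, sigma, observed) in films.items():
    estimate = lognormal_median_estimate(mean_sl, sigma)
    print(f"{title}: estimate {estimate:.2f}s, observed {observed}s, "
          f"error {abs(estimate - observed):.2f}s")
```

The estimate depends on the data only through the mean and the shape factor, both of which are sensitive to outliers, so it inherits their problems.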

The mean is not a robust statistic, and is vulnerable to two factors: the presence of outliers in a data set and the asymmetry of a data set. Unfortunately, these are precisely the characteristics of the distribution of shot lengths in a motion picture. Any value calculated using the mean (e.g. the standard deviation, the median/mean ratio) will not accurately reflect the style of a film due to the impact of outlying data points on the mean. Use of the mean will, therefore, lead us to make a range of incorrect conclusions.

• In the case of the example of Blackmail, we would have incorrectly concluded that there is a difference between the shot lengths of the two versions of this film, when in fact there is no such difference.
• In the example of Lights of New York and Scarlett Empress, we would have incorrectly concluded that there is not a difference between the shot lengths of these two films, when in fact there is such a difference.
• In the example of Lights of New York and Scarlett Empress, we would have incorrectly concluded that the shot lengths of the former are more widely dispersed than in the latter, when in fact the opposite is true.
• Using the mean may produce either very good or very bad estimates of the median. Simply relying on this method to lead us to reliable conclusions will not work: if we used the estimate of the median for the sound version of Blackmail in a study we would be basing our analysis on a fundamental error.

The mean shot length is not a reliable statistic of film style. The median/mean ratio suffers from precisely the same problem that has always existed with the mean. It is just a different way of presenting it.

(I’m currently looking at the impact of sound on Hitchcock’s style and Blackmail in more detail, and I’ll put up a post on this subject at a later date. I began working on this piece a couple of months ago and the data I have been using (and the data referred to above) was submitted in 2006 by Isobel Walker. Charles O’Brien has recently submitted new data for the sound version of Blackmail, but I have not looked at this in detail yet).