# Measures of central tendency in Cinemetrics

Barry Salt has made several assertions about the nature of film style and the use of statistical methods in the analysis thereof. Chief among these are:

- That the mean shot length accurately describes the distribution of shot lengths in a motion picture.
- That the lognormal distribution is a ‘ruling distribution’ of film style.
- That a shape factor of 0.9 is characteristic of shot length distributions, or that it lies in the interval 0.7 to 0.9.
- The median shot length can be estimated as either equal to mean * 0.6 or equal to mean / *exp*(0.5*σ*²), where *σ* is the shape factor.

Here I subject each of the above claims to scrutiny by testing them against specific data sets for defined populations, and I examine the methodology proposed by Salt in detail.

## The mean shot length as a statistic of film style

Salt (1974) proposed that the mean shot length be used as a statistic of film style for shot length distributions. However, the distribution of shot lengths in a motion picture is typically characterised by two features: (1) it is positively skewed, and (2) there are a number of outlying data points that are far from the mean. (I say typically because there is no reason why the shot lengths of a film could not be distributed normally or negatively skewed, but I have not come across such a film). While the mean is the best measure of location for distributions that are symmetrical or near-symmetrical, it is a poor statistic when the data is asymmetrical (i.e. the data is skewed). *The mean is not a robust statistic*. When we say that a statistic is ‘robust,’ we mean that it is not influenced by data points that are very different from all the rest (outliers). The mean is very sensitive to such outliers and this can pull the mean away from the centre of the data creating a skewed data set: it has an asymptotic breakdown point of 0.0. The asymptotic breakdown point is a measure of the proportion of the data that can be given arbitrary values before the statistic becomes arbitrarily bad (Geyer 2006). So for the mean, the proportion of outlying data points in a sample that the mean can cope with is zero – just a single outlier can wreck the mean as a measure of central tendency, and in the distribution of shot lengths in a motion picture we can expect to find many outliers.

The asymmetrical nature of shot length data also limits the type of statistical tests that can be employed in statistical analysis [1]. Many tests require the assumption (amongst others) that the data is normally distributed. Tests which require an assumption to be made about the underlying probability distribution of the data are called *parametric tests*. However, for skewed data sets with a number of outliers, the assumption of normality does not hold, and the results of employing parametric tests will be unreliable. As no data is actually distributed normally, small deviations from the true normal distribution can be tolerated, but when we are dealing with shot length distributions we find that the deviations from normality are very large.

These problems may be overcome by using the median shot length as a statistic of film style. The median locates the middle of a data set by dividing the data in two, so that half the data is equal to or less than the median and half is equal to or greater than the median. The median locates the centre of any data set irrespective of shape, and is a more robust statistic than the mean: it has an asymptotic breakdown point of 0.5.
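The difference in robustness is easy to see with a toy example. The following is a minimal sketch using hypothetical shot lengths (they are not taken from any film): a single very long take is added to a short list, and the mean and median are recomputed.

```python
from statistics import mean, median

# Hypothetical shot lengths in seconds, for illustration only.
shots = [2.1, 3.4, 3.9, 4.6, 5.2, 6.0, 7.3]

# The same data with one very long take added as an outlier.
with_outlier = shots + [120.0]

print(f"without outlier: mean = {mean(shots):.2f}, median = {median(shots):.2f}")
print(f"with outlier:    mean = {mean(with_outlier):.2f}, median = {median(with_outlier):.2f}")
```

The single outlier drags the mean far above every other shot in the list, while the median barely moves.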

As we will be unable to use parametric statistical tests to analyse the data, we can turn to *nonparametric tests* which do not require the same assumptions to be met. Such tests, for example, make no assumption about the distribution of the data, and are often referred to as *distribution-free* tests. Thus, the two-sample independent *t*-test requires that the data in both samples be normally distributed; this is not the case for its nonparametric equivalent, the Mann-Whitney *U*-test. This does not mean, however, that nonparametric statistics are assumption-free tests, and it is still necessary to make sure that the assumptions of each test can be met (e.g. that data is independently and identically distributed). As they require fewer assumptions about the nature of the data, nonparametric tests are less powerful than their parametric equivalents, but when the requirements for the parametric tests cannot be met they provide a much better alternative than no analysis at all.

The mean shot length should never have been suggested as a statistic of film style. The median is a far superior statistic given the skewed nature of shot length distributions, and analysing film style using this measure of central tendency will provide results that are far more reliable than if we used the mean.

## Do motion pictures have lognormal shot length distributions?

An alternative to using the median shot length is to apply a transformation to the data in order to remove the skew. Such transformations include raising the value of a data point to a power or taking the reciprocal of a data point, and there are many others. A common transformation employed is to take the logarithm of a data point, which may produce a lognormal distribution. *A random variable (X) is said to be lognormally distributed if its logarithm (log [X]) is normally distributed*. If the shot length data for a film is lognormally distributed then we have an advantage over the median shot length, as knowing the underlying probability distribution will allow us to use the more powerful parametric statistical tests. However, just as the parametric tests are unreliable when the assumption of normality is not met, the same is true if the assumption of lognormality is not met.

### The lognormal distribution and the mean shot length

Salt proposes that (1) the mean shot length is the appropriate statistic of film style and that (2) the shot lengths of a motion picture are distributed lognormally (2006: 389-396). However, these are conflicting claims. If the mean shot length was a reliable statistic of film style then we would not need the lognormal distribution: the reason we apply a logarithmic transformation to the data is because the mean does not provide a robust measure of the central tendency of the data in its original scale. Applying a logarithmic transformation to data allows us to recover the symmetry of the data, which we would not need to do if it were already symmetrical. To be specific, it is the *arithmetic mean shot length* that is incompatible with the lognormal distribution. The arithmetic mean is what we normally refer to when we say ‘mean:’ it is the sum of the data points divided by the number of data points in the sample. The arithmetic mean is NOT the measure of central tendency of the lognormal distribution. If we wish to locate the centre of the lognormal distribution then we use the *geometric mean*.

Logarithms are useful because they make complicated procedures like multiplication into simple operations like addition. For example, instead of multiplying two numbers together we can simply add their logarithms and by transforming the result back into the original scale we have our answer:

a * b = c ,

is the same as

log (a) + log (b) = log (c).

It does not matter which logarithm you use so long as you are consistent and use the correct method to back-transform the result. The two main logarithmic transformations are the common logarithm (log_{10} [*X*]), which uses base 10; and the natural logarithm (ln [*X*]), which uses base *e*.

Now, if we transform the length of each shot in a film (*X*) into its logarithm (log [*X*]), we can calculate its average in the usual way – that is, we add up all the logarithms and divide by the number of shots. Transforming the average of the logarithms back into the original scale will give us the geometric mean. In the original scale, this is the equivalent of multiplying all the shot lengths together and then taking the *n*th root, where *n* is the number of shots in the film. As we are using multiplication instead of addition, the geometric mean of lognormally distributed data is clearly going to be very different from the arithmetic mean. In fact, the geometric mean of a data set will always be less than the arithmetic mean (unless all the data points are equal).
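The procedure can be sketched in a few lines of Python. The shot lengths below are hypothetical and the function name `geometric_mean` is simply a label chosen for this illustration; the sketch averages the natural logarithms, back-transforms the result, and compares it with the *n*th root of the product and with the arithmetic mean.

```python
import math


def geometric_mean(xs):
    """Average the natural logarithms, then transform back to the original scale."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.0, 2.0, 4.0, 8.0]

gm = geometric_mean(shots)
nth_root = math.prod(shots) ** (1 / len(shots))  # multiply, then take the nth root
am = sum(shots) / len(shots)                     # arithmetic mean

print(f"geometric mean (via logs)  = {gm:.4f}")
print(f"nth root of the product    = {nth_root:.4f}")
print(f"arithmetic mean            = {am:.4f}")
```

The two routes to the geometric mean agree, and the result lies below the arithmetic mean because the shot lengths are not all equal.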

In Table 1, the shot length data (*X*) for *A Busy Day* (Charles Chaplin, 1914) is presented along with the natural logarithm of each shot (ln [*X*]). (This film has a shot length distribution that *is* lognormally distributed – Shapiro-Wilk: *w* = 0.9783, *p* = 0.6556). The film is 340.4 seconds long and includes 38 shots (once title and dialogue cards have been removed). The arithmetic mean shot length of *A Busy Day* is 340.4 divided by 38, and equals 9.0 seconds. The sum of the logarithms is 65.4763, and dividing this figure by 38 gives 1.7231. Transforming this figure back into the original scale gives us a geometric mean of 5.6 seconds.

**TABLE 1** Shot length data in the original scale (*X*) and its natural logarithm ln (*X*) for *A Busy Day* (1914)

One of the reasons people get confused with the lognormal distribution is because, unlike the normal distribution, its expected value (*E*[*X*]) and its measure of central tendency are not the same. (For the normal distribution, the expected value and the measure of central tendency are both the arithmetic mean). The expected value of a Lognormal distribution is equal to the exponential of the mean of the logarithms plus half their variance: *E*[*X*] = *exp*(*μ* + 0.5*σ*²). For *A Busy Day*, the mean of the logarithms (*μ*, the logarithm of the geometric mean) is 1.7231 and the variance of the logarithms (*σ*²) is 1.0259. If we add the mean of the logarithms to half the variance (1.7231 + 0.5129) we get 2.2360. Transforming this value back to the original scale (taking the exponential) we get 9.4 seconds, which is approximately equal to the arithmetic mean shot length, and we know this does not locate the centre of the shot length distribution for this film. For a Lognormal data set, the geometric mean will be approximately equal to the median. We know that the median will locate the centre of any data set, and for *A Busy Day* the median shot length is 4.9 seconds, which is much closer to the geometric mean than to the arithmetic mean. (Finding the median of the logarithms (1.5813) and then converting back to the original scale gives the median shot length. Using the median of the logarithms in place of *μ* gives a poorer estimate of the expected value (8.1 seconds) than using the mean of the logarithms). For a discussion see Olsson (2005).

The shot length data for *A Busy Day* is presented above, so you can do this for yourself. NB: I calculated the variance for this film using Microsoft Excel’s function for finding the population variance (=VARP(array), where the denominator is *n*), while using the sample variance function (=VAR(array), where the denominator is *n*-1) will give *E*[*X*] = 9.5 seconds [2]. It should also be noted that, while Excel 2007 has a function for calculating the geometric mean (=GEOMEAN(array)), this will only work for up to 255 data points (and for earlier versions of Excel considerably less). Transforming each shot length to its logarithm and then finding the average will work for any sample size in Excel.
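The arithmetic for *A Busy Day* can be checked from the summary figures quoted in the text alone. The sketch below uses only those reported values (38 shots, a mean of the logarithms of 1.7231, a population variance of 1.0259, and a median of the logarithms of 1.5813); the variable names are mine, and the sample variance is obtained by rescaling the population variance by *n*/(*n*-1) rather than from the raw data.

```python
import math

# Summary figures for A Busy Day as quoted in the text.
n = 38               # number of shots
mu = 1.7231          # mean of the natural logarithms
pop_var = 1.0259     # variance of the logarithms, denominator n
median_log = 1.5813  # median of the natural logarithms

# Sample variance (denominator n - 1) recovered from the population variance.
sample_var = pop_var * n / (n - 1)

geometric_mean = math.exp(mu)
expected_pop = math.exp(mu + 0.5 * pop_var)        # E[X] with the population variance
expected_sample = math.exp(mu + 0.5 * sample_var)  # E[X] with the sample variance
median_back = math.exp(median_log)
expected_from_median = math.exp(median_log + 0.5 * pop_var)

print(f"geometric mean                 = {geometric_mean:.1f} s")
print(f"E[X], population variance      = {expected_pop:.1f} s")
print(f"E[X], sample variance          = {expected_sample:.1f} s")
print(f"median (back-transformed)      = {median_back:.1f} s")
print(f"E[X] using median of the logs  = {expected_from_median:.1f} s")
```

To one decimal place these reproduce the figures given in the text: 5.6, 9.4, 9.5, 4.9 and 8.1 seconds respectively.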

It is clear that the arithmetic mean shot length should never have been proposed as a statistic of film style. If, as Salt claims, shot lengths are lognormally distributed resulting from the multiplication of independent factors, then it is necessary to use a multiplicative central tendency and this is the geometric mean shot length. It is necessary to choose between two claims: are we going to claim that the arithmetic mean shot length is the best statistic of film style or are we going to claim the shot lengths are lognormally distributed? Since we already know that the arithmetic mean is not a robust statistic for skewed data sets and we have already resorted to using the logarithms of the data, it would seem obvious to jettison the arithmetic mean and to use the geometric mean as our statistic – assuming, of course, that the shot lengths of a motion picture are lognormally distributed.

This would, of course, mean that every time the arithmetic mean shot length has been quoted as a statistic of film style, it is simply wrong.

### Is the assumption of lognormality justified for the distribution of shot lengths in a motion picture?

In Salt (2006: 389-396), the ‘generality of the Lognormal distribution for shot lengths in movies’ is asserted but not demonstrated. Examples of some films that are claimed to have Lognormal shot length distributions are featured alongside some films for which this claim is not made, but the extent to which these claims can be generalised is unclear. Salt admits that the sample in this study is not representative due to the presence of a number of films with very large mean shot lengths [3], and so on what is the claim that a lognormal distribution can be usefully employed to model shot length distributions based? We do not know from which population the sample is drawn or what the sample size is. A further problem is that it is not clear what Salt defines as a film in which the shot lengths are lognormally distributed. The coefficient of determination is presented as a measure of goodness-of-fit, but there is no decision rule stated as to what value of *R*² can be considered ‘good.’ We do not actually know from this if the lognormal distribution is reliable enough to use in the analysis of film style, because we do not know how common it is for films to have lognormal distributions. Despite these problems, Salt has since made a much stronger claim that the lognormal distribution is a ‘ruling distribution’ of film style [4]. This claim assumes that at least a majority of films will have shot lengths that are lognormally distributed, although this has not yet been demonstrated.

Other probability distributions have been used in modelling shot lengths. Fujita (1989), for example, surveyed 32 educational television programmes and found that an Exponential distribution provided a good fit for the shot lengths in 30 cases. The Weibull, Gamma, and Poisson distributions (amongst others) have all been proposed as the best model for the shot lengths of motion pictures (Cotsaces *et al*. 2009, Taskiran and Delp 2002, Truong and Venkatesh 2005, Vasconcelos and Lippman 2000). Indeed, Salt (1974) used the Poisson distribution to model shot lengths, and also found films in which this hypothesised distribution did not hold.

It is a simple matter to estimate the proportion of films with a Lognormal distribution, and this experiment is conducted below.

#### Sample

The samples used are the fifty films that I analysed earlier in my study of the impact of sound technology on the median shot lengths in Hollywood cinema. These films are divided into two samples: silent films produced between 1920 and 1928 inclusive (n = 20); and sound films produced from 1929 to 1931 inclusive (n = 30). The descriptive statistics for each film can be found by referring to my earlier paper.

#### Method

The shot lengths of each motion picture in the samples are transformed into their natural logarithms. The lognormality of the data is then tested using the probability plot correlation coefficient (PPCC) employing a Blom plotting position, with a significance level of 0.05 (Looney and Gulledge 1985, see my earlier post on how to do this). Where the PPCC for a film was just under its critical value, the result is checked using a Shapiro-Wilk test (α = 0.05).
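For readers who want to see what the PPCC involves, the following is a minimal sketch and not the software used for the results reported here. It assumes Blom's plotting position in its usual form, (*i* − 0.375)/(*n* + 0.25), and computes the correlation between the ordered log shot lengths and the corresponding standard normal quantiles. The function names and the shot lengths are hypothetical, and the critical value for a given *n* must still be looked up in published tables; it is not computed here.

```python
import math
from statistics import NormalDist


def blom_normal_quantiles(n):
    """Standard normal quantiles at Blom's plotting positions (i - 3/8) / (n + 1/4)."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def ppcc_lognormal(shot_lengths):
    """Correlation between the ordered log shot lengths and the Blom normal quantiles."""
    logs = sorted(math.log(x) for x in shot_lengths)
    quantiles = blom_normal_quantiles(len(logs))
    mean_l = sum(logs) / len(logs)
    mean_q = sum(quantiles) / len(quantiles)
    s_lq = sum((l - mean_l) * (q - mean_q) for l, q in zip(logs, quantiles))
    s_ll = sum((l - mean_l) ** 2 for l in logs)
    s_qq = sum((q - mean_q) ** 2 for q in quantiles)
    return s_lq / math.sqrt(s_ll * s_qq)


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.2, 2.0, 2.4, 3.1, 3.8, 4.9, 5.5, 7.2, 9.6, 14.3, 21.0]
r = ppcc_lognormal(shots)
print(f"PPCC = {r:.4f}")
```

The hypothesis of lognormality is rejected when the PPCC falls below the tabulated critical value for the sample size and significance level in use.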

The proportion of films with lognormally distributed shot lengths is then calculated, along with an approximate 95% confidence interval using the adjusted Wald method (Agresti and Coull 1998). This will be our estimate of the proportion of films that have lognormal distributions for the populations from which the samples are drawn.
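The adjusted Wald interval is simple enough to compute by hand. The sketch below follows the usual Agresti and Coull adjustment, in which *z*²/2 successes and *z*² trials are added before the ordinary Wald formula is applied; the function name `adjusted_wald` is my own label, and the online calculator used for the results in this post may differ in minor details such as rounding.

```python
import math
from statistics import NormalDist


def adjusted_wald(successes, n, conf=0.95):
    """Approximate confidence interval for a proportion (Agresti-Coull adjustment)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    n_adj = n + z ** 2                        # add z^2 pseudo-trials
    p_adj = (successes + z ** 2 / 2) / n_adj  # add z^2 / 2 pseudo-successes
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Example: 6 films out of 20 meet some criterion.
lower, upper = adjusted_wald(6, 20)
print(f"6/20 = {6 / 20:.2f} ({lower:.2f}, {upper:.2f})")
```

For 6 successes in 20 trials this gives an interval of roughly 0.14 to 0.52, and for 13 in 30 roughly 0.27 to 0.61.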

Calculations were performed using Graphpad online calculators and PAST 1.89 (2009). The critical values for the PPCC can be accessed at the NIST website.

#### Results

The results of the PPCC test for the silent films are presented in Table 2, and for the sound films in Table 3. Only one film needed to be checked using the Shapiro-Wilk test: *Behind the Make-up* is not lognormally distributed (*w* = 0.9873, *p* = 0.0184).

**TABLE 2** Sample size and PPCC (α = 0.05) for silent films produced in Hollywood, 1920 to 1928 inclusive

**TABLE 3** Sample size and PPCC (α = 0.05) for sound films produced in Hollywood, 1929 to 1931 inclusive

Of the twenty silent films, six have Lognormal shot length distributions, and the proportion of silent films produced in Hollywood from 1920 to 1928 inclusive with a Lognormal distribution is estimated to be 0.30 (0.14, 0.52).

Of the thirty sound films, thirteen have Lognormal shot length distributions, and the proportion of sound films produced in Hollywood from 1929 to 1931 inclusive with a Lognormal distribution is estimated to be 0.43 (0.27, 0.61).

While some films do have shot lengths that are lognormally distributed, Salt’s statement that the lognormal distribution is a ‘ruling distribution’ of film style cannot be justified. In fact, in neither sample is there a majority of films with a lognormal distribution. If an analysis of film style is conducted using the assumption of lognormality then it is likely that the results will be unreliable.

The geometric mean is a superior measure of central tendency for skewed data sets with lognormal distributions. However, as no evidence has been presented that would justify the assumption that shot lengths are lognormally distributed the use of the geometric mean is questionable. Again, the median shot length is available as an alternative that can be used reliably as it locates the centre of a distribution as the middle ranked value in a data set, and does not rely on an underlying probability distribution.

## Do shot length distributions have a characteristic shape factor?

Each theoretical distribution is described by a set of parameters. The Lognormal distribution is described by two parameters: the mean of the logarithms, *μ*, and the shape factor, *σ*. Salt has claimed that the characteristic shape factor for the Lognormal shot length distributions of a motion picture is ~0.9 [5]. The relevance of this claim is lessened by the fact that there is no evidence to justify the claim that shot lengths are lognormally distributed. This claim is different to the one made in Salt (2006: 393), where it was asserted that the shape factor will lie in the interval 0.7 to 0.9.

Again, it is a simple matter to test both these claims.

#### Hypotheses

The first research question we are addressing here is ‘the lognormal shape factor of a shot length distribution is 0.9.’ The statistical hypothesis is:

- H_{0}: the shape factor (*σ*) = 0.9

The second hypothesis we will address is the claim that ‘the lognormal shape factor of a shot length distribution will lie in the interval 0.7 to 0.9.’

#### Sample

The two samples of Hollywood films used above are employed in this test.

#### Method

The shape factor for each film is determined by maximum likelihood estimation (MLE) for the lognormal distribution. The mean value of *σ* for each data set is then calculated, and compared to the hypothesised value of 0.9 using a one-sample *t*-test. A *p*-value of less than 0.05 is considered significant. The proportion of films with *σ* in the range 0.7 to 0.9 is then calculated, along with an approximate 95% confidence interval using the adjusted Wald method.
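For the lognormal distribution the maximum likelihood estimates have a closed form: *μ* is the mean of the logarithms and *σ* is their standard deviation with *n* in the denominator. The following is a minimal sketch of that calculation, with a hypothetical function name and made-up shot lengths; it is not the online calculator used for the results reported here.

```python
import math


def lognormal_mle(shot_lengths):
    """Closed-form MLE for the lognormal: mean and population SD of the logarithms."""
    logs = [math.log(x) for x in shot_lengths]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)  # denominator n, not n - 1
    return mu, sigma


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.5, 2.2, 3.0, 4.1, 4.8, 6.3, 9.7, 15.2]
mu_hat, sigma_hat = lognormal_mle(shots)
print(f"mu = {mu_hat:.4f}, shape factor sigma = {sigma_hat:.4f}")
```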

MLE is performed using online calculators (Wessa 2008), and the *t*-test is performed using Microsoft Excel 2007. Graphpad online calculators were used to produce the confidence intervals for the proportions.

#### Results

The Lognormal shape factor for each film is presented in Table 4 for the silent films and Table 5 for the sound films.

**TABLE 4** Lognormal shape factors for silent films produced in Hollywood, 1920 to 1928 inclusive

**TABLE 5** Lognormal shape factors for sound films produced in Hollywood, 1929 to 1931 inclusive

The mean shape factor of the silent films is 0.7437 (*SD* = 0.0617), and is significantly lower than the hypothesised value of 0.9, *t* (19) = 11.3303, *p* < 0.0001.

The mean shape factor of the sound films is 0.9411 (*SD* = 0.1066), and is significantly greater than the hypothesised value of 0.9, *t* (29) = 2.1134, *p* = 0.0433.
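The test statistics can be approximately recovered from the rounded summary figures above. This sketch computes only the one-sample *t* statistic, (mean − 0.9) / (*SD* / √*n*); the *p*-value requires the *t* distribution, which is not in the Python standard library, so it is not computed here. The function name is hypothetical.

```python
import math


def one_sample_t(sample_mean, sample_sd, n, hypothesised):
    """One-sample t statistic computed from summary statistics."""
    return (sample_mean - hypothesised) / (sample_sd / math.sqrt(n))


t_silent = one_sample_t(0.7437, 0.0617, 20, 0.9)
t_sound = one_sample_t(0.9411, 0.1066, 30, 0.9)

print(f"silent films: t(19) = {t_silent:.2f}")
print(f"sound films:  t(29) = {t_sound:.2f}")
```

Because the inputs are rounded, the results agree with the reported values only to about two decimal places. The statistic for the silent films is negative, since their mean lies below 0.9; the figure quoted in the text is its magnitude.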

If we take Salt’s alternative claim that *σ* will lie in the interval 0.7 to 0.9, then we can say that for the silent films this is a much more useful estimate, with a proportion of 0.70 (0.48, 0.86) in the specified interval. For the sound films, however, it is less good with a proportion of 0.47 (0.30, 0.64).

The hypothesised shape factor of 0.9 is not a good estimate for either sample, while the specified range of 0.7 to 0.9 is only a reasonable estimate for the silent films and even then we can expect over one-quarter of the films to lie outside this interval.

The claim that there is a characteristic shape factor for the distribution of shot lengths in a motion picture is not supported by the evidence.

## When is a mean shot length not a statistic?

There are clearly serious problems in using the arithmetic mean shot length as a statistic of film style, and Salt has tried to shift the justification for keeping the mean shot length to the argument that it can be used to estimate the median shot length [6]. To further add to the confusion of using the arithmetic mean with the lognormal distribution, we now have the claim that the mean shot length is both the desired statistic of film style and is desirable as a means of estimating the median. Why, if the mean shot length is the statistic we desire, do we need these methods of estimating the median? Why, if the median can be estimated from the mean, has no one ever used this estimated median to describe changes in film style? As before, it is a question of competing claims: it is either the mean or the median, as they are different for skewed data sets, and not both. As it is a simple matter to demonstrate that the mean shot length is not a robust statistic, then it should be disposed of. Again, if the mean shot length is not the desired statistic of film style, then it would be necessary to admit that every time the mean shot length has been quoted in books and journal articles, this was wrong.

This is all very well, but it begs a fundamental question: is the estimated median any good?

### Can the median shot length be reliably estimated from the mean shot length?

Salt proposes two methods for estimating the median shot length from the arithmetic mean shot length, which, for the sake of simplicity, I shall refer to as Method A and Method B:

- Method A: median = mean * 0.6
- Method B: median = mean / *exp*(0.5*σ*²), where *σ* is the shape factor.

These two methods should produce approximately the same results when *σ* = 0.9.

Again this is simply an assertion and Salt provides no data or results to back up this claim.
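The two methods are easily written down, which makes it possible to see how close they are when *σ* = 0.9. The function names below are my own labels for Method A and Method B, and the mean shot length of 9.0 seconds is used purely as an illustrative input.

```python
import math


def median_method_a(mean_shot_length):
    """Method A: median estimated as 0.6 times the mean."""
    return 0.6 * mean_shot_length


def median_method_b(mean_shot_length, sigma):
    """Method B: median estimated as mean / exp(0.5 * sigma^2)."""
    return mean_shot_length / math.exp(0.5 * sigma ** 2)


# Multiplier applied to the mean by Method B when sigma = 0.9.
factor_b = 1 / math.exp(0.5 * 0.9 ** 2)
print(f"Method B multiplier at sigma = 0.9: {factor_b:.3f}")

mean_sl = 9.0  # illustrative mean shot length in seconds
print(f"Method A: {median_method_a(mean_sl):.2f} s")
print(f"Method B: {median_method_b(mean_sl, 0.9):.2f} s")
```

With *σ* = 0.9, Method B multiplies the mean by about 0.667 against 0.6 for Method A, so the two agree only approximately even when the hypothesised shape factor holds.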

#### Sample

The two samples of Hollywood films used above are employed in this test.

#### Method

For clarity, the following symbols are used:

- *Med* is the true value of the median shot length.
- *Med*_{A} is the estimate of the true value of the median using Method A.
- *Med*_{B} is the estimate of the true value of the median using Method B.

As Salt claims that *σ* = 0.9, using Method B is immediately problematic as I have already demonstrated that this is not a good estimate of the shape factor of the films in the two samples. In order to allow for this Method B is used twice – once where *σ* = 0.9, and once where *σ* is the MLE-derived value in Tables 4 and 5.

The value of *Med*_{A} or *Med*_{B} is considered a good estimate for *Med* if it is included in the 95% confidence interval of *Med*. Note that this is not the same as saying that *Med*_{A} or *Med*_{B} will be equal to *Med*, only that they will estimate *Med* if they lie within an interval with a specified confidence level. These methods will therefore introduce some error into any analysis even when they are good estimates, but this error will be known.

The confidence intervals for the median were calculated using the binomial method. It is important to remember that while the shot length data itself is not binomially distributed, the median shot length is determined by its rank in the ordered sample. Therefore, when we calculate the confidence interval for a median we apply the binomial method to the ranks of the ordered data and then transpose this on to the ordered data, i.e. we calculate the ranks of the lower (*j*) and upper (*k*) limits of the interval for the proportion 0.5 and then take the shot lengths that are ranked *j*th and *k*th in the ordered data. The binomial method is NOT applied to the shot lengths themselves. Using the binomial method tends to produce a conservative interval, but all the intervals are at least 95% and no film has a confidence interval greater than 96.41% [7]. See Curwin and Slater (2008: 296) for a simple introduction on how to do this and the large sample approximation.
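The rank-based procedure can be sketched as follows. This is one common way of constructing the exact binomial interval, under the assumption that *j* is chosen as the largest rank whose lower-tail probability under Binomial(*n*, 0.5) does not exceed half the significance level, with *k* = *n* − *j* + 1; the function names are hypothetical and the details may differ from the textbook procedure cited above.

```python
from fractions import Fraction
from math import comb


def median_ci_ranks(n, conf=0.95):
    """Ranks (j, k) of the order statistics bounding the median, plus the actual coverage."""
    alpha = 1 - conf
    tail = Fraction(0)  # P(B <= j - 1) for B ~ Binomial(n, 0.5)
    j = 0
    for r in range(n + 1):
        candidate = tail + Fraction(comb(n, r), 2 ** n)
        if candidate > alpha / 2:
            break
        tail, j = candidate, r + 1
    if j == 0:
        raise ValueError("sample too small for the requested confidence level")
    return j, n - j + 1, float(1 - 2 * tail)


def median_ci(shot_lengths, conf=0.95):
    """Shot lengths ranked jth and kth in the ordered data, plus the actual coverage."""
    ordered = sorted(shot_lengths)
    j, k, coverage = median_ci_ranks(len(ordered), conf)
    return ordered[j - 1], ordered[k - 1], coverage


def is_good_estimate(estimate, shot_lengths, conf=0.95):
    """True if an estimated median lies inside the confidence interval of the true median."""
    lower, upper, _ = median_ci(shot_lengths, conf)
    return lower <= estimate <= upper


j, k, coverage = median_ci_ranks(10)
print(f"n = 10: ranks {j} and {k}, coverage = {coverage:.4f}")
```

The ranks depend only on the sample size; the shot lengths enter only when the *j*th and *k*th ordered values are read off. Exact fractions are used so that the calculation does not overflow for films with a very large number of shots.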

The proportion of good estimates is calculated, along with an approximate 95% confidence interval using the adjusted Wald method.

#### Results

The results for the sample of silent films are presented in Table 6, and for the sound films in Table 7.

**TABLE 6** Median estimation for silent films produced in Hollywood, 1920 to 1928 inclusive

**TABLE 7** Median estimation for sound films produced in Hollywood, 1929 to 1931 inclusive

For the silent films, Method A produces a result that lies in the confidence interval of the true median only 4 times out of twenty (*P* = 0.20 [0.04, 0.37]). If we use this method we can expect our estimate to be outside the given confidence interval 80% of the time. Method B fares better for the silent films when *σ* = 0.9: out of twenty trials, the estimate was within the confidence interval for the true median on 13 occasions (*P* = 0.65 [0.43, 0.82]) – but this still means that it provides a poor estimate for approximately 1 in 3 films. When *σ* is the value derived by MLE, then the number of estimates that fall in the confidence interval of the true median is zero.

For the sound films, Method A provides a good estimate on 21 out of 30 occasions (*P* = 0.70 [0.52, 0.83]); and for Method B (*σ* = 0.9), the median is also well estimated 21 times. When *σ* is the value derived by MLE, then the number of estimates that fall in the confidence interval of the true median is 25 (*P* = 0.83 [0.66, 0.93]). These three methods when applied to the sample of sound films provide good estimates for the same film on 14 occasions, two methods provide good estimates on a total of twelve occasions (but it was not necessarily the same two for each of these twelve films), and on one occasion only a single method provides a good estimate. There are three films for which no method provides a good estimate. The different methods, then, provide different results for the same films.

The different methods proposed by Salt perform inconsistently across the two samples, and also produce different results when applied to the same sample. Overall, neither method provides a sound means of estimating the median shot lengths, and relying on median shot lengths estimated by these methods in the analysis will incorporate a large degree of error into the results as at least 17% of those estimates can be expected to lie outside the 95% confidence interval of the true median.

## Summary

Salt has made a number of assertions about the appropriate methodology for the statistical analysis of film style. When this methodology is examined in detail, and these claims are subject to statistical hypothesis tests, they cannot be justified:

- The mean shot length is not a reliable statistic of film style. The median and the geometric mean are both more reliable measures of central tendency for shot length distributions that are positively skewed with outlying data points.
- There is no evidence that the majority of films have shot lengths that are lognormally distributed, let alone any evidence to support the claim that the lognormal distribution is a ‘ruling distribution’ of film style. Consequently, the use of the geometric mean as a measure of central tendency is less reliable than that of the median.
- There is no evidence to support the claim that the characteristic shape factor of the distribution of shot length in a motion picture is 0.9; while the claim that the shape factor will lie in the interval 0.7 to 0.9 produces inconsistent results across the samples examined here, with between a quarter and a half of the films outside this interval.
- The methods for estimating the median shot length from the mean shot length are inconsistent, and are not sufficiently reliable. Use of these methods to estimate the median from the mean shot length will introduce a large amount of error into a study.

The implications for film studies are depressing. The mean shot length has been used as a statistic of film style for over thirty years in a number of publications by a number of prominent film scholars (e.g. Barry Salt, David Bordwell, Warren Buckland, Charles O’Brien, Yuri Tsivian, Colin Crisp, etc.). Unfortunately all this research is simply wrong, and as these studies have been further cited by other scholars this mistake has been multiplied. There is now a whole range of so-called ‘statistical analyses’ of film style out there, but none of it is, in fact, correct. The statistical analysis of film style can make a significant and positive contribution to our understanding of the cinema, but this first requires an understanding of statistics. Before the statistical analysis of film style can be good film studies, it must first be good statistics. Good statistics is the one thing we do not have at present. This problem goes back 35 years to the introduction of the mean shot length as a statistic of film style.

What is truly disheartening is that the mistakes made by film scholars in this area are elementary: in the UK, knowing when to use the mean and the median is GCSE statistics. The current specification for the AQA statistics syllabus clearly requires students – not university professors with Ph.D.s, but 14-16 year old school pupils – to understand the ‘advantages and disadvantages of each of the three measures of location [mean, median, mode] in a given situation,’ and to provide a ‘reasoned choice of a measure of location appropriate to the nature of the data and the purpose of the analysis.’ You can even get extra marks if you discuss the geometric mean! Any basic statistics course will tell you that you need to cite measures of dispersion alongside measures of central tendency. Every text book ever written on the subject discusses the meaning of the word ‘significant’ in the context of statistics. The application of statistics in film studies falls below these basic standards.

Do not take my word for it. Go and learn some statistics, or ask a statistician to show you how to do it. Get some data and do the analyses for yourself, and by analysis I mean actually formulate the hypotheses and do the tests rather than simply asserting two numbers are different and that this is ‘significant.’ Do not simply quote statistics when you do not know from what population the sample was drawn, or when you do not know what the statistics are supposed to describe, or when you do not know what decision rule was employed, or when you do not know what tests were used. There really is nothing difficult about any of this.

*Nullius addictus jurare in verba magistri*

## Notes

- I have assumed that film scholars will be using statistical tests to test hypotheses about data, but I have not actually come across anyone who has used a single statistical test in film studies. It is typical for film scholars to cite some means (without any accompanying measure of dispersion), and then simply to assert that a difference does or does not exist and that such a result is ‘significant.’ What they mean by ‘significant’ is not clear, but this is a term with a precise meaning in statistics and should not be abused. The statistical analysis of film style is scarcely statistics.
- The reason for using the population variance is to be consistent with the MLE values given for *σ*, which were calculated based on the population standard deviation.
- See Salt’s comment to my post ‘Testing Normality in Cinemetrics’ dated 21 May 2009.
- See Salt’s comment to my post ‘The impact of sound on film style’ dated 25 September 2009. Note that in his comment to this post and the one cited above in note 3 Salt gives two different sets of figures for a set of 40 films I tested by the same method used here, and that he gets both of them wrong. I can only assume that he has counted twice some films that appear in different posts. For the record: (1) this is not a representative sample drawn from a population, (2) there are 40 films in the table, and (3) half the films (20) have lognormal shot lengths. If we add the fifty films above to those forty films (remembering to remove the ones that overlap) we have a total set of 81 films, of which 35 have lognormal shot lengths.
- See Salt’s comment to my post ‘Location and spread in shot length distributions’ dated 15 November 2009.
- See Salt’s comment to my post ‘The impact of sound on film style’ dated 25 September 2009.
- A method for constructing exact confidence intervals for the median has been described by Bonett and Price (2002), and there is a spreadsheet that can be downloaded to do this automatically. By all accounts this should be a better method than the binomial, but I have not been able to get hold of a copy of the article in which this method is described and so I am reluctant to use it without first understanding how it works.
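
For readers who want to see what the binomial approach mentioned in the last note involves, the following is a minimal sketch of the standard order-statistic interval for the median. It assumes that this textbook construction is what is meant by ‘the binomial’ here; the function name `median_ci_binomial` and the example shot lengths are hypothetical and are not taken from any film analysed in this post. The idea is that the number of observations falling below the true median follows a Binomial(n, 0.5) distribution, so we can trim the same number of ordered observations from each end until the two tail probabilities together are no larger than 1 minus the requested confidence level.

```python
from math import comb


def median_ci_binomial(data, conf=0.95):
    """Distribution-free confidence interval for the median.

    Returns (lower, upper, achieved_coverage). The interval runs from the
    k-th smallest to the k-th largest observation, where k is the largest
    rank for which P(B <= k - 1) <= (1 - conf) / 2 with B ~ Binomial(n, 0.5).
    """
    xs = sorted(data)
    n = len(xs)
    alpha = 1.0 - conf
    total = 2 ** n  # denominator of every Binomial(n, 0.5) probability

    cum = 0  # running numerator of P(B <= j)
    k = 0    # rank of the lower limit (1-based); 0 means none found yet
    for j in range(n // 2):
        cum += comb(n, j)
        if cum / total > alpha / 2:
            break
        k = j + 1

    if k == 0:
        raise ValueError("sample too small for the requested confidence level")

    # Probability in one tail for the chosen k, i.e. P(B <= k - 1).
    tail = sum(comb(n, j) for j in range(k)) / total
    return xs[k - 1], xs[n - k], 1.0 - 2.0 * tail


if __name__ == "__main__":
    # Hypothetical shot lengths in seconds, for illustration only.
    shots = [1.2, 1.9, 2.4, 2.8, 3.1, 3.7, 4.5, 6.0, 9.8, 21.3]
    low, high, coverage = median_ci_binomial(shots)
    print(f"median CI: [{low}, {high}] with coverage {coverage:.4f}")
```

Because the binomial distribution is discrete, the achieved coverage is usually somewhat higher than the level requested, which is why the function reports it alongside the limits. The interval depends only on the ranks of the data, so a single very long take at the top of the sample does not move it.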

## References

**Agresti A and Coull B** 1998 Approximate is better than ‘exact’ for interval estimation of binomial proportions, *The American Statistician* 52: 119-126.

**Bonett DG and Price RM** 2002 Statistical inference for a linear function of medians: confidence intervals, hypothesis testing, and sample size requirements, *Psychological Methods* 7 (3): 370-383.

**Cotsaces C, Nikolaidis N, and Pitas I** 2009 Semantic video fingerprinting and retrieval using face information, *Signal Processing: Image Communication* 24 (7): 598-613.

**Curwin J and Slater R** 2008 *Quantitative Methods for Business Decisions*, sixth edition. London: Thomson Learning.

**Fujita K** 1989 Shot length distributions in educational TV programmes, *Bulletin of the National Institute of Multimedia Education* 2: 107-116.

**Geyer CJ** 2006 Breakdown point theory notes, http://www.stat.umn.edu/geyer/5601/notes/break.pdf, accessed 9 December 2009.

**Looney SW and Gulledge TR** 1985 Use of the correlation coefficient with normal probability plots, *The American Statistician* 39 (1): 75-79.

**Olsson U** 2005 Confidence intervals for the mean of a lognormal distribution, *Journal of Statistics Education* 13 (1), www.amstat.org/publications/jse/v13n1/olsson.html, accessed 18 November 2009.

**Salt B** 1974 Statistical style analysis of motion pictures, *Film Quarterly* 28 (1): 13-22.

**Salt B** 2006 *Moving into Pictures: More on Film History, Style, and Analysis*. Starwood, London.

**Taskiran CM and Delp EJ** 2002 A study on the distribution of shot lengths for video analysis, SPIE Conference on Storage and Retrieval for Media Databases, 20-25 January 2002, San Jose, CA. Available online: http://ctaskiran.com/papers/2002_ei_shotlen.pdf, accessed 7 August 2009.

**Truong BT and Venkatesh S** 2005 Finding the optimal temporal partitioning of video sequences, *Proceedings of IEEE International Conference on Multimedia and Expo*, 6-9 July 2005, Amsterdam, Netherlands: 1182-1185.

**Vasconcelos N and Lippman A** 2000 Statistical models of video structure for content analysis and characterization, *IEEE Transactions on Image Processing* 9 (1): 3-19.

**Wessa P** 2008 Maximum-likelihood lognormal distribution fitting (v1.0.2) in free statistics software (v1.1.23-r4), Office for Research Development and Education, http://www.wessa.net/rwasp_fitdistrlnorm.wasp/, accessed 15 November 2009.

Posted on December 10, 2009.

You misrepresent what I have said in the first paragraph of this piece. I have never said that the mean shot length gives the best description of the shot length distributions in films. It is merely the one I use most, for its convenience. I agree that the median is better — obviously.

And I have never said that the median can be determined accurately from the ASL. Merely that it can be roughly estimated.

From your results the lognormal distribution is undoubtedly the only standard distribution that does a reasonably good job of fitting shot length distributions, so it might well be called “the ruling distribution.”

I could say more, but suffice it to say that you are not going to make yourself any friends by twisting what people have said and then abusing them.