Parameters, statistics, and confidence intervals
One of the problems with the way in which statistics has been applied to the analysis of film style is the failure to respect the difference between parameters and statistics.
A parameter describes a characteristic of a population, whereas a statistic is an estimate of a parameter calculated from a sample drawn from the population under study.
If we determined the median shot length of every film produced in Hollywood in 1919, then we would be able to state the median of these medians as a parameter of the complete population of these films. However, it is not possible to find the median shot length of every Hollywood film produced in 1919: the films that we could analyse number in the hundreds, so it is not time- or cost-efficient to do so, and many films are no longer extant and cannot be analysed at all. We cannot know the median of the median shot lengths as a parameter of this population, but we can estimate it by finding the corresponding value for a sample of Hollywood films from 1919.
For example, the following sentence appears in Bordwell D, Staiger J, and Thompson K 1985 The Classical Hollywood Cinema: Film Style and Mode of Production to 1960. London: Routledge:
Between 1917 and 1927, the average shot ran between five and six seconds; between 1928 and 1934, the length was closer to eleven seconds (304).
This statement is not informative: it does not specify what is meant by ‘average’ (it’s actually the mean, which is the wrong average); the values given are vague (how ‘close’ is ‘closer to eleven’ and closer than what?); and it gives the impression that the quoted figures are parameters, when in fact they are statistics. These values are estimates based on a sample, although there is no indication of this.
A further problem is that the term ‘parameter’ is often confused with the term ‘variable’: Warren Buckland, for example, refers to ‘parameters of film style’ when he means ‘variables’ (see here). A variable is simply a characteristic that varies in value from subject to subject: the number of close-ups in a film is a variable (it varies from film to film), while the mean number of close-ups in a sample of films is a statistic that estimates the mean number of close-ups in the population of films. For a useful review of the distinction between the terms ‘parameter’ and ‘variable’ see Altman DG and Bland JM 1999 Variables and parameters, British Medical Journal 318: 1667, which can be accessed here.
To summarise, the statistic calculated from a sample is an estimate of the parameter of interest from the population under study. Statistics are typically denoted by Roman letters and parameters by Greek letters. For example, the mean of a population is μ and of a sample is M or ȳ (y-bar), the population standard deviation is σ and the sample standard deviation is s,* and the population correlation coefficient is ρ and that of a sample is r.
(* Note that the calculations of the population standard deviation and the sample standard deviation are slightly different: the former divides the sum of squared deviations by n, the latter by n - 1. See here.)
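The difference between the two calculations can be seen in a short Python sketch. It is only an illustration: it borrows the twenty median shot lengths listed later in this post, and the variable names are my own. The standard library's pstdev treats the values as a complete population (dividing by n), while stdev treats them as a sample (dividing by n - 1).

```python
import statistics

# Median shot lengths (seconds) of the 20 silent films used later in this post.
shot_lengths = [3.2, 3.2, 3.4, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.3,
                4.4, 4.4, 4.8, 4.9, 5.1, 5.2, 5.3, 5.4, 5.7, 5.7]

sigma = statistics.pstdev(shot_lengths)  # population formula: divide by n
s = statistics.stdev(shot_lengths)       # sample formula: divide by n - 1

print(f"population formula: {sigma:.2f} s")
print(f"sample formula:     {s:.2f} s")
```

For these values the sample formula gives about 0.85 seconds and the population formula about 0.83 seconds; the gap shrinks as the sample gets larger.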
A statistic is a point estimate of a parameter, but it is very unlikely that the two will be exactly identical. As an estimate of an unknown quantity, a statistic will be subject to sampling error. It is necessary to provide some information on the margin of error of the estimate. This can be done using confidence intervals, which allow us to understand the precision of our estimate by providing a range of possible values for a parameter.
A confidence interval allows us to state that the true value of a parameter lies within a range of values determined from the sample at a specified level of confidence (90%, 95%, 99%, etc.). The confidence level does not tell us that the probability that the statistic is the true estimate of a parameter is, say, 95%. It tells us that if we took repeated samples from the same population (say 100 samples of Hollywood films produced in 1919) and calculated the median of the median shot lengths and a 95% confidence interval for each sample, then we would expect about 95 of these 100 intervals to capture the true parameter. Of course, you cannot know if the 95% confidence interval you have calculated in any individual experiment is one of the 95% that capture the true value of the parameter or if it is one of the 5% that do not, but at least you can estimate and interpret the error of the experiment.
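The repeated-sampling interpretation can be checked with a small simulation. The Python sketch below rests on assumptions that are mine and not taken from any real data set: the population is taken to be normal with a mean of 4.4 seconds and a standard deviation of 0.85 seconds, each sample contains 20 films, and for simplicity the interval is the t-based interval for a mean (described in the next section) rather than one for a median.

```python
import math
import random
import statistics

random.seed(1919)  # fixed seed so that the run is repeatable

TRUE_MEAN, TRUE_SD = 4.4, 0.85  # hypothetical population parameters
N = 20                          # films per sample
T_CRIT = 2.093                  # two-tailed 5% critical value of t for df = 19
TRIALS = 5000                   # number of repeated samples

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's 95% interval capture the true mean?
    if m - T_CRIT * se <= TRUE_MEAN <= m + T_CRIT * se:
        hits += 1

coverage = hits / TRIALS
print(f"{coverage:.1%} of the intervals captured the true mean")
```

The proportion printed should be close to 95%, although it will differ slightly from run to run if the seed is changed. Any single interval either contains the true mean or it does not.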
We can also calculate confidence intervals for the difference between the statistics of two samples (e.g. a confidence interval for the difference between two medians or two means), and for proportions and differences of proportions. I will illustrate those methods in a later post (although see here for the 95% confidence interval for the difference between two proportions).
Confidence intervals for a mean
To calculate a confidence interval for a mean we need four pieces of information. From our sample, we determine the size of the sample (n), the sample mean (M), and the sample standard deviation (s).
We also need a critical value that determines the width of the interval, the value of which depends on the confidence level we wish to use (90%, 95%, 99%, etc.) and the size of the sample.* Because we must estimate the standard deviation from a sample, the example here uses a t-value. To find the value of t, we need to know the tail probability that corresponds to the confidence level (e.g. 5% for a 95% interval) and the degrees of freedom (df). The number of degrees of freedom for the confidence interval of a mean is simply the sample size minus 1.
The equation for calculating a confidence interval is very simple:
M ± (t × SE),
where SE is the standard error of the mean. In other words, the confidence interval is the mean plus or minus t multiplied by the standard error.
For example, using the distribution of the median shot lengths of silent films produced in Hollywood from my earlier study (here), we wish to calculate a 95% confidence interval for the mean of this data. The values (arranged in order of size) are 3.2, 3.2, 3.4, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.3, 4.4, 4.4, 4.8, 4.9, 5.1, 5.2, 5.3, 5.4, 5.7, 5.7.
Looking at this data, we find that the sample size is 20, the mean median shot length is 4.4 seconds, and the standard deviation is 0.85 seconds. To construct a 95% confidence interval for the mean:
- Calculate the standard error by dividing the standard deviation by the square root (√) of the sample size: SE = s/√n = 0.85/√20 = 0.85/4.47 = 0.19.
- Find the required t-value using Microsoft Excel’s TINV function, where the probability is 100% - 95% = 5% and the degrees of freedom is n - 1 = 20 - 1 = 19. Entering =TINV(0.05, 19) in Excel gives us a value of 2.09 for t.
- Multiply the standard error by t: 0.19 × 2.09 = 0.40.
- To find the lower limit of the confidence interval we subtract 0.40 from the mean, and to find the upper limit we add 0.40 to the mean. This gives us a 95% confidence interval of 4.0 to 4.8 seconds.
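The steps above can also be sketched in Python. This is a minimal illustration and not a general-purpose routine: the critical value of t is hard-coded for 19 degrees of freedom because Python's standard library has no t-distribution, so it would need to be changed for any other sample size, and the variable names are my own.

```python
import math
import statistics

# Median shot lengths (seconds) of the 20 silent films listed above.
shot_lengths = [3.2, 3.2, 3.4, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.3,
                4.4, 4.4, 4.8, 4.9, 5.1, 5.2, 5.3, 5.4, 5.7, 5.7]

n = len(shot_lengths)
mean = statistics.mean(shot_lengths)
s = statistics.stdev(shot_lengths)  # sample standard deviation
se = s / math.sqrt(n)               # standard error of the mean

# Two-tailed 5% critical value of t for df = n - 1 = 19, i.e. the value
# returned by =TINV(0.05, 19) in Excel. With SciPy installed,
# scipy.stats.t.ppf(0.975, 19) should return the same figure.
T_CRIT = 2.093

margin = T_CRIT * se
lower, upper = mean - margin, mean + margin
print(f"mean = {mean:.1f} s, 95% CI: {lower:.1f}, {upper:.1f}")
```

Run on the data above, this reproduces the interval of 4.0 to 4.8 seconds.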
When reporting our results we write that ‘the mean median shot length of silent Hollywood films produced in the 1920s is estimated to be 4.4s (95% CI: 4.0, 4.8).’ (In practice the ‘estimated to be’ bit is often missed out because it is obvious that we are talking about a statistic rather than a parameter – we should have described the sample in either the methodology and/or results sections of our paper). A point estimate and its confidence interval can be represented graphically (see Figure 1).
Figure 1 95% confidence intervals for the mean and median of the median shot lengths of silent Hollywood films produced in the 1920s (n = 20)
If you find this all too strenuous, then you can use the online calculators at Graphpad to do it for you (here). The calculator for the confidence interval of a mean is found under the ‘continuous data’ menu.
* For a large sample (n > 120) we can use the appropriate z-value (e.g. 1.96 for a 95% confidence interval) instead of t, as t gets closer to z as the sample size increases. Most textbooks will have a table of t-values for df of up to 120. If you want to find the critical z-value for a probability of 5% using Excel then you have to divide the probability by 2 and use =-NORMSINV(0.025), which gives 1.96 (note the minus sign before the command). Quite why Excel requires you to enter the probability differently for t and z is unclear, but it’s easy to get caught out.
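The same halving of the probability appears if the critical z-value is computed in Python. The sketch below uses NormalDist from the standard library; the helper name z_critical is my own and is not part of any library.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical z-value for a given confidence level (e.g. 0.95)."""
    alpha = 1 - confidence
    # Split alpha equally between the two tails, as with =-NORMSINV(0.025).
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} interval: z = {z_critical(level):.2f}")
```

For a 95% interval this gives the familiar value of 1.96.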
Confidence intervals for a median
Calculating a confidence interval for a median is more complicated because it is harder to algebraically manipulate the median than the mean.
We can produce a confidence interval for the median using the binomial method, which is simple to do, but the resulting intervals tend to be conservative (i.e. their true coverage tends to be slightly greater than the stated coverage). They may also be asymmetric.
The median is an order statistic (it is the middle value of a data set ordered from smallest to largest), and to find its confidence interval we need to find the jth and kth values in the ordered data set using the equations
j = nq – 1.96√(nq × [1-q])
k = nq + 1.96√(nq × [1-q]),
where n is the sample size, q is the quantile for which we wish to find a confidence interval (for the median, q = 0.5), and 1.96 is the critical value for a 95% confidence interval. Using the same data that we used above, n = 20, so
j = (20 × 0.5) – (1.96 × √[20 × 0.5 × (1-0.5)]) = 10 – (1.96 × √5) = 5.6
k = (20 × 0.5)+ (1.96 × √[20 × 0.5 × (1-0.5)]) = 10 + (1.96 × √5) = 14.4
Clearly there isn’t a 5.6th or 14.4th value, so we round these figures up to the nearest integer to give j = 6 and k = 15. Now we find the 6th and 15th values in our dataset, counting from the smallest, and these values are the lower and upper limits of our confidence interval. From the above ordered list we see that the 6th value is 3.7 seconds and that the 15th value is 5.1 seconds. We can now present our results stating that ‘the median of the median shot lengths of silent Hollywood films produced in the 1920s is estimated to be 4.4 seconds (95% CI: 3.7, 5.1).’ Note that this confidence interval is slightly too wide: the actual coverage is 95.86%. It is also wider than the confidence interval calculated for the mean, because the standard error of the mean for a normal distribution is smaller than that of the median (the mean is more efficient if the data is normal). However, if the data contains outliers (such as the interquartile range data for early sound films from the same paper), then the confidence intervals for the mean may be the wrong size and lead to faulty estimates. The median is more robust than the mean when dealing with this problem.
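The binomial method can be sketched in a few lines of Python. The code follows the equations given above, including rounding both j and k up, and also works out the exact coverage of the resulting interval; the variable names are my own. The coverage calculation assumes only that each observation independently has a 50% chance of falling below the population median.

```python
import math

# Median shot lengths (seconds) of the 20 silent films, in ascending order.
shot_lengths = sorted([3.2, 3.2, 3.4, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.3,
                       4.4, 4.4, 4.8, 4.9, 5.1, 5.2, 5.3, 5.4, 5.7, 5.7])

n = len(shot_lengths)
q = 0.5   # the median
z = 1.96  # critical value for a 95% interval

half_width = z * math.sqrt(n * q * (1 - q))
j = math.ceil(n * q - half_width)  # rank of the lower limit
k = math.ceil(n * q + half_width)  # rank of the upper limit

lower = shot_lengths[j - 1]  # ranks count from 1, list indices from 0
upper = shot_lengths[k - 1]

# The interval from the jth to the kth ordered value contains the population
# median whenever between j and k - 1 observations fall below it, and that
# count follows a Binomial(n, 0.5) distribution.
coverage = sum(math.comb(n, i) for i in range(j, k)) / 2 ** n

print(f"j = {j}, k = {k}, 95% CI: {lower}, {upper}")
print(f"exact coverage = {coverage:.2%}")
```

With the data above this gives j = 6 and k = 15, limits of 3.7 and 5.1 seconds, and an exact coverage of 95.86%.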
An easier route to finding the answer is to use the Excel spreadsheet designed by Gianmarco Alberti (here, and which does lots of other useful things as well) or the Excel spreadsheet accessed from the University of Groningen here. Both spreadsheets apply the method described by Bonett DG and Price RM 2002 Statistical inference for a linear function of medians: confidence intervals, hypothesis testing, and sample size requirements, Psychological Methods 7 (3): 370-383. Using these spreadsheets we have a 95% confidence interval for the median of the above data of 3.7 to 5.0 seconds, which compares well to the binomial method even when using a sample as small as the one here (see Figure 1).
Ranges of values
David Bordwell has stated that it is preferable to define a range of values for the average shot lengths of a group of films rather than rely upon a single point estimate (see Bordwell D 2005 Figures Traced in Light: On Cinematic Staging. Cambridge, MA: Harvard University Press: 273, n37). This is an entirely reasonable approach, but it is not clear what Bordwell means when he refers to a ‘range of probable choice’ for filmmakers or what the ranges actually specify. (Do filmmakers choose to make their films so that they have a probable average shot length? What does ‘probable’ mean in this context?) For example, in Bordwell et al. (1985: 304) we find the following sentence relating to early sound films:
… most commonly, a film’s average shot length would lie between eight and fourteen seconds.
How are we supposed to interpret this statement? What does ‘most commonly’ mean in this context? Does it mean ‘more than half’? Or does it mean 75%, 90%, or some other value? (Note again that ‘average’ is not defined, and that it is the wrong average.) Without a point estimate, we cannot understand how the average shot lengths vary: if the mean average shot length is 11 seconds then we can see that, ‘most commonly,’ the mean shot length lies in an interval of 11 ± 3 seconds. We would interpret this interval differently if the mean average shot length was 13 seconds (assuming we knew what the interval actually represented).
Confidence intervals have a defined meaning and they are related to the point estimate of the parameter we are interested in, but they are somewhat different to the type of intervals suggested by Bordwell.
The five-number summary (the minimum value, the lower quartile, the median, the upper quartile, and the maximum value) fulfils the statistical needs of Bordwell exactly. From the five-number summary we can find the range (the maximum minus the minimum) that includes all the data. Or we can find the interquartile range (the upper quartile minus the lower quartile), which describes the spread of the middle 50% of the data. This information can be represented visually by a box plot (see Figure 2). If we so desired, we could find the range between any two points by simply calculating their percentiles and subtracting the smaller value from the larger.
The range of the above data set is 5.7 – 3.2 = 2.5 seconds.
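The interquartile range is approximately 1.5 seconds. (Rounded to one decimal place the quartiles are 3.7 and 5.1 seconds, which differ by 1.4 seconds; the figure of 1.5 is consistent with the unrounded quartiles of 3.65 and 5.125 seconds obtained by linear interpolation between the ordered values, which differ by 1.475 seconds.) Note that although the interquartile range and the 95% confidence interval for the median cover the same values, they have very different meanings. The interquartile range is a measure of dispersion and tells us that the middle 50% of the data lies between 3.7 and 5.1 seconds, whereas the confidence interval is a measure of precision and gives us a range of plausible values for the population median.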
Figure 2 Box plot of the median shot lengths of silent Hollywood films produced in the 1920s (n = 20). The diamond represents the sample mean.
If we remove the two lowest values and the two largest values (i.e. the bottom 10% and the top 10%), then we can see that 80% of the values lie between 3.4 and 5.4 seconds.
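The five-number summary and the ranges derived from it can be sketched in Python as follows. One assumption needs flagging: there are several conventions for calculating quartiles, and the ‘inclusive’ method used here (linear interpolation between the ordered values) is my choice, so the quartiles may differ in the second decimal place from those produced by other software. The variable names are also my own.

```python
import statistics

# Median shot lengths (seconds) of the 20 silent films, in ascending order.
shot_lengths = sorted([3.2, 3.2, 3.4, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.3,
                       4.4, 4.4, 4.8, 4.9, 5.1, 5.2, 5.3, 5.4, 5.7, 5.7])

# Quartiles by linear interpolation (an assumed convention, see above).
q1, median, q3 = statistics.quantiles(shot_lengths, n=4, method="inclusive")

five_number = (shot_lengths[0], q1, median, q3, shot_lengths[-1])
data_range = shot_lengths[-1] - shot_lengths[0]  # spread of all the data
iqr = q3 - q1                                    # spread of the middle 50%

# Drop the two smallest and the two largest values (the bottom and top 10%
# of 20 films) to find the limits of the middle 80% of the data.
trimmed = shot_lengths[2:-2]
middle_80 = (trimmed[0], trimmed[-1])

print("five-number summary:", five_number)
print(f"range = {data_range:.1f} s, interquartile range = {iqr:.1f} s")
print("middle 80% lies between", middle_80)
```

Any other pair of percentiles could be used in the same way, provided the report states which ones were chosen.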
There is no reason why a statistic, its confidence interval, and a range value could not be given to describe the style of a group of films, but their meaning needs to be made clear. What is lacking in statistical research into film style is simply proper description of the study conducted, proper description of the data, and conclusions that have a clearly defined statistical meaning. The mean isn’t very interesting on its own – it’s the meaning of the mean that counts and that requires clarity and confidence intervals.
Posted on February 10, 2011, in Cinemetrics, Film Analysis, Film Studies, Film Style, Statistics.