This post addresses some issues raised by Mike Baxter as part of the ‘cinemetrics conversation’ at the Cinemetrics website (and is the post I would have produced last week had I been able to remember which bit of software had the right command to create the necessary graph). You can find an introduction to the conversation here and my first response to some of the issues raised here.
I want to address two issues: first, the nature of outliers in shot length distributions and better methods of representing such distributions than I have used up to now; and, second, the straw-man the median shot length has become in Baxter’s comments.
Baxter’s comments in response to the earlier post can be found in the second tab under his name here. In section 2 Baxter questions my use of the term ‘outlier’ and the definition used to identify such shots. This is fair enough – we wouldn’t get very far if such definitions weren’t questioned. In the examples of Lights of New York and The Scarlet Empress, Baxter argues there is no evidence of outliers since
it’s difficult to identify any point at which ‘extremes’ begin, or discontinuities in the distribution of the kind I think are needed to assert, with any confidence, that you are dealing with ‘outliers.’
Baxter never defines what such a discontinuity would look like, and so his argument is vague. (Arguably this is the semantic version of a slippery slope.)
Figure 1 shows the kernel density and box plot of Lights of New York. There is a 12.2 second gap between the five shots of longest duration and the sixth longest, which is presumably the sort of discontinuity Baxter refers to; he does concede that he might be prepared to accept five shot lengths as extreme values (though he does not say on what basis). From Figure 1 we can see there are in fact several such discontinuities, and that the kernel density is zero at several points in the upper tail (indicating the kernels do not overlap), particularly above 30 seconds (which corresponds to the 22 extreme outliers identified using this type of box plot). However, a limitation of this boxplot is that it does not take into account the skew of the distribution, and so over-identification of outliers is a problem.
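As a rough illustration of how a standard box plot flags such shots, here is a minimal sketch that assumes the conventional Tukey fences: 1.5 interquartile ranges beyond the quartiles for outliers and 3 interquartile ranges for extreme outliers. The post does not state which rule produced Figure 1, so that rule, the quartile convention, the helper name `tukey_fences`, and the shot lengths are all assumptions made for illustration (these are not the data for Lights of New York).

```python
import statistics


def tukey_fences(values, k):
    """Return (lower, upper) fences placed k interquartile ranges beyond the quartiles."""
    # The 'inclusive' quartile convention is an assumption; other software may differ.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.2, 2.0, 2.4, 3.1, 3.8, 4.5, 5.0, 6.2, 7.9, 9.5,
         12.0, 18.0, 31.0, 48.5, 75.0]

inner = tukey_fences(shots, 1.5)  # beyond these: outliers
outer = tukey_fences(shots, 3.0)  # beyond these: extreme outliers

extreme = [s for s in shots if s < outer[0] or s > outer[1]]
mild = [s for s in shots
        if (s < inner[0] or s > inner[1]) and s not in extreme]

print("inner fences:", inner)
print("outer fences:", outer)
print("mild outliers:", mild)
print("extreme outliers:", extreme)
```

Because both fences sit the same distance either side of the quartiles, a long right tail pushes many shots beyond the upper fence while the lower fence falls below zero and can never be crossed. This is the over-identification problem described above.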
Figure 2 presents the same data using an adjusted boxplot that takes into account the skewed nature of the data. This method uses the medcouple, a robust measure of skewness, to set asymmetric fences for identifying outliers. The adjusted boxplot can be generated using the adjbox() command in the R package robustbase.
The number of outliers in Figure 2 is far smaller than in the original boxplot: in the upper tail 10 shots greater than 55 seconds are identified as outliers (or 3% of the total). Nonetheless, there are still some values which are sufficiently removed from the rest of the data to be classed as outliers even when accounting for the asymmetry of the distribution. Whether or not Baxter would accept this definition would depend on the interpretation of his use of the term ‘discontinuity,’ which he does not define.
Surprisingly, this method identifies three outliers in the lower tail of the distribution (which I wasn’t expecting and will have to think about more).
The following article describes the adjusted boxplot and its calculation:
Hubert M and Vandervieren E 2008 An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52 (12): 5186-5201. An ungated, earlier version of this paper can be accessed here.
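The calculation described in that article can be sketched roughly as follows. This is not the code behind adjbox(): it is a naive, quadratic-time medcouple written for readability, and the function names, the quartile convention, and the shot lengths are hypothetical choices made for illustration. The fence multipliers (exp(-4·MC) and exp(3·MC) for non-negative medcouple, exp(-3·MC) and exp(4·MC) otherwise) are those given in the article.

```python
import math
import statistics


def medcouple(values):
    """Naive O(n^2) medcouple: median of the kernel over pairs xi <= median <= xj."""
    x = sorted(values)
    m = statistics.median(x)
    lower = [v for v in x if v <= m]
    upper = [v for v in x if v >= m]
    k = sum(1 for v in x if v == m)  # observations tied with the median
    first_tie = len(lower) - k       # index in `lower` where the ties begin
    kernel = []
    for i, xi in enumerate(lower):
        for j, xj in enumerate(upper):
            if xi == xj:
                # Both values equal the median: use the sign rule for ties.
                s = (i - first_tie + 1) + (j + 1) - 1
                kernel.append((s > k) - (s < k))
            else:
                kernel.append(((xj - m) - (m - xi)) / (xj - xi))
    return statistics.median(kernel)


def adjusted_fences(values):
    """Return (lower fence, upper fence, medcouple) for the adjusted boxplot."""
    # The 'inclusive' quartile convention is an assumption, not taken from the article.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(values)
    if mc >= 0:
        low = q1 - 1.5 * math.exp(-4 * mc) * iqr
        high = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:
        low = q1 - 1.5 * math.exp(-3 * mc) * iqr
        high = q3 + 1.5 * math.exp(4 * mc) * iqr
    return low, high, mc


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.2, 2.0, 2.4, 3.1, 3.8, 4.5, 5.0, 6.2, 7.9, 9.5,
         12.0, 18.0, 31.0, 48.5, 75.0]

low, high, mc = adjusted_fences(shots)
q1, _, q3 = statistics.quantiles(shots, n=4, method="inclusive")
print("medcouple:", mc)
print("standard fences:", (q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)))
print("adjusted fences:", (low, high))
```

For right-skewed data the medcouple is positive, so the upper fence moves further out and fewer long shots are flagged. The same adjustment pulls the lower fence in towards the first quartile, which may be one reason the method can identify outliers in the lower tail.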
Even if we accept Baxter’s argument that there are no outliers in Lights of New York it remains necessary to be aware of the problems caused by outliers in data sets and to check the distribution of shot lengths so that we are not fooled by non-robust statistics. Certainly more effort will have to be devoted to defining what is or is not an outlier (in either statistical or filmic terms) in research of this type. (But it is much easier when you remember which bit of software to use.)
Finally, I wish to address a misrepresentation that has taken hold at this early stage in the ‘cinemetrics conversation.’ Baxter writes that
the use of either the ASL or median as the statistic for attempting to summarise ‘style’ doesn’t make much sense (as Salt observes) [original emphasis].
This argument is a straw-man.
I have never stated that the median shot length is the statistic for describing film style. I have argued that the median shot length is better than the mean shot length for describing film style, and should therefore be preferred for the following reasons:
- the median is conceptually simple and easy to calculate, and is certainly no more difficult than the mean.
- the median shot length has a clearly defined meaning and the difference between two median shot lengths is also meaningful, whereas the meaning of the mean and of the difference between two mean shot lengths is not clear in either case (and seems to change every time I raise an objection against them).
- the median shot length behaves predictably under a monotone transformation (the median of the log-transformed data is simply the logarithm of the median of the original data, exactly so when the number of shots is odd), while the possibilities for confusing the arithmetic and geometric means are endless.
- the median locates the centre of a distribution irrespective of its shape, whereas this is not true of the mean.
- the median is not affected by outliers or extreme values (however you choose to define them), whereas this is not true of the mean.
- interpretations of film style based on the median shot length are consistent with graphical methods and (it turns out) with dominance statistics (Cliff’s d, HLΔ), while those based on the mean shot length are not.
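Two of the properties listed above, the behaviour of the median under a logarithmic transformation and its resistance to extreme values, can be checked with a few lines of code. This is only a small sketch using Python's standard library; the shot lengths are hypothetical, and an odd number of shots is used so that the median is a single observed value.

```python
import math
import statistics

# Hypothetical shot lengths in seconds (odd count), for illustration only.
shots = [2.0, 3.5, 4.0, 5.5, 7.0, 9.0, 60.0]
logs = [math.log(s) for s in shots]

# Median: transforming, taking the median, and back-transforming recovers it.
median_raw = statistics.median(shots)
median_back = math.exp(statistics.median(logs))

# Mean: the same round trip gives the geometric mean, not the arithmetic mean.
mean_raw = statistics.mean(shots)
geometric_mean = math.exp(statistics.mean(logs))

# Make the longest shot ten times longer and see which statistic moves.
stretched = shots[:-1] + [600.0]
median_stretched = statistics.median(stretched)
mean_stretched = statistics.mean(stretched)

print("median:", median_raw, "back-transformed median of logs:", median_back)
print("mean:", mean_raw, "back-transformed mean of logs:", geometric_mean)
print("after stretching the longest shot -> median:", median_stretched,
      "mean:", mean_stretched)
```

The median is the same shot before and after the transformation and does not move when the longest shot is stretched, whereas the mean changes on both counts.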
But I have always argued that it is important to use a range of statistical methods to get a full understanding of the nature of film style.
As far as I am aware I am the only person writing about film style to even consider the dispersion of shot lengths in a motion picture and the appropriate methods to use for this. I am also the only person to use a range of graphical methods (probability plots, boxplots, empirical cumulative distribution functions, kernel densities, order structure matrices, running Mann-Whitney Z statistics, rank-frequency plots) to describe film style. I am the only person in film studies to employ confidence intervals, statistical hypothesis tests, effect sizes, or even to describe the methodologies I use in studying film style. (Others working outside film studies in disciplines where quantitative methods are commonplace also use such tools as a matter of routine, and those within film studies would do well to learn from their example.)
I am also the only person who has attempted to describe these methods so that others may try to analyse film style for themselves. I am the only person who has brought to the attention of researchers in film studies the availability of free learning resources and software for statistics. I am the only person to look outside film studies for empirical research on film style and to bring it to the attention of film scholars. I am the only person to address the issue of statistical literacy in film studies (here and here).
Baxter writes that
the accessibilty of computational power, and essential simplicity of important statistical ideas (however mathematically complex) is a hobby-horse of sorts.
I am glad to hear this, because it means that if someone else is prepared to devote some time and effort to explaining statistical concepts and methods to film scholars then I won’t have to do it on my own.
However, as Baxter presents the argument, I am interested only in the median shot length, while Barry Salt apparently has no narrow attachment to a particular statistic of film style and embraces a pluralistic approach. Yet I am not aware of any forum in which Salt has qualified his view that the mean shot length is the only appropriate statistic of film style. In fact, I am unaware of any other statistics of film style used by Salt besides the average shot length and the histogram (while his odd comments on the calculation of kernel density estimates indicate he may not properly understand other methods).
Baxter has his argument back to front here: you won’t find methodological ecumenism in the statistical analysis of film style in the work of Barry Salt.