Using box plots to analyse film style

Numerical descriptions of film style are valuable, but it is often simpler and more informative to use graphical representations of shot length data when analysing film style. Following on from earlier posts on using kernel densities (here) and cumulative distribution functions (here), this post rounds out this short series by looking at box plots and vioplots. Potter (2006) provides a detailed survey of the methodology of constructing and interpreting box plots and a discussion of extensions and alternatives.

Box plots are an excellent method for conveying a large amount of information about a data set quickly and clearly, and do not require any prior assumptions about the distribution of the data. Analysing the box plots of shot lengths in motion pictures, we can compare the centre and variation of the data, and identify the skew and the presence of outliers. They are also an efficient method of comparing multiple data sets: placing the box plots for two or more films side-by-side allows us to compare the centre and variation of shot length distributions directly and intuitively.

The box plot provides a graphical representation of the five-number summary: the minimum value, the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum value of a data set. The core of the data is defined by the box, which covers the distance between the lower and upper quartiles (i.e. the interquartile range, or IQR), and the horizontal line within the box represents the median value of the data. The inner fences are marked by error bars extending from the box, and data points beyond these limits are classed as outliers. At the upper end, an outlier is defined as a value greater than Q3 + (IQR × 1.5) and an extreme outlier as a value greater than Q3 + (IQR × 3). Typically, there are no outliers at the low end of a shot length distribution, and the lower error bar descends to the value of the shortest shot in a film.
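The definitions above can be made concrete with a short Python sketch. It is an illustration only: the shot lengths are invented rather than taken from the ITV bulletins, the function name box_plot_summary is hypothetical, and the quartiles use the "inclusive" method of Python's statistics.quantiles, which is one of several quartile conventions and may give slightly different values from other statistical software.

```python
from statistics import quantiles


def box_plot_summary(shot_lengths):
    """Five-number summary, IQR, upper fences, and flagged outliers.

    Hypothetical helper for illustration; shot lengths are in seconds.
    """
    data = sorted(shot_lengths)
    # Assumption: "inclusive" quartiles. Other conventions exist and
    # will shift Q1 and Q3 slightly for small samples.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_fence = q3 + 1.5 * iqr  # values above this are outliers
    outer_fence = q3 + 3.0 * iqr  # values above this are extreme outliers
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        "inner_fence": inner_fence,
        "outer_fence": outer_fence,
        # The upper error bar is conventionally drawn to the longest
        # shot that is not classed as an outlier.
        "upper_whisker": max(x for x in data if x <= inner_fence),
        "outliers": [x for x in data if inner_fence < x <= outer_fence],
        "extreme_outliers": [x for x in data if x > outer_fence],
    }


# Invented shot lengths (seconds), not data from any real bulletin.
example_shots = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0, 40.0]
summary = box_plot_summary(example_shots)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For these invented values the 20 second shot lies between the two fences and is flagged as an outlier, while the 40 second shot lies beyond the outer fence and is flagged as an extreme outlier.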

To illustrate, Table 1 presents the descriptive statistics for the three main ITV news bulletins broadcast on 10 August 2011. There is nothing wrong with this information, and we can see immediately that these bulletins have similar styles. They have similar medians, indicating they are cut equally quickly, whilst the lunchtime bulletin has slightly more variation in the middle 50 per cent of its shots than the other two bulletins. We can also see that the distributions of shot lengths in these bulletins are asymmetric and that the maximum values are much longer than the other shots. However, we cannot tell if these maximums are isolated outliers or if there are a large number of such values.

Table 1 Descriptive statistics of ITV news bulletins broadcast on 10 August 2011

Figure 1 presents the box plots of these bulletins, and gives us some of the detail we are looking for. We can see the same information we get in Table 1, but it is easier to make the comparisons across a single scale than to try to imagine the distribution from a set of numbers. We can also see that these bulletins share some other features: the error bars extend a similar distance from the upper quartile, with shots in this range (10-18 seconds) associated with short interviews with members of the public, while the clusters of outliers that can be seen for each bulletin in the range 18-30 seconds are associated with the news kernel that begins each item and with longer interviews as part of a news report. Longer takes occur when a reporter is speaking directly to camera, typically as part of a two-way interview. We can therefore see that similar events in the discourse structure of these news bulletins occupy a similar amount of screen time within the same bulletin and across the bulletins broadcast on the same day. You cannot tell that from the five-number summary. This is a crucial advantage of using graphical methods alongside numerical summaries: they can be used analytically as well as descriptively. You can learn more from Figure 1 than you can from Table 1, though it would be best to include both in a piece of research since knowing the actual values of the descriptive statistics is useful to the reader.

Figure 1 Box plots of three ITV news bulletins

By using a box plot we can see some of the structure of the data obscured by the five-number summaries. However, one of the problems with box plots is that they flatten out the detail of the distribution in the box and between the box and the ends of the error bars. This can be remedied by combining box plots with a kernel density to produce a vioplot. This has the advantage of making all the information available from these two types of plots in a single figure. Figure 2 presents the vioplots of these bulletins.

Figure 2 Vioplots of three ITV news bulletins

From Figure 2 we can see all the detail from the box plots, and we can also see the density of shot lengths in those areas where the box plot provides no detail. For example, the similarities in the 10-18 second range are more apparent in Figure 2 than in Figure 1. For an alternative way of combining box plots and kernel densities to describe these data sets see here.
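As a rough sketch of what the density part of a vioplot involves, the Python below evaluates a Gaussian kernel density estimate on an even grid between the shortest and longest shot. A vioplot mirrors these density values either side of a central axis and overlays the box plot. This is not the procedure used to produce Figure 2: the function names gaussian_kde and violin_outline are hypothetical, the shot lengths are invented, and the bandwidth of 1.5 seconds is an arbitrary assumption chosen only to make the example run.

```python
from math import exp, pi, sqrt


def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density of `data` at x.

    Hypothetical helper: a plain Gaussian kernel with a fixed bandwidth.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return norm * sum(
            exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


def violin_outline(data, bandwidth, points=50):
    """Evaluate the density at `points` (>= 2) evenly spaced values.

    Returns (shot length, density) pairs from the shortest to the
    longest shot; a vioplot draws each density as a half-width.
    """
    lo, hi = min(data), max(data)
    step = (hi - lo) / (points - 1)
    kde = gaussian_kde(data, bandwidth)
    grid = [lo + i * step for i in range(points)]
    return [(x, kde(x)) for x in grid]


# Invented shot lengths (seconds), not data from any real bulletin.
example_shots = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0, 40.0]
# Assumption: a 1.5 second bandwidth, picked arbitrarily.
for length, dens in violin_outline(example_shots, bandwidth=1.5, points=10):
    print(f"{length:6.2f} s  density {dens:.4f}")
```

The choice of bandwidth controls how smooth the outline is, so in practice it should be reported alongside the plot.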

It has become increasingly common for film scholars to cite average shot lengths, but this information is rarely useful to the reader. It is usually the wrong average, is unaccompanied by a measure of dispersion, and simply does not provide enough information for anyone to make a sensible judgement about the nature of a film’s style. If you do want to use statistics to make a point about film style then please include kernel densities, cumulative distribution functions, or box/vioplots so that we can see what you are talking about. This should be standard practice in research and publishing in film studies.

References

Potter K 2006 Methods for presenting statistical information: the box plot, in H Hagen, A Kerren, and P Dannenmann (eds.) Visualization of Large and Unstructured Data: Lecture Notes in Informatics GI-Edition S-4: 97–106.