# Using the ECDF to analyse film style

Last month I looked at using kernel densities to analyse film style. To follow up, this week’s post focuses on another simple graphical method for understanding film style: the empirical cumulative distribution function (ECDF).

Although it has a grand-sounding name, this is a very simple method for getting a lot of information very quickly. Most statistical software packages will calculate the ECDF for you and draw you a graph, but it is also easy to set up an Excel or Calc spreadsheet to do this, since it does not require any special knowledge.

The ECDF gives a complete description of a data set, and is simply the fraction of a data set less than or equal to some specified value. Several plotting positions for the ECDF have been suggested, but here we use the simplest method:

$$F(X) = \frac{\text{number of shots with } x \le X}{N}$$

which means that you count the number of shots (x) less than or equal to some value (X), and then divide by the sample size (N). Do this for every value of x in your data set and you have the ECDF. We can interpret this fraction in several ways: we can think of it as the probability of randomly selecting an x less than or equal to X (P[x ≤ X]); or we can think of it as the proportion of values less than or equal to X; or, if we multiply by 100, the percentage of values in a data set less than or equal to X.
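The count-and-divide rule above can be sketched in a few lines of Python. This is only an illustration: the function name `ecdf` and the shot lengths are my own assumptions, not data from any real film.

```python
def ecdf(shot_lengths, X):
    """Fraction of shots with length less than or equal to X."""
    n = len(shot_lengths)
    # Count the shots x that satisfy x <= X, then divide by the sample size.
    count = sum(1 for x in shot_lengths if x <= X)
    return count / n

# Hypothetical shot lengths in seconds (made up for illustration).
shots = [1.0, 1.1, 1.1, 1.1, 2.5, 4.0, 4.0, 7.2, 12.9, 30.4]

print(ecdf(shots, 0.9))   # no shots are this short
print(ecdf(shots, 1.1))   # four of the ten shots
print(ecdf(shots, 30.4))  # every shot in the sample
```

Multiplying any of these values by 100 gives the percentage reading of the same quantity.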

For example, using the data set for Easy Virtue (1928) from the Cinemetrics database available here we can calculate the ECDF as illustrated in Table 1.

Table 1 Calculating the ECDF for Easy Virtue (1928) (N = 706)

To start, look at the value of X in the first column and then count the number of shots in the film with length less than or equal to that value. The first value is 0.9 but there are no shots this short in the film and so the frequency is zero. Divide this zero by the number of shots in the film (i.e. 706) and you have the ECDF when X = 0.9, which is 0 (because 0 divided by any non-zero number is always 0). Next, X = 1.0 seconds and there is 1 shot less than or equal to this value and so the ECDF at X = 1.0 is 1/706 = 0.0014. Turning to X = 1.1 we see there are three shots that are 1.1 seconds long AND there is one shot that is shorter in length (i.e. the one at 1.0s), and so the ECDF at X = 1.1 is 4/706 = 0.0057. This is equal to the frequency of shots that are 1.0 seconds long divided by N (1/706 = 0.0014) PLUS the frequency of shots that are 1.1 seconds long divided by N (3/706 = 0.0042) – and that is why it’s called the cumulative distribution function. From this point you keep going until you reach the end: the longest shot in the film is given as 66.6 seconds long and so all 706 shots must be less than or equal to 66.6 seconds, and at this value of X the ECDF = 706/706 = 1.0. The ECDF is 1.0 for any value of X greater than the maximum x in the data set.
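The running-total procedure described above is what a spreadsheet would do column by column, and a rough Python equivalent might look like the sketch below. The helper `ecdf_table` and the eight shot lengths are hypothetical; unlike Table 1, the sketch lists only values of X that actually occur in the data.

```python
from collections import Counter

def ecdf_table(shot_lengths):
    """Return rows of (X, frequency, cumulative frequency, ECDF)."""
    n = len(shot_lengths)
    counts = Counter(shot_lengths)      # frequency of each distinct length
    rows = []
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]     # add this frequency to the running total
        rows.append((value, counts[value], cumulative, cumulative / n))
    return rows

# Hypothetical shot lengths in seconds (made up for illustration).
shots = [1.0, 1.1, 1.1, 1.1, 1.3, 2.0, 2.0, 5.0]

for value, freq, cum, f in ecdf_table(shots):
    print(f"{value:5.1f} {freq:4d} {cum:4d} {f:.4f}")
```

The last row always has an ECDF of 1.0, because by then every shot has been counted.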

It really is this easy. And you can get a simple graph of F(x) by plotting x on the x-axis and the ECDF on the y-axis. More usefully, you can plot the ECDFs of two or more films on the same graph so that you can compare their shot length distributions. Figure 1 shows the empirical cumulative distribution functions of Easy Virtue and The Skin Game (1931 – access the data here).

Figure 1 The empirical cumulative distribution functions of Easy Virtue (1928) and The Skin Game (1931)

Now clearly there is a problem with this graph: because the shot length distribution of a film is positively skewed, all the shots are bunched up on the left-hand side of the plot and you cannot see any detail. This can be resolved by redrawing the x-axis on a logarithmic scale, which stretches out the bottom end of the data, where all the detail is, and squashes the top end, which has only a few data points. This can be seen in Figure 2.

Figure 2 The empirical cumulative distribution functions of Easy Virtue (1928) and The Skin Game (1931) on a log-10 scale
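For anyone who would rather script the plot than use a spreadsheet, here is a minimal sketch of how graphs like Figures 1 and 2 could be drawn. It assumes matplotlib is installed; the function names `ecdf_points` and `plot_ecdfs` are my own, and this is not how the original figures were produced.

```python
def ecdf_points(shot_lengths):
    """Sorted shot lengths paired with their cumulative fractions."""
    xs = sorted(shot_lengths)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

def plot_ecdfs(films, log_scale=True):
    """Draw one step curve per film; `films` maps a title to its shot lengths."""
    import matplotlib.pyplot as plt  # imported here so ecdf_points works without it
    fig, ax = plt.subplots()
    for title, shots in films.items():
        xs, ys = ecdf_points(shots)
        ax.step(xs, ys, where="post", label=title)
    if log_scale:
        ax.set_xscale("log")  # only the axis is rescaled; the data are untouched
    ax.set_xlabel("Shot length (s)")
    ax.set_ylabel("ECDF")
    ax.legend()
    return fig
```

Calling `plot_ecdfs({"Film A": shots_a, "Film B": shots_b})` with two lists of shot lengths returns a figure that can be saved with `fig.savefig("ecdf.png")`; passing `log_scale=False` gives the bunched-up version.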

These two graphs present exactly the same information, but at least in Figure 2 we can find the information we want. In transforming the x-axis we have not assumed the shot length distribution of either film follows a lognormal distribution – which is just as well because this is obviously not true for either film.

Now what can we discover about the editing in these two films?

First, it is clear that these two films have the same median shot length because the probability of randomly selecting a shot less than or equal to 5.0 seconds is 0.5 in both films. The median shot length is defined as the value that divides a data set in two, so that half the shots are less than or equal to it and half are greater than or equal to it (i.e. P(x ≤ X) = 0.5). We might therefore conclude that they have the same style. However, these two films clearly have different shot length distributions and it is easier to appreciate this when we combine numerical descriptions with a plot of the actual distributions.
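Reading the median off the ECDF amounts to finding the first value of X at which the ECDF reaches 0.5. A rough sketch follows; the function `median_from_ecdf` and both sets of shot lengths are hypothetical. Note that for an even number of shots this convention returns the lower of the two middle values, whereas many packages average them.

```python
def median_from_ecdf(shot_lengths):
    """Smallest observed X at which the ECDF reaches 0.5."""
    xs = sorted(shot_lengths)
    n = len(xs)
    for i, x in enumerate(xs, start=1):
        if i / n >= 0.5:   # i shots are less than or equal to x
            return x

# Two made-up films with the same median but very different spreads.
film_a = [3.0, 4.0, 5.0, 6.0, 7.0]
film_b = [1.0, 2.0, 5.0, 15.0, 40.0]

print(median_from_ecdf(film_a), median_from_ecdf(film_b))
```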

A basic rule for interpreting the plot of ECDFs for two films is that if the plot for film A lies to the right of the plot for film B then film A is edited more slowly. Obviously this is not so clear cut in Figure 2.

Below the median shot length, the ECDF of The Skin Game lies to the left of that of Easy Virtue, indicating that it has a greater proportion of shots at the low end of the distribution: for example, 25% of the shots in The Skin Game are less than or equal to 2.0 seconds in length compared to just 6% of the shots in Easy Virtue. This would seem to indicate that The Skin Game is edited more quickly than Easy Virtue. At the same time, above the median shot length the ECDF of The Skin Game lies to the right of that of Easy Virtue, indicating that it has a greater proportion of shots at the high end of the distribution: for example, 75% of the shots in Easy Virtue are less than or equal to 8.3 seconds compared to 66% of the shots in The Skin Game, so 34% of the shots in The Skin Game are longer than 8.3 seconds against 25% in Easy Virtue. This would appear to suggest that The Skin Game is edited more slowly than Easy Virtue.

Clearly there is something more interesting going on than indicated by the equality of the medians, and the answer lies in how spread out the shot lengths of these two films are. The ECDF of Easy Virtue is very steep and covers only a limited range of values, whereas the ECDF of The Skin Game covers a much wider range of shot lengths. The interquartile range of Easy Virtue is 5.2 seconds (Q1 = 3.1s, Q3 = 8.3s), indicating the shot lengths of this film are not widely dispersed, while the IQR of The Skin Game is 12.7s (Q1 = 2.0s, Q3 = 14.7s).
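The quartiles can be read off the ECDF in the same way as the median: Q1 is where the ECDF first reaches 0.25 and Q3 is where it first reaches 0.75. The sketch below assumes that convention; the helper `quantile_from_ecdf` and the data are hypothetical, and since statistical packages use a variety of quantile definitions, their answers may differ slightly from this one.

```python
import math

def quantile_from_ecdf(shot_lengths, p):
    """Smallest observed X whose ECDF value is at least p (0 < p <= 1)."""
    xs = sorted(shot_lengths)
    k = math.ceil(p * len(xs))   # number of shots needed to reach fraction p
    return xs[max(k, 1) - 1]

def iqr(shot_lengths):
    """Interquartile range: Q3 minus Q1."""
    return quantile_from_ecdf(shot_lengths, 0.75) - quantile_from_ecdf(shot_lengths, 0.25)

# Two made-up films with the same median (4.5s) but different dispersion.
film_a = [3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0]
film_b = [1.0, 1.5, 2.0, 4.5, 5.0, 12.0, 20.0, 45.0]

print(iqr(film_a), iqr(film_b))
```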

This example is an excellent demonstration of why it is important to always provide a measure of the dispersion of a data set when describing film style. It is not enough to only provide the average shot length since two films may have the same median shot length and completely different editing styles. See here for a discussion of appropriate measures of scale that can be used. It should be standard practice that an appropriate measure of dispersion is cited along with the median shot length for a film by any researcher who wants to do statistical analysis of film style, and journal editors and/or book publishers who receive work where this is not the case should send it back immediately with a note asking for a proper description of a film’s style. If you don’t include any description – either numerical or graphical – of the dispersion of shot lengths in a film then you haven’t described your data properly.

We can also use the ECDFs for two films to perform a statistical test of the null hypothesis that they have the same distribution. This is called the Kolmogorov-Smirnov (KS) test, and the test statistic (D) is simply the maximum value of the absolute differences between the ECDF of one film (F(x)) and the ECDF of another film (G(x)) for every value of x. The ‘absolute difference’ means that you subtract one from the other and then take only the size of the answer and ignore the sign (i.e. ignore whether it’s positive or negative):

$$D = \max_{x} \left| F(x) - G(x) \right|$$

Table 2 shows this process for the two films in Figures 1 and 2.

Table 2 Calculating the Kolmogorov-Smirnov test statistic for the ECDFs of Easy Virtue (1928) and The Skin Game (1931)

In the first column in Table 2 we have the lengths of the shots from the smallest in the two films (0.6 seconds) to the longest (174.7 seconds), and then in columns two and three we have the ECDF of each film. Column four is the difference between the ECDFs of the two films, subtracting the ECDF of The Skin Game from the ECDF of Easy Virtue for every x: so when x = 0.6, we have 0-0.0037 = -0.0037. The final column is the absolute difference, which is just the size of the value in the fourth column and the sign is ignored: the absolute value of -0.0037 is 0.0037. Do this for every value of x and find the largest value in the final column.
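The column-by-column procedure for Table 2 can be sketched as follows. The function `ks_statistic` and the two sets of shot lengths are my own illustrative assumptions, so the numbers it prints have nothing to do with Easy Virtue or The Skin Game.

```python
from bisect import bisect_right

def ks_statistic(sample_f, sample_g):
    """Largest absolute vertical gap between the ECDFs of two samples."""
    f_sorted = sorted(sample_f)
    g_sorted = sorted(sample_g)
    n, m = len(f_sorted), len(g_sorted)
    d_max, x_at_max = 0.0, None
    # Step through every shot length that occurs in either film.
    for x in sorted(set(f_sorted) | set(g_sorted)):
        f_x = bisect_right(f_sorted, x) / n   # ECDF of the first film at x
        g_x = bisect_right(g_sorted, x) / m   # ECDF of the second film at x
        diff = abs(f_x - g_x)                 # ignore the sign
        if diff > d_max:
            d_max, x_at_max = diff, x
    return d_max, x_at_max

# Hypothetical shot lengths in seconds (made up for illustration).
film_a = [3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0]
film_b = [1.0, 1.5, 2.5, 4.5, 5.0, 12.0, 20.0, 45.0]

d, x = ks_statistic(film_a, film_b)
print(f"D = {d:.4f} at x = {x}")
```

If the same maximum gap occurs at more than one x, this sketch reports the smallest such x.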

In the case of these two films the maximum absolute difference occurs when x = 2.0 and is statistically significant (p < 0.01). Therefore we conclude these two films have different shot length distributions. (You may find that different statistical software packages give slightly different answers to this depending on the plotting position used.)
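To turn D into a p-value without specialist software, one widely quoted large-sample approximation uses the Kolmogorov series with a small-sample correction. The sketch below assumes that approximation; the function name `ks_p_value`, the cut-off of 0.2 and the example numbers are my own, and I do not know which method produced the p-value quoted above. If you have SciPy installed, `scipy.stats.ks_2samp` does the whole test for you.

```python
import math

def ks_p_value(d, n, m, terms=100):
    """Approximate two-sided p-value for a two-sample KS statistic d."""
    n_eff = n * m / (n + m)                  # effective sample size
    root = math.sqrt(n_eff)
    lam = (root + 0.12 + 0.11 / root) * d    # corrected, scaled statistic
    if lam < 0.2:
        # The series converges poorly here and the p-value is effectively 1.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(max(2.0 * total, 0.0), 1.0)

# Hypothetical statistic and sample sizes (not taken from the two films).
print(ks_p_value(0.2, 700, 300) < 0.01)
```

Being an asymptotic formula, it is best trusted for films with plenty of shots; for very small samples an exact method is preferable.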

An online calculator for the KS-test that will also draw a plot of the ECDFs can be accessed here, and is accompanied by a very useful explanation. (NB: this only works for data sets up to N = 1024.) Rescaling the x-axis of our plot of the two ECDFs does not affect the KS-test, since the ECDFs are on the y-axis and the D column in Table 2 is the vertical difference between them.
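A quick way to convince yourself that the log scale makes no difference is to compute the largest vertical gap on the raw shot lengths and again on their base-10 logarithms. The compact helper `max_ecdf_gap` and the data below are hypothetical.

```python
import math
from bisect import bisect_right

def max_ecdf_gap(a, b):
    """Largest absolute difference between the ECDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

# Hypothetical shot lengths in seconds (made up for illustration).
film_a = [3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0, 8.0]
film_b = [1.0, 1.5, 2.5, 4.5, 5.0, 12.0, 20.0, 45.0]

raw_gap = max_ecdf_gap(film_a, film_b)
log_gap = max_ecdf_gap([math.log10(x) for x in film_a],
                       [math.log10(x) for x in film_b])
print(raw_gap, log_gap)
```

Taking logs changes where each shot sits on the x-axis but not the order of the shots, so the counts that make up the two ECDFs, and therefore the gap between them, stay the same.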

(There is also a one-sample version of the KS-test for comparing a single distribution to a theoretical distribution to determine goodness-of-fit, but there are so many other methods that do exactly the same thing better that it’s not worth bothering with.)

The ECDF is very easy to calculate, the graph is very easy to produce and provides a lot of information about a data set for very little effort, and the KS-test is also a very simple way of comparing two data sets. There is no bewildering mathematics involved: just count, divide, add, subtract, and ignore. The statistical analysis of film style really is this easy.