# Testing normality in cinemetrics

A key indicator of film style is the distribution of shot lengths in a motion picture, which may be used to identify similarities and differences in the style of individual filmmakers, historical periods, genres, and national cinemas. Shot length distributions are typically characterised by two features: (1) they are positively skewed, and (2) they have a number of outlying data points. Consequently, the assumption of a normal distribution required by parametric statistical tests cannot be met, while the positive skew of the data suggests that shot lengths may be log normally distributed. Here the probability plot correlation coefficient is used as a test statistic for normal and log normal distributions for three films directed by Charles Chaplin, to determine if the assumption of a log normal distribution of shot lengths in motion pictures is valid.

## Probability plot correlation coefficient

Parametric statistical tests assume an underlying distribution specified by one or more parameters (such as the mean and the standard deviation), and where this assumption is violated the results of such tests will be unreliable due to a loss of statistical power (Yu 2002). It is therefore necessary to test if such an assumption is valid before proceeding to analyse the data. The probability plot correlation coefficient (PPCC) is a test statistic of the linearity of the relationship between two variables, and can be used to test for both normal and log normal distributions (see Filliben 1975; Looney and Gulledge 1985). The null hypothesis for the PPCC test is that the data are normally distributed, and the PPCC test statistic is

$$\mathrm{PPCC} = \frac{\sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{x})^2 \, \sum_{i=1}^{n} (Y_i - \bar{y})^2}}$$

where *X* and *Y* are the observed and expected paired values, and *x-bar* and *y-bar* are the means of the observed and expected values. Where PPCC = 1 the data are perfectly normally/log normally distributed, while PPCC = 0 indicates no correlation. The PPCC is compared to a critical value for a specified level of significance (α) and sample size (*n*). If the PPCC is less than the critical value, the null hypothesis that the data are normally/log normally distributed is rejected. Lookup tables typically give critical values for sample sizes from *n* = 3 to *n* = 50, and then at intervals of 5, 10, and 50; but approximate critical values for *n* are given by

The PPCC provides both a quantitative and a graphical representation of goodness-of-fit. To produce a probability plot, the order statistics of the observed values (or the transformed order statistics) are plotted against an inverse function of the plotting position given by

where *i* is the rank of the ordered value. If the data is from a normal or log normal distribution with a PPCC near 1, the probability plot of the ordered values will be an approximately straight line and so the linearity of the probability plot is a good indicator of distributional fit. Where data is from an alternative distribution, it will produce a curved probability plot.
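The calculation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used for the results reported here: the function names are hypothetical, and the plotting position is assumed to be Blom's (*i* − 0.375)/(*n* + 0.25), which may differ from the formula used in this post. The critical value is not computed and must be supplied from a published lookup table.

```python
from math import sqrt
from statistics import NormalDist, mean


def expected_normal_scores(n, a=0.375):
    """Expected z-scores for ranks 1..n.

    ASSUMPTION: Blom's plotting position (i - 0.375) / (n + 0.25) is used
    purely for illustration; substitute whichever position formula you need.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]


def ppcc(values):
    """Correlation between the ordered data and their expected z-scores.

    Needs at least two distinct values, otherwise the denominator is zero.
    """
    observed = sorted(values)
    expected = expected_normal_scores(len(observed))
    x_bar, y_bar = mean(observed), mean(expected)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(observed, expected))
    denominator = sqrt(
        sum((x - x_bar) ** 2 for x in observed)
        * sum((y - y_bar) ** 2 for y in expected)
    )
    return numerator / denominator


def reject_null(values, critical_value):
    """True if the PPCC falls below a critical value taken from a lookup table."""
    return ppcc(values) < critical_value
```

To test for a log normal distribution with this sketch, the same functions would be applied to the logarithms of the data rather than to the raw values.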

## The distribution of shot lengths in the films of Charles Chaplin

In order to test the validity of assuming a log normal distribution for shot lengths, three films written and directed by Charles Chaplin – *The Rounders* (1914), *A Night Out* (1915), and *The Immigrant* (1917) – were selected from the cinemetrics database (Leipa 2006a, 2006b; O’Brien 2008). As the films are shorts, samples were not drawn and the data is uncensored. The distribution of shot lengths in all three films is positively skewed and each film has a number of outlying data points (see Table 1).

Probability plots were constructed and the corresponding PPCCs were determined for shot length data for each film using the process outlined in Jacobs and Dinman (2004). Shot length data was collected and rank-ordered within each data set. An expected standard normal score (z-score) for each shot length was calculated from the inverse standard normal distribution function for a given plotting position of each shot length. The paired data (expected z-score, shot length) was then plotted on a graph, with a linear trend line fitted onto the data. This process was performed on untransformed shot length data (*Xi*) and on the common logarithm (log10(*Xi*)) of the data. The PPCC and approximate critical values for rejection (α=0.05) are reported in Table 1.
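The steps above (rank-order the data, compute an expected z-score for each rank, pair the values, and fit a linear trend line) might look roughly like the following sketch. The shot lengths are invented for illustration and are not taken from any of the three films; the function name is hypothetical, and Blom's plotting position is again an assumption rather than the formula used for Table 1.

```python
from math import log10, sqrt
from statistics import NormalDist, mean


def probability_plot(shot_lengths, transform=None):
    """Return the plot points, least-squares trend line, and PPCC."""
    # Rank-order the (optionally transformed) shot lengths.
    values = sorted(transform(x) if transform else x for x in shot_lengths)
    n = len(values)
    # ASSUMPTION: Blom's plotting position (i - 0.375) / (n + 0.25).
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    z_bar, v_bar = mean(z), mean(values)
    sxy = sum((a - z_bar) * (b - v_bar) for a, b in zip(z, values))
    sxx = sum((a - z_bar) ** 2 for a in z)
    syy = sum((b - v_bar) ** 2 for b in values)
    slope = sxy / sxx
    return {
        "points": list(zip(z, values)),        # (expected z-score, shot length)
        "slope": slope,                        # linear trend line
        "intercept": v_bar - slope * z_bar,
        "ppcc": sxy / sqrt(sxx * syy),
    }


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.2, 1.9, 2.4, 2.8, 3.1, 3.7, 4.4, 5.0, 6.3, 7.9, 9.6, 12.5, 18.2, 31.0]

raw_plot = probability_plot(shots)                   # untransformed data
log_plot = probability_plot(shots, transform=log10)  # common logarithm of the data
```

For positively skewed data such as this made-up sample, the PPCC of the log-transformed values is closer to 1 than that of the raw values, which is the pattern a log normal distribution would produce.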

**Table 1** Summary of three films directed by Charles Chaplin

As expected, Table 1 shows that shot lengths are not normally distributed in any of the three Chaplin films. The shot lengths of *The Rounders* and *A Night Out* are log normally distributed, but those of *The Immigrant* are not. These results can be clearly seen in the probability plots for each film using untransformed data (Figures 1a, 2a, and 3a) and the common logarithm of the data (Figures 1b, 2b, and 3b).

**Figure 1a** Probability plot of shot length data (*Xi*) for *The Rounders* (1914)

**Figure 1b** Probability plot of shot length data (log10(*Xi*)) for *The Rounders* (1914)

**Figure 2a** Probability plot of shot length data (*Xi*) for *A Night Out* (1915)

**Figure 2b** Probability plot of shot length data (log10(*Xi*)) for *A Night Out* (1915)

**Figure 3a** Probability plot of shot length data (*Xi*) for *The Immigrant* (1917)

**Figure 3b** Probability plot of shot length data (log10(*Xi*)) for *The Immigrant* (1917)

As the shot lengths of both *The Rounders* and *A Night Out* are log normally distributed, parametric statistical tests could be used to analyse the distributions of these two films. However, *The Immigrant* could not be analysed in the same way, as the assumption of a log normal distribution is not met. Because this requirement is violated, applying parametric tests to the distribution of shot lengths in this film will produce misleading results. Specifically, parametric tests will lack statistical power when applied to *The Immigrant*, and the probability of failing to detect a difference where one exists (Type II error) is increased.

Where the data does not fit a theoretical distribution, *nonparametric* statistical tests should be used. Nonparametric tests require fewer assumptions about the data, and as they do not rely on the underlying distribution they are often referred to as *distribution-free* (see Gibbons 1993). Nonparametric tests can be applied to all distributions (including log normal), and rather than use parametric tests for some films and nonparametric tests for others, it is better to use nonparametric tests in all cases. An analysis of Chaplin’s films that used different sets of statistical tests for different films would not allow the distributions of shot lengths in all the films to be compared with one another, and the conclusions drawn from such an analysis would not be credible. Some nonparametric tests for the analysis of shot length distributions are listed in Table 2.

**Table 2** Some nonparametric statistical tests for shot length distributions

Salt (2006: 389-396) makes a similar argument regarding the log normality of shot length distributions using the *coefficient of determination* (*R-squared*) to test goodness-of-fit. For simple linear regression, *R-squared* is the square of the correlation coefficient and indicates the proportion of the variance of the distribution of shot lengths that is predicted by the theoretical log normal distribution. (In Figures 1a-3b, the fit of the linear trend line to the data is described by *R-squared*). Salt concludes that some films are log normally distributed while others are not, and this is confirmed by the results in Table 1. He does not make any argument regarding the use of parametric and/or nonparametric tests in cinemetrics where the assumption of log normality is not met.
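The relationship between *R-squared* and the correlation coefficient can be checked numerically. The sketch below fits a least-squares trend line to a small set of invented paired values (not data from any film) and computes *R-squared* from the residuals and, separately, the correlation coefficient; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean


def trend_line_fit(x, y):
    """Return (r, R-squared) for a least-squares line fitted to paired data."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R-squared: proportion of the variance in y accounted for by the line.
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1 - ss_residual / syy
    # Correlation coefficient, computed independently of the fitted line.
    r = sxy / sqrt(sxx * syy)
    return r, r_squared


# Hypothetical paired values (expected z-score, log shot length).
x = [-1.5, -0.8, -0.3, 0.0, 0.3, 0.8, 1.5]
y = [0.2, 0.5, 0.6, 0.8, 0.9, 1.0, 1.4]
r, r_squared = trend_line_fit(x, y)
```

Up to floating-point rounding, `r ** 2` and `r_squared` agree, so for a probability plot with a linear trend line the *R-squared* value is simply the square of the PPCC.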

## Conclusion

Parametric statistical tests assume that sample data is drawn from a specified underlying distribution. Shot length data for motion pictures is typically not normally distributed, although in some cases it may be log normally distributed. This is not the case for all films (even though the data is positively skewed), and so the assumption of a log normal distribution is not universally valid. Taking into account the variability of shot length distributions, it is recommended that nonparametric tests, which do not rely on the underlying distribution of the data, be used in analysing film style.

## References

Filliben, J.J. (1975) The probability plot correlation coefficient test for normality, *Technometrics* 17 (1): 111-117.

Gibbons, J.D. (1993) *Nonparametric Statistics: An Introduction*. Newbury Park, CA: Sage.

Jacobs, J.L. and Dinman, J.D. (2004) Systematic analysis of bicistronic reporter assay data, *Nucleic Acids Research* 32 (20): e160.

Leipa, T. (2006a) *The Rounders*, Cinemetrics Database, http://www.cinemetrics.lv/movie.php?movie_ID=306, accessed 19 November 2008.

Leipa, T. (2006b) *A Night Out*, Cinemetrics Database, http://www.cinemetrics.lv/movie.php?movie_ID=254, accessed 19 November 2008.

Looney, S.W., and Gulledge, T.R. (1985) Use of the correlation coefficient with normal probability plots, *The American Statistician* 39 (1): 75-79.

O’Brien, C. (2008) *The Immigrant*, Cinemetrics Database, http://www.cinemetrics.lv/movie.php?movie_ID=1055, accessed 9 December 2008.

Salt, B. (2006) *Moving into Pictures: More on Film History, Style, and Analysis*. London: Starwood.

Yu, C.H. (2002) An overview of remedial tools for violations of parametric test assumptions in the SAS system, *Proceedings of 2002 Western Users of SAS Software Conference*. Cary, NC: SAS Institute, Inc.: 172-178. Available online: http://www.creative-wisdom.com/pub/parametric_WUSS2002.pdf, accessed 10 December 2008.

Posted on February 19, 2009, in Charles Chaplin, Cinemetrics.

Since two of these films are lognormally distributed (by your criterion), and the other one is pretty close, I think it needs a bigger sample for you to make any assertions about the generality of shot length distributions.

Why don’t you check a bigger sample from the Cinemetrics database?

I am guessing that you have access to an expensive statistics package that you can pour them into. (I don’t.)

My sample in “The Numbers Speak” is not anything like a representative sample, since it contains a much too large proportion of long take films.

A full set of reliable (though not frame-accurate) early Chaplins (Keystone, Essanay, Mutual) is now available at Cinemetrics.lv.

NB your bibliography: Torey Liepa, not Leipa
