# Category Archives: Charles Chaplin

## Estimating shot length distributions

One of the problems we encounter when researching film style is that different versions of the same film exist. For example, in 2008 a version of Fritz Lang’s *Metropolis* (1927) was discovered in Argentina that was approximately 25 minutes longer than previously known versions. The official site for the restored version is here. This makes the statistical analysis of film style difficult, because we have to face the fact that the version of a film we are analysing may not be the film as it was produced.

We may come across different versions of the same film for a variety of reasons:

- Different versions of the same film may be released in different countries.
- ‘Director’s cut’ versions raise the question as to what we should call the definitive version: I have five different versions of *Blade Runner* (Ridley Scott, 1982) on DVD – the original 1982 US theatrical release, the 1982 international theatrical release, the work print, the 1992 Director’s Cut, and the 2007 Final Cut. We could simply note which version our data represents and leave it at that. However, we are faced with the problem that if we wish to look at the shot length distributions of Hollywood movies in the early-1980s or the films of Ridley Scott, which version should we pick? Should we pick the version with the voice-over that uses shots left over from *The Shining*, or the one without the voice-over but with the unicorn dream sequence? Does the fact that the film was re-edited for the 2007 cut invalidate it when it comes to looking at 1980s Hollywood, even though all the material used in this cut was shot in the early-1980s? Is the Final Cut version an example of Hollywood cinema from the early-1980s, or of the mid-2000s? Did Scott’s editing style change between 1982 and 2007 so that these two versions cannot be simply compared?
- The version of a film released for home viewing is often different to the theatrical release due to the requirements of classification boards or censors. Another factor here may be corporate taste: historically, some home cinema outlets have edited their tapes to maintain a family-friendly corporate image by removing scenes of gore, violence, and/or sex. Rather than distinguish between different versions we tend to treat the domestic and theatrical releases as being one and the same, when in fact our data may show some discrepancies.
- Although it is unlikely to be a significant factor in the 21st century, pan-and-scan may affect the number of shots in a film.
- When working with silent films it is often difficult to find complete versions of the films, and some frames, shots, scenes, or even reels may be missing. We will therefore be working with only a partial data set. This problem can be compounded by the release of restored versions that are built up from several prints. It is highly likely that we have not seen (and probably never will see) the original version of many of the silent films that we take for granted on DVD.

There are also other sources of measurement error that can affect our research:

- When dealing with wipes, dissolves, fades, and irises, should we record the end of one shot and the beginning of the next at the beginning of the transition, the end of the transition, or in the middle of the transition? I always prefer the last of these options, and try to identify the middle frame of the edit, but I cannot speak for other researchers.
- Identifying the correct running speed for silent movies is problematic. Silent movies were often shot at 18 or 20 fps, while we view and analyse them at 25/30 fps.
- Data from the Cinemetrics database will also contain errors due to the performance of the researcher, and so the figures quoted will only ever be estimates.
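The first two sources of error above can be made concrete with a short Python sketch. The function names and the frame numbers are my own illustrations and are not taken from any data set: the first function logs a cut at the middle frame of a transition, and the second shows how the duration of the same shot changes with the assumed running speed.

```python
def transition_midpoint(first_frame: int, last_frame: int) -> int:
    """Middle frame of a transition that occupies first_frame..last_frame."""
    return (first_frame + last_frame) // 2


def shot_seconds(frame_count: int, fps: float) -> float:
    """Duration in seconds of a shot of frame_count frames run at fps."""
    return frame_count / fps


# Hypothetical dissolve occupying frames 1200 to 1223.
cut_frame = transition_midpoint(1200, 1223)
print("cut logged at frame", cut_frame)

# Hypothetical 90-frame shot: the measured length depends on the running speed.
for fps in (18, 20, 25):
    print(fps, "fps:", round(shot_seconds(90, fps), 2), "s")
```

Whatever convention is chosen, the important thing is that it is applied consistently and reported alongside the data.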

It is wrong to state, as Barry Salt did recently (here), that we do not need to employ the full range of statistical methods and that such methods are ‘misleading’ and ‘irrelevant.’ It is necessary to deal with these issues in order to present the best analysis we can, and that means we need to be able to deal with the error present in our estimates. Even though the data we collect may be accurate to the frame, we will still have to deal with the existence of multiple versions, missing shots, different methods of data collection, etc. If you are going to analyse film style statistically, then at some point you are going to have to do some statistics.

This post focusses on the particular problem of dealing with silent films that have been restored. In 2009 and 2010 I added two posts to this blog looking at the shot length distributions of the Keystone films starring Charlie Chaplin (here and here). Since then, the BFI has released its *Chaplin at Keystone* DVD (see here). How do the shot length distributions of these restored versions compare to the original data I used in 2009? (For ease of understanding, when I refer to ‘original’ I mean the 2009 data and when I refer to ‘restored’ I mean data derived from the BFI DVD).

To date I have only looked at data from four films, though I hope to get around to the rest some time later this year. The four films are *The Masquerader*, *The Rounders*, *The New Janitor*, and *Getting Acquainted*. In the original data I removed the credit titles *and* the expository and dialogue titles, and for the sake of consistency I have done so here with the restored versions. However, in the Excel file at the end of this post that includes the shot length data from the restored versions of these four films I have excluded the credit titles but left in the other titles, as indicated by ‘T.’

The descriptive statistics of the original and restored versions of *The Masquerader* are presented in Table 1 and the empirical cumulative distribution functions are presented in Figure 1.

**Table 1** Descriptive statistics of the original and restored versions of *The Masquerader* (Charles Chaplin, 1914)

From Table 1 we can see that the original estimate of the median shot length (3.7s [95% CI: 2.8, 4.6]) is consistent with the revised estimate (4.5s [95% CI: 2.7, 6.3]). However, there is a large difference in the dispersion of shot lengths as indicated by the increase in the upper quartile and the interquartile range. This indicates that the version of *The Masquerader* from which the original data was taken is less consistent in the upper part of the distribution, although a two-sample Kolmogorov-Smirnov test indicates there is no statistically significant difference (*D* = 0.1485, *p* = 0.265).
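For readers who want to try this kind of comparison themselves, here is a minimal sketch of a two-sample Kolmogorov-Smirnov test using only the Python standard library. It is not the software I used to produce the figures above. The function name `ks_two_sample` and the shot lengths are invented for illustration, and the p-value is an asymptotic approximation (with a commonly used small-sample adjustment), so it will differ somewhat from exact values for short films.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the largest vertical distance between the two empirical
    cumulative distribution functions; p is an asymptotic approximation.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # The largest gap between two step functions occurs at an observed value.
    d = max(abs(bisect_right(a, x) / n - bisect_right(b, x) / m) for x in a + b)
    # Effective sample size with a small-sample adjustment (an approximation).
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        # The series below converges slowly here, and p is effectively 1.
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Hypothetical shot lengths in seconds, for illustration only.
original = [1.9, 2.4, 2.8, 3.1, 3.7, 4.0, 4.6, 5.5, 7.2, 9.8]
restored = [2.0, 2.6, 3.3, 3.9, 4.5, 5.1, 6.3, 8.0, 10.4, 15.2]
d, p = ks_two_sample(original, restored)
print(f"D = {d:.4f}, p = {p:.3f}")
```

The statistic *D* is simply the largest vertical gap between the two curves in a plot such as Figure 1, which is why the test and the empirical cumulative distribution functions are best read together.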

**Figure 1** Empirical cumulative distribution functions of shot lengths in the original and restored versions of *The Masquerader* (Charles Chaplin, 1914)

Looking at the same information for *The Rounders* (Table 2 and Figure 2), we note that there is a much larger discrepancy between the two versions of this film. The original estimate of the median shot length was 3.6s (95% CI: 2.5, 4.7), and the revised estimate is 5.0s (95% CI: 3.5, 6.5). Again there is a large increase in the dispersion of shot lengths, and this is also more marked in the upper part of the distribution. Again, we find that a two-sample Kolmogorov-Smirnov test indicates there is no statistically significant difference between the two distribution functions (*D* = 0.1922, *p* = 0.087).

**Table 2** Descriptive statistics of the original and restored versions of *The Rounders* (Charles Chaplin, 1914)

**Figure 2** Empirical cumulative distribution functions of shot lengths in the original and restored versions of *The Rounders* (Charles Chaplin, 1914)

There are no such large differences between the versions of *The New Janitor* (Table 3 and Figure 3). The medians are consistent, with only a small change in the estimate from 3.5s (95% CI: 2.4, 4.5) to 4.2s (95% CI: 3.2, 5.1). There is also a small increase in the interquartile range, and this is accounted for by the small difference between the upper quartiles. However, this difference is not comparable to those observed in the cases of *The Masquerader* and *The Rounders*, and the cumulative distribution functions indicate that the two versions have the same distribution of shot lengths (Kolmogorov-Smirnov: *D* = 0.1184, *p* = 0.515).

**Table 3** Descriptive statistics of the original and restored versions of *The New Janitor* (Charles Chaplin, 1914)

**Figure 3** Empirical cumulative distribution functions of shot lengths in the original and restored versions of *The New Janitor* (Charles Chaplin, 1914)

The two versions of *Getting Acquainted* (Table 4 and Figure 4) show only a small difference in the upper quartile and the interquartile range, but otherwise the two sets of shot length data are consistent (Kolmogorov-Smirnov: *D* = 0.0622, *p* = 0.978). The original estimate of the median is 3.9s (95% CI: 3.3, 4.5) and the revised estimate is 4.0s (95% CI: 3.3, 4.7), so these are nearly identical.

**Table 4** Descriptive statistics of the original and restored versions of *Getting Acquainted* (Charles Chaplin, 1914)

**Figure 4** Empirical cumulative distribution functions of shot lengths in the original and restored versions of *Getting Acquainted* (Charles Chaplin, 1914)

Although I have looked at just four films here we can see that generally the difference in the median shot lengths is small for three of the films and would not substantially change how we interpret this information – though the increase in the dispersion of the upper part of the distribution for the restored version of *The Masquerader* is a good example of why it is not enough to refer only to measures of location in the analysis of film style. We must also look at dispersion. The difference between the two versions of *The Rounders* will obviously lead us to reconsider our conclusions based on this data. Hopefully, when I have finally completed transcribing the data for the other Chaplin Keystones from the restored version, a clearer understanding of how to deal with different estimates of the shot lengths in a motion picture will emerge.
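The point about location and dispersion can be illustrated with a small Python sketch. The two lists below are made up so that they share a median but differ in the upper quartile. The function name `describe` and the choice of the `inclusive` quantile method are my own assumptions, and other software may calculate quartiles slightly differently.

```python
import statistics


def describe(shot_lengths):
    """Median, quartiles and interquartile range of shot lengths (seconds)."""
    q1, q2, q3 = statistics.quantiles(shot_lengths, n=4, method="inclusive")
    return {"Q1": q1, "median": q2, "Q3": q3, "IQR": q3 - q1}


# Two hypothetical versions of a film with the same median shot length.
version_a = [1, 2, 3, 4, 5]
version_b = [1, 2, 3, 6, 11]

print(describe(version_a))
print(describe(version_b))
```

Both versions have a median of 3 seconds, but the interquartile range doubles from 2 to 4 seconds, a difference that the median alone would hide.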

The shot length data for the restored versions of *The Masquerader*, *The Rounders*, *The New Janitor*, and *Getting Acquainted* can be accessed as an Excel 2007 (.xlsx) file here: Nick Redfern – BFI Restored Chaplin 1. This data was collected by loading the films into Magix Movie Edit Pro 14 at 25 fps, and has been corrected by multiplying each shot length by 25/24.
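The 25/24 correction is a simple rescaling, sketched below in Python. The function name and the example values are hypothetical. The default factor is the one described above, and a different factor would be needed for material running at another speed.

```python
def correct_running_speed(shot_lengths, factor=25 / 24):
    """Rescale measured shot lengths (seconds) by a running-speed factor."""
    return [length * factor for length in shot_lengths]


# Hypothetical measured shot lengths in seconds.
measured = [2.4, 4.8, 12.0]
print([round(x, 2) for x in correct_running_speed(measured)])
```

Because every shot is multiplied by the same constant, the correction changes the median and the quartiles by that factor but leaves the shape of the distribution untouched.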

## Shot length distributions in the early films of Charles Chaplin

Towards the end of 2008 I wrote this short piece comparing the shot lengths of four films directed by Charles Chaplin, and submitted it to *In Short*, an online journal at the University of Miami, where it was accepted for publication. Like many online journals, *In Short* appears to have contributed more to the CVs of those on its editorial board than it has to scholarship, and its website has now disappeared without any communication as to when (or if) anything will be published or any response to my queries as to what has happened. So to put this piece out into the public domain I have included it here, and as usual you can download the pdf file while the abstract is below: Nick Redfern – Shot length distributions in the early films of Charles Chaplin

## Abstract

The distribution of shot lengths in a motion picture is an indicator of film style, and is typically positively skewed with a number of outlying data points. Consequently, assumptions about the distribution of data for parametric statistics cannot be met and nonparametric tests are preferred for analysing quantifiable aspects of film style. This study uses nonparametric statistics as a method of comparing the distribution of shot lengths in motion pictures. Four films directed by Charles Chaplin from 1914 and 1915 were analysed to determine if the distribution of shot lengths was consistent in the works of a single director over time. Two-sample Kolmogorov-Smirnov tests failed to identify a significant difference in films directed by Chaplin in the same year, but did identify significant differences in films directed by Chaplin in different years. These results may be accounted for by Chaplin’s move from the Keystone Film Company to the Essanay Film Manufacturing Company, suggesting that studio is a determining factor in film style at this stage of Chaplin’s career.

I came across a useful paper on interpreting graphs such as the one I use in the above paper, and this is worth reading: Herman Callaert, Nonparametric hypotheses for the two-sample location problem, *Journal of Statistics Education* 7 (2) 1999: http://www.amstat.org/publications/jse/secure/v7n2/callaert.cfm.

I’ve also just noted that there is a paper on the use of nonparametric tests in the latest issue of the same journal: Dwayne R Derryberry, Sue B Schou, and WJ Conover, Examples: Teaching rank-based tests by emphasizing structural similarities to corresponding parametric tests, *Journal of Statistics Education* 18 (1) 2010: www.amstat.org/publications/jse/v18n1/derryberry.pdf.

## Shot Length Distributions in the Chaplin Keystones

This week I have another draft of a Cinemetrics paper, this time looking at shot length distributions in Keystone films starring Charles Chaplin and directed by Chaplin, Mack Sennett, Mabel Normand, George Nichols, and Henry Lehrman. You can download the pdf here: Nick Redfern – Shot Length Distributions in the Chaplin Keystones, and the abstract is given below.

Cinemetrics provides an objective method by which the stylistic characteristics of a filmmaker may be identified. This study uses shot length distributions as an element of film style in order to analyse the films by five directors featuring Charles Chaplin for the Keystone Film Company. A total of 17 Keystone films are analysed – six directed by Chaplin himself, along with others directed by Henry Lehrman, George Nichols, Mabel Normand, and Mack Sennett. Shot length data was collected for each film and then combined to create data sets based on the studio style and for each director. The results show that for the distribution of shot lengths in Keystone films starring Chaplin (1) there is no significant difference between films directed by Chaplin and the overall Keystone model; (2) there is no significant difference between Chaplin’s films and those of Lehrman, Nichols, and Sennett; (3) there is a significant difference between the films of Normand and the Keystone model but the effect size is small; and (4) there is a significant difference between Normand and the other Keystone filmmakers but the effect size of these differences is again small. This study shows that the distribution of shot lengths can be used to identify how the style of an individual filmmaker relates to a larger group style; and that, in the specific case of the Keystone Film Company, it is the studio style of fast-paced, slapstick comedy that determines the distribution of shot lengths with little variation present in the films of individual filmmakers.

As before, any comments and suggestions are welcome (as is the pointing out of glaring errors).

The raw data was collected by examining the films frame by frame in my editing software, and can be accessed in a Microsoft Word document here:

For Microsoft Word 97-2003 (.doc): Nick Redfern – Shot length distributions in the Chaplin Keystones – data

For Microsoft Word 2007 (.docx): Nick Redfern – Shot length distributions in the Chaplin Keystones – data

## Testing normality in cinemetrics

A key indicator of film style is the distribution of shot lengths in a motion picture, which may be used to identify similarities and differences in the style of individual filmmakers, historical periods, genres, and national cinemas. Shot length distributions are typically characterised by two features: (1) they are positively skewed, and (2) they have a number of outlying data points. Consequently, the assumption of a normal distribution for parametric statistical tests cannot be met; while the positive skew of the data suggests that shot lengths may be log normally distributed. The probability plot correlation coefficient is used as a test statistic of normal and log normal distributions for three films directed by Charles Chaplin to determine if the assumption of a log normal distribution of shot lengths in motion pictures is valid.

## Probability plot correlation coefficient

Parametric statistical tests assume an underlying distribution specified by one or more parameters (such as the mean and the standard deviation), and where this assumption is violated the results of such tests will be unreliable due to a loss of statistical power (Yu 2002). It is therefore necessary to test if such an assumption is valid before proceeding to analyse the data. The probability plot correlation coefficient (PPCC) is a test statistic of the linearity of the relationship between two variables, and can be used to test for both normal and log normal distributions (see Filliben 1975; Looney and Gulledge 1985). The null hypothesis for the PPCC test is that the data are normally distributed, and the PPCC test statistic is

$$
\mathrm{PPCC} = \frac{\sum_{i=1}^{n} (X_i - \bar{x})(Y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{x})^2 \, \sum_{i=1}^{n} (Y_i - \bar{y})^2}}
$$

where *X* and *Y* are observed and expected paired values, and *x-bar* and *y-bar* are the means of the observed and expected values. Where PPCC = 1 the data is perfectly normally/log normally distributed, while PPCC = 0 indicates no correlation. The PPCC is compared to a critical value for a specified level of significance (α) and sample size (*n*). If the PPCC is less than the critical value, the null hypothesis that the data is normally/log normally distributed is rejected. Lookup tables typically give values for sample sizes from *n*=3 to *n*=50, and then at intervals of 5, 10, and 50; but approximate critical values for *n* are given by

The PPCC provides both a quantitative and graphical representation of goodness-of-fit. To produce a probability plot, the order statistics of the observed values (or the transformed order statistics) are plotted against an inverse function of the plotting position given by

where *i* is the rank of the ordered value. If the data is from a normal or log normal distribution with a PPCC near 1, the probability plot of the ordered values will be an approximately straight line and so the linearity of the probability plot is a good indicator of distributional fit. Where data is from an alternative distribution, it will produce a curved probability plot.

## The distribution of shot lengths in the films of Charles Chaplin

In order to test the validity of assuming a log normal distribution for shot lengths, three films written and directed by Charles Chaplin – *The Rounders* (1914), *A Night Out* (1915), and *The Immigrant* (1917) – were selected from the cinemetrics database (Leipa 2006a, 2006b; O’Brien 2008). As the films are shorts, samples were not drawn and the data is uncensored. The distribution of shot lengths in all three films is positively skewed and each film has a number of outlying data points (see Table 1).

Probability plots were constructed and the corresponding PPCCs were determined for shot length data for each film using the process outlined in Jacobs and Dinman (2004). Shot length data was collected and rank-ordered within each data set. An expected standard normal score (z-score) for each shot length was calculated from the inverse standard normal distribution function for a given plotting position of each shot length. The paired data (expected z-score, shot length) was then plotted on a graph, with a linear trend line fitted onto the data. This process was performed on untransformed shot length data (*Xi*) and on the common logarithm (log10(*Xi*)) of the data. The PPCC and approximate critical values for rejection (α=0.05) are reported in Table 1.
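The procedure described above can be sketched in a few lines of Python using only the standard library. This is an illustration and not the code used for Table 1. The shot lengths are invented, the function names are my own, and the plotting position (*i* − 0.375)/(*n* + 0.25) is Blom's, which I assume here purely for the sake of the example since several alternatives are in common use.

```python
import math
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ppcc(values, a=0.375):
    """PPCC of values against a normal distribution.

    Plotting position: (i - a) / (n + 1 - 2a); a = 0.375 is an assumption.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Expected standard normal score for each rank i = 1..n.
    z = [NormalDist().inv_cdf((i - a) / (n + 1 - 2 * a)) for i in range(1, n + 1)]
    return pearson_r(ordered, z)


# Hypothetical shot lengths in seconds, for illustration only.
shots = [1.2, 1.8, 2.1, 2.5, 3.0, 3.4, 4.1, 5.2, 6.8, 9.5, 14.0, 27.3]
print("untransformed:", round(ppcc(shots), 4))
print("log10:        ", round(ppcc([math.log10(x) for x in shots]), 4))
```

Each coefficient would then be compared with the critical value for the sample size and chosen α, and plotting the paired values gives the probability plot itself.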

**Table 1** Summary of three films directed by Charles Chaplin

As expected, Table 1 shows that none of Chaplin’s films are normally distributed. *The Rounders* and *A Night Out* are log normally distributed, but *The Immigrant* is not log normally distributed. These distributions can be clearly identified in the probability plots for each film using untransformed data (Figures 1a, 2a, and 3a) and the common logarithm of the data (Figures 1b, 2b, and 3b).

**Figure 1a** Probability plot of shot length data (*Xi*) for *The Rounders* (1914)

**Figure 1b** Probability plot of shot length data (log10(*Xi*)) for *The Rounders* (1914)

**Figure 2a** Probability plot of shot length data (*Xi*) for *A Night Out* (1915)

**Figure 2b** Probability plot of shot length data (log10(*Xi*)) for *A Night Out* (1915)

**Figure 3a** Probability plot of shot length data (*Xi*) for *The Immigrant* (1917)

**Figure 3b** Probability plot of shot length data (log10(*Xi*)) for *The Immigrant* (1917)

As both films are log normally distributed, parametric statistical tests could be used to analyse the distributions of *The Rounders* and *A Night Out*. However, we could not analyse *The Immigrant* in the same way as the assumption of the log normal distribution of data is not met. Due to the violation of this requirement, applying parametric tests to the distribution of shot lengths in this film will produce misleading results. Specifically, parametric tests will not be powerful enough to describe the distribution of *The Immigrant* and the probability of failing to detect a difference where one exists (Type II error) is increased. Where the data does not fit a theoretical distribution, *nonparametric* statistical tests should be used. Nonparametric tests require fewer assumptions about the data and as they do not rely on the underlying distribution they are often referred to as *distribution-free* (see Gibbons 1993). Nonparametric tests can be applied to all distributions (including log normal) and rather than use parametric tests for some films and nonparametric tests for others, it is better to use nonparametric tests in all cases. An analysis of Chaplin’s films that required two sets of statistical tests, depending on which films were being analysed, would not produce results that allowed the distribution of shot lengths in all films to be compared with one another, and the conclusions drawn from such analysis would not be credible. Some nonparametric tests for the analysis of shot length distributions are listed in Table 2.

**Table 2** Some nonparametric statistical tests for shot length distributions

Salt (2006: 389-396) makes a similar argument regarding the log normality of shot length distributions using the *coefficient of determination* (*R-squared*) to test goodness-of-fit. For simple linear regression, *R-squared* is the square of the correlation coefficient and indicates the proportion of the variance of the distribution of shot lengths that is predicted by the theoretical log normal distribution. (In Figures 1a-3b, the fit of the linear trend line to the data is described by *R-squared*). Salt concludes that some films are log normally distributed while others are not, and this is confirmed by the results in Table 1. He does not make any argument regarding the use of parametric and/or nonparametric tests in cinemetrics where the assumption of log normality is not met.
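The relationship between *R-squared* and the correlation coefficient is easy to check numerically. The sketch below fits a least-squares line to some invented (expected z-score, log10 shot length) pairs and compares the coefficient of determination with the square of the Pearson correlation. The function names and the data are hypothetical.

```python
import math
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line y = a + b*x."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical (expected z-score, log10 shot length) pairs.
z = [-1.5, -0.8, -0.3, 0.0, 0.3, 0.8, 1.5]
log_len = [0.10, 0.31, 0.42, 0.55, 0.61, 0.80, 1.05]
print("R-squared:", r_squared(z, log_len))
print("r squared:", pearson_r(z, log_len) ** 2)
```

The two printed values agree up to rounding error, which is why the *R-squared* of the trend line in a probability plot and the PPCC carry the same information about goodness-of-fit.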

## Conclusion

Parametric statistical tests assume that sample data is drawn from an underlying distribution. Shot length data for motion pictures is typically not normally distributed, although in some cases it may be log normally distributed. This is not the case for all films (even though the data is positively skewed), and so the assumption of a log normal distribution is not universally valid. Taking into account the variability of shot length distributions, it is recommended that nonparametric tests, which make no assumptions about the distribution of the data, be used in analysing film style.

## References

Filliben, J.J. (1975) The probability plot correlation coefficient test for normality, *Technometrics* 17 (1): 111-117.

Gibbons, J.D. (1993) *Nonparametric Statistics: An Introduction*. Newbury Park, CA: Sage.

Jacobs, J.L. and Dinman, J.D. (2004) Systematic analysis of bicistronic reporter assay data, *Nucleic Acids Research* 32 (20): e160.

Leipa, T. (2006a) *The Rounders*, Cinemetrics Database, http://www.cinemetrics.lv/movie.php?movie_ID=306, accessed 19 November 2008.

Leipa, T. (2006b) *A Night Out*, Cinemetrics Database, http://www.cinemetrics.lv/movie.php?movie_ID=254, accessed 19 November 2008.

Looney, S.W., and Gulledge, T.R. (1985) Use of the correlation coefficient with normal probability plots, *The American Statistician* 39 (1): 75-79.

O’Brien, C. (2008) *The Immigrant*, Cinemetrics Database, http://www.cinemetrics.lv/movie.php?movie_ID=1055, accessed 9 December 2008.

Salt, B. (2006) *Moving into Pictures: More on Film History, Style, and Analysis*. London: Starword.

Yu, C.H. (2002) An overview of remedial tools for violations of parametric test assumptions in the SAS system, *Proceedings of 2002 Western Users of SAS Software Conference*. Cary, NC: SAS Institute, Inc.: 172-178. Available online: http://www.creative-wisdom.com/pub/parametric_WUSS2002.pdf, accessed 10 December 2008.