# Expanded sample for lognormal distribution

I have looked at the assumption of lognormality for shot length distributions in the statistical analysis of film style in several earlier posts. Using the probability plot correlation coefficient, I concluded that because the assumption of lognormality could not be justified in up to half of the films studied, it was not appropriate to assume lognormality in general: if your experiment is based on an assumption that is wrong 50% of the time, then your results will not be reliable. This post repeats that earlier analysis using a larger sample of Hollywood films. For a description of the method of using normal probability plots and the probability plot correlation coefficient, and for the earlier results, see those earlier posts or the papers listed at the end of this post.

In Figure 1, we can see an example of a probability plot for a film (20,000 Years in Sing Sing) for which the test not only failed to reject the null hypothesis of lognormality (see below), but which on visual inspection is about as good a fit as you could expect. Although failure to reject a null hypothesis cannot be taken to imply that the hypothesis is true, on the basis of this plot you would be more than happy to treat the shot lengths of this film as being lognormally distributed.

Figure 1 Probability plot of shot length data (LN[X]) for 20,000 Years in Sing Sing (1932) (n = 692, PPCC = 0.9989)

Another useful feature of the normal probability plot is that the slope and the intercept of the regression line provide estimates of the shape factor and mean of the log-transformed data, respectively. In Figure 1, the slope is 0.8245, which is very close to the standard deviation of the log-transformed data (s = 0.8236) and the method proposed by Barry Salt for use in analysing film style based on the ratio of the mean to the median (s* = √(2×LN(mean/median)) = 0.8461). The intercept is 1.5695, giving a geometric mean of 4.7 seconds, which is very close to the median of 4.8s.
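The quantities read off a plot like this can be reproduced with a short calculation. The following is a minimal sketch in Python, not the code used to produce the figures: the function names are hypothetical, the shot lengths are synthetic lognormal values rather than data for a real film, and the plotting position is assumed to be Blom's formula, p_i = (i - 0.375) / (n + 0.25).

```python
import math
import random
from statistics import NormalDist, mean


def blom_positions(n):
    """Blom plotting positions p_i = (i - 0.375) / (n + 0.25) for i = 1..n."""
    return [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]


def lognormal_probability_plot(shot_lengths):
    """Return (slope, intercept, ppcc) for a normal probability plot of LN[X].

    The ordered log shot lengths are regressed on the standard normal
    quantiles at the Blom plotting positions; the PPCC is the Pearson
    correlation between the two.
    """
    y = sorted(math.log(x) for x in shot_lengths)
    z = [NormalDist().inv_cdf(p) for p in blom_positions(len(y))]
    z_bar, y_bar = mean(z), mean(y)
    sxy = sum((a - z_bar) * (b - y_bar) for a, b in zip(z, y))
    sxx = sum((a - z_bar) ** 2 for a in z)
    syy = sum((b - y_bar) ** 2 for b in y)
    slope = sxy / sxx                    # estimate of the shape factor
    intercept = y_bar - slope * z_bar    # estimate of the mean of LN[X]
    ppcc = sxy / math.sqrt(sxx * syy)
    return slope, intercept, ppcc


# Synthetic example (an assumption, not real film data):
# 700 shot lengths drawn from a lognormal with mu = 1.5 and sigma = 0.8.
rng = random.Random(1)
shots = [rng.lognormvariate(1.5, 0.8) for _ in range(700)]

slope, intercept, ppcc = lognormal_probability_plot(shots)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, PPCC = {ppcc:.4f}")
print(f"geometric mean = {math.exp(intercept):.1f} seconds")
```

The PPCC on its own is not a test decision: it has to be compared with a critical value for the sample size and α in question, taken from published tables such as those in the papers listed in the references. The sketch does not include those tables.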

In Figures 2 to 4 we have three films for which the null hypothesis of lognormality (see below) was rejected. What is noticeable about these three films is that the deviation from the hypothesised distribution is different in each case. The plot for Rain (Figure 2) is all over the place; while the data for Steamboat Bill, Jr (Figure 3) deviates from the hypothesised distribution in both the lower and upper tails and the curvature of the plot indicates that the logarithmic transformation has not been successful in removing all of the skew from the data. In contrast, A Free Soul (Figure 4) shows such variation only in its lower tail.

Figure 2 Probability plot of shot length data (LN[X]) for Rain (1932) (n = 308, PPCC = 0.9768)

Figure 3 Probability plot of shot length data (LN[X]) for Steamboat Bill, Jr (1928) (n = 575, PPCC = 0.9839)

Figure 4 Probability plot of shot length data (LN[X]) for A Free Soul (1931) (n = 461, PPCC = 0.9873)

We can also see from the different estimates provided by the slope, s, and s*, and by the intercept and the median, that discrepancies abound. For Rain, the slope gives a shape factor of 1.3036, s = 1.3287, and s* = 1.5865; while the intercept (1.8959) indicates a geometric mean of 6.7 compared to a median value of 5.1 seconds. For Steamboat Bill, Jr, the slope is 0.7135, compared to s = 0.7233 and s* = 0.8920; while the discrepancy between the geometric mean (5.2 [intercept = 1.6572]) and the median (4.8) is less than was observed for Rain. For A Free Soul the differences in the estimates are smaller: for the shape factor, the slope is 1.0206, s = 1.0305, and s* = 1.0962; while the geometric mean is 7.1 (intercept = 1.9596) and the median is 6.6 seconds.

Note that in all three cases, it is the method for estimating the shape factor based on the assumed relationship between the median and the mean (s*) that shows the greatest difference from the other methods. This is because the relationship between the median and the mean is only valid if the data is lognormally distributed. If this is not the case, then the claimed relationship between the median and the mean does not hold, and s* produces inaccurate estimates of the parameters of the lognormal distribution. As this assumption is valid for 20,000 Years in Sing Sing we see that s* provides an estimate close to the other methods; but as the assumption of lognormality is not justified for the other three films, it does not. If we based any analysis of these films upon the assumption that their shot lengths were lognormally distributed, then our conclusions would be worthless because that assumption, and everything we derive from it (including the parameters μ and σ), is not true.
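The dependence of s* on lognormality can be seen from a short derivation, given here as a sketch using the standard properties of the lognormal distribution. If X is lognormally distributed with parameters μ and σ, then

$$
\operatorname{median}(X) = e^{\mu}, \qquad \operatorname{mean}(X) = e^{\mu + \sigma^{2}/2},
$$

so that

$$
\ln\!\left(\frac{\operatorname{mean}(X)}{\operatorname{median}(X)}\right) = \frac{\sigma^{2}}{2}
\quad\Longrightarrow\quad
\sigma = \sqrt{2\,\ln\!\left(\frac{\operatorname{mean}(X)}{\operatorname{median}(X)}\right)}.
$$

Both expressions for the median and the mean are properties of the lognormal distribution specifically. For data with any other distribution, the right-hand side has no particular relationship to the standard deviation of the log-transformed data.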

As noted above, it appears that the assumption of lognormality may be justified in only half of the films we look at. Extending this research with a larger sample will allow us to make a better assessment of the applicability of this assumption to shot length distributions. The probability plot correlation coefficient test of normality was applied to a total of 168 Hollywood films (including some of the films I had previously looked at), divided into three groups: silent films of the 1920s (n = 52), sound films from 1929 to 1931 (n = 66), and sound films from 1932 to 1934 (n = 50). As these are statistical tests of a null hypothesis it is important to remember that failure to reject the null hypothesis does not mean that the data is lognormally distributed, and that some of these tests will conclude the data is not lognormally distributed when in fact it is. The test was applied using a Blom plotting position and α = 0.05. All the data used is from the Cinemetrics database.
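Deciding whether to reject requires a critical value of the PPCC for the sample size and α in question. These are normally taken from published tables; purely as an illustration, the sketch below approximates one by Monte Carlo simulation under the null hypothesis. The function names, the number of replicates, and the seed are assumptions made for the example, and this is not how the results in this post were obtained.

```python
import math
import random
from statistics import NormalDist


def blom_quantiles(n):
    """Standard normal quantiles at the Blom plotting positions."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def ppcc(sample, z):
    """Pearson correlation between the ordered sample and the quantiles z."""
    y = sorted(sample)
    n = len(y)
    y_bar, z_bar = sum(y) / n, sum(z) / n
    sxy = sum((a - z_bar) * (b - y_bar) for a, b in zip(z, y))
    sxx = sum((a - z_bar) ** 2 for a in z)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulated_critical_value(n, alpha=0.05, replicates=2000, seed=0):
    """Approximate the lower-tail critical value of the PPCC for sample size n.

    The PPCC does not depend on the location or scale of the data, so
    standard normal samples stand in for any normal LN[X].
    """
    rng = random.Random(seed)
    z = blom_quantiles(n)
    values = sorted(
        ppcc([rng.gauss(0.0, 1.0) for _ in range(n)], z)
        for _ in range(replicates)
    )
    # Roughly a fraction alpha of the simulated values fall below this one.
    return values[int(alpha * replicates)]


crit = simulated_critical_value(n=50)
print(f"approximate critical value for n = 50, alpha = 0.05: {crit:.4f}")
```

A film's null hypothesis of lognormality would be rejected when the PPCC of its log-transformed shot lengths falls below the critical value for its own number of shots. Because the critical value rises towards 1 as the sample size grows, a PPCC that looks high in absolute terms can still lead to rejection for a film with several hundred shots.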

Of the silent films produced in the 1920s (Table 1), the hypothesis of lognormality was rejected in 39 of the 52 cases, or 75% of the time. Of the sound films produced between 1929 and 1931 (Table 2), lognormality was rejected in 50 out of 66 cases (76%); and of the sound films from 1932 to 1934 (Table 3), it was rejected in 40 out of 50 cases (80%).

Table 1 Probability plot correlation coefficient test of the null hypothesis (H0) that the data is lognormally distributed for Hollywood films produced in the 1920s (n = 52)

Table 2 Probability plot correlation coefficient test of the null hypothesis (H0) that the data is lognormally distributed for Hollywood films produced from 1929 to 1931 (n = 66)

Table 3 Probability plot correlation coefficient test of the null hypothesis (H0) that the data is lognormally distributed for Hollywood films produced from 1932 to 1934 (n = 50)

What stands out from these results is that the proportion of films for which there is sufficient evidence against the assumption of lognormality is similar for each group of films. The earlier results that indicated that lognormality could not be assumed in half the films now look over-optimistic: the assumption of lognormality may only be justified in between a fifth and a quarter of cases for Hollywood films. Certainly this is a long way from the assumption that shot length distributions are generally lognormal. Whether this is true for cinemas in other countries or other eras will have to wait for a later post.
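To put the rejection rates in context, it helps to compare them with the number of rejections expected by chance alone. The sketch below tallies the counts reported above. It assumes, purely for illustration, that each test rejects a true null hypothesis with probability exactly α = 0.05; the variable and function names are hypothetical.

```python
ALPHA = 0.05

# (films tested, H0 rejected) for each group, as reported in the text
groups = {
    "silent, 1920s": (52, 39),
    "sound, 1929-1931": (66, 50),
    "sound, 1932-1934": (50, 40),
}


def rejection_rate(tested, rejected):
    """Proportion of films for which lognormality was rejected."""
    return rejected / tested


rates = {name: rejection_rate(*counts) for name, counts in groups.items()}
total_tested = sum(tested for tested, _ in groups.values())
total_rejected = sum(rejected for _, rejected in groups.values())

# Rejections expected if every film were lognormal (type I errors only)
expected_false_rejections = ALPHA * total_tested

for name, rate in rates.items():
    print(f"{name}: {rate:.0%} rejected")
print(f"all films: {total_rejected} of {total_tested} rejected")
print(f"expected by chance alone: {expected_false_rejections:.1f}")
```

Under that assumption, only about 8 of the 168 films would be rejected if every film were lognormal, against the 129 rejections observed.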

The other thing that stands out is that there is no pattern among the films: we cannot distinguish short films from features, silent films from sound films, or films of different genres, studios, or decades as being lognormal or not lognormal. We can say that the assumption of lognormality will be justified in some cases, but that in the overwhelming majority of cases it is not. Additionally, as noted above, the films that are not lognormal differ from the assumed lognormal distribution in different ways. Statistical studies of film style should be developed with this in mind.

## References

Filliben JJ 1975 The probability plot correlation coefficient test for normality, Technometrics 17 (1): 111-117.

Looney SW and Gulledge TR 1985 Use of the correlation coefficient with normal probability plots, The American Statistician 39 (1): 75-79.

Vogel RM 1986 The probability plot correlation coefficient test for the normal, lognormal, and Gumbel distribution hypotheses, Water Resources Research 22 (4): 587-590. 