# Blog Archives

## Age, gender, and television in the UK

UPDATE: This article has now been published – in a corrected form (see the comments below) – as Age, Gender, and Television in the United Kingdom, *Journal of Popular Television* 3 (1) 2015: 57-73. DOI: 10.1386/jptv.3.1.57_1. The post print of the article can be accessed here: Nick_Redfern – Age Gender and Television post print.

In December 2011 I published a post on genre preferences among UK cinema audiences, applying correspondence analysis to data from the BFI’s *Opening Our Eyes* report. You can read the article that was subsequently published in *Participations* last year here.

At the time I meant to write a follow-up piece on genre preferences for UK television audiences using data from the same source but I never quite got round to it. I have now finished this analysis and the draft article can be found in the pdf file attached to this post. I also look at how age and gender affect audiences’ perceptions of television as a medium.

We apply correspondence analysis to data produced for the BFI’s *Opening Our Eyes* report published in 2011 to discover how age and gender shape the experience of television for audiences in the UK. Age is an important factor in shaping how audiences perceive television, with older viewers describing the medium as ‘informative,’ ‘thought provoking,’ ‘artistic,’ ‘good for people’s self-development,’ and ‘escapist,’ while younger viewers are more likely to describe television as ‘exciting,’ ‘fashionable,’ and ‘sociable.’ Younger respondents are also more likely to describe the effect of television on people/society as negative. Variation in programme choice is highly structured in terms of age and gender, though the extent to which each of these factors determines audience choice varies greatly. Gender is the dominant factor in explaining preferences for some programme types with age a secondary factor in several cases, while age is the explanatory factor for other genres for which gender seemingly has little influence. Male audiences prefer sports, factual entertainment, and culture programmes and female audiences reality TV/talent shows, game/quiz/panel shows, chat shows, and soap operas. Older audiences prefer news, documentaries, and wildlife/nature programmes, while music shows/concerts and comedy/sitcoms are more popular with younger viewers.

The BFI report and the raw data can be accessed here.

## Film style and narration in Rashomon

UPDATE: 13 April 2014: The revised version of this article has now been published as Film Style and Narration in *Rashomon*, *Journal of Japanese and Korean Cinema* 5 (1-2) 2013: 21-36. DOI: 10.1386/jjkc.5.1-2.21_1. A post-print of the article can be downloaded here: Nick_Redfern – Film style and narration in Rashomon (post print)

And so after a long (and much enjoyed) break I return to the blogosphere with the first draft of a paper on film style and narration in *Rashomon*. This paper is different to other statistical analyses of film style I have published on this site and to all other studies of film style and narration because it uses multivariate analysis to look at several different aspects of film style together. The method used is multiple correspondence analysis, and you can find a good introductory chapter on MCA here. The software I used is FactoMineR for R, and the website explaining how to do the analysis can be found here.

Multivariate analysis has been used in the quantitative study of literature for some time (see the links below the abstract), but this is the first time multivariate analysis has been applied to film style and it appears to work very well. I am currently looking at some other applications, particularly in distinguishing between the different parts of portmanteau horror films (which is a proper scholarly endeavour and not simply an excuse to watch lots of portmanteau horror films).

An Excel file containing the data used in the analysis can be accessed here: Nick Redfern – Rashomon. This file contains two worksheets: the first is the shot length data for the film, and the second is the data used in the multiple correspondence analysis.

## Abstract

This article analyses the use of film style in *Rashomon* (1950) to determine if the different accounts of the rape and murder provided by the bandit, the wife, the husband, and the woodcutter are formally distinct by comparing shot length data and using multiple correspondence analysis to look for relationships between shot scale, camera movement, camera angle, and the use of point-of-view shots, reverse-angle cuts, and axial cuts. The results show that the four accounts of the rape and the murder in *Rashomon* differ not only in their content but also in the way they are narrated. The editing pace varies so that although the action of the film is repeated the presentation of events to the viewer is different each time. There is a distinction between presentational (shot scale and camera movement) and perspectival (shot types) aspects of style depending on their function within the film, while other elements (camera angle) fulfil both these functions. Different types of shot are used to create the narrative perspectives of the bandit, the wife, and the husband that mark them out as either active or passive narrators reflecting their level of narrative agency within the film, while the woodcutter’s account exhibits both active and passive aspects to create an ambiguous mode of narration. *Rashomon* is a deliberately and precisely constructed artwork in which form and content work together to create an epistemological puzzle for the viewer.

On the multivariate analysis of literature see the following:

**Hoover DL** 2003 Multivariate analysis and the study of style variation, *Literary and Linguistic Computing* 18 (4): 341-360.

**Stewart LL** 2003 Charles Brockden Brown: quantitative analysis and literary style, *Literary and Linguistic Computing* 18 (2): 129-138.

**Tabata T** 1995 Narrative style and the frequencies of very common words: a corpus-based approach to Dickens’s first person and third person narratives, *English Corpus Studies* 2: 91-109.

## The mAR index of Hollywood films

UPDATE (March 2015): A revised version of this paper has now been published as Robust estimation of the mAR index of high grossing films at the US box office, 1935 to 2005, *Journal of Data Science* 12 (2) 2014: 277-291. [The pdf of this article can be accessed here: 4.JDS-1181_final-1].

UPDATE: reviewing the methodology of the mAR index in general, Mike Baxter noted an error in the data whereby I had reported the exponent of the negative exponential function instead of the mAR index for films from the 1960s. I have now corrected this and redone the analysis and the graphs (which are still cool). This mainly affects the conclusions regarding differences between genres. Overall, it turns out that, as a result of this error, I had actually underestimated the difference between the classical and rank mAR indices. If anyone finds any other errors then feel free to add a comment to this post and I’ll try to correct it as soon as possible.

And so to finish the month as we started, looking at robust estimates of the mAR index of film style. Below is the first draft of a paper comparing the mAR index based on the methods used by James Cutting, Jordan De Long and Christine Nothelfer to describe the clustering of shots in motion pictures with a rank-based alternative that is resistant to outliers. Naturally, it features some pretty cool graphs.

Robust estimation of the modified autoregressive index for high grossing films at the US box office, 1935 to 2005

The modified autoregressive (mAR) index describes the clustering of shots of similar duration in a motion picture. In this paper we derive robust estimates of the mAR index for high grossing films at the US box office using a rank-based autocorrelation function resistant to the influence of outliers and compare this to estimates obtained using the classical, moment-based autocorrelation function. The results show that (1) the classical mAR function underestimates both the level of shot clustering and the variation in style among the films in the sample; (2) there is a decline in shot clustering from 1935 to the 1950s followed by an increase from the 1960s to the 1980s and a levelling off thereafter rather than the monotonic trend indicated by the classical index, and this is mirrored in the trend of the median shot lengths and interquartile range; and (3) the rank mAR index identifies differences between genres missed by the classical index.
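To give a feel for the difference between a moment-based and a rank-based autocorrelation, here is a minimal Python sketch. It is not the estimator used in the paper and it does not compute the mAR index itself: it simply applies the ordinary autocorrelation formula once to the raw values and once to their ranks, which is one simple way (an assumption on my part) of making the estimate less sensitive to a few very long takes. The function names are hypothetical.

```python
def autocorrelation(series, lag):
    """Classical (moment-based) sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom


def ranks(series):
    """Ranks of the values (1 = smallest), with tied values sharing
    the average of the ranks they occupy."""
    order = sorted(range(len(series)), key=lambda i: series[i])
    result = [0.0] * len(series)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and series[order[end + 1]] == series[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def rank_autocorrelation(series, lag):
    """The same formula applied to the ranks instead of the raw values."""
    return autocorrelation(ranks(series), lag)


# Hypothetical shot lengths in seconds, including one very long take.
shot_lengths = [4.2, 3.9, 5.1, 4.8, 61.0, 2.2, 2.5, 2.1, 3.0, 2.8]
print(autocorrelation(shot_lengths, 1))
print(rank_autocorrelation(shot_lengths, 1))
```

Because the ranks only record the order of the shots, replacing the 61 second take with any other value that is still the longest shot leaves the rank-based figure unchanged, whereas the classical figure moves with it.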

## Robust estimation of the modified autoregressive index of film style

Earlier I looked at the time series structure of ITV news bulletins using robust methods of autocorrelation. This post follows on from that earlier study, this time looking at BBC news bulletins. This paper was written with three goals in mind. First, I wanted to improve on the method used before. Second, I wanted to try the rank-based method of estimating the mAR index. Third, I wanted to apply these methods to a different cluster of data sets to see if I would come up with similar results.

The paper can be accessed as a pdf file here: Nick Redfern – Robust estimation of the modified autoregressive index of film style

**Abstract**

The modified autoregressive index (mAR) describes the tendency of shots of similar length to cluster together in a motion picture but is not resistant to the influence of outliers if derived from the classical moment-based partial autocorrelation function. In this paper we calculate robust estimates of the modified autoregressive index using outlier-resistant partial autocorrelation functions based on the ranks of the shot length data and on a robust measure of scale. The classical, rank, and robust methods of determining mAR are compared for a sample of BBC news bulletins.

## Some notes on cinemetrics V

This post addresses some issues raised by Mike Baxter as part of the ‘cinemetrics conversation’ at the Cinemetrics website (and is the post I would have produced last week had I been able to remember which bit of software had the right command to create the necessary graph). You can find an introduction to the conversation here and my first response to some of the issues raised here.

I want to address two issues: first, the nature of outliers in shot length distributions and better methods of representing such distributions than I have used up to now; and, second, the straw-man the median shot length has become in Baxter’s comments.

Baxter’s comments in response to my earlier piece can be found in the second tab under his name here. In section 2 Baxter questions my use of the term ‘outlier’ and the definition used to identify such shots. This is fair enough – we wouldn’t get very far if such definitions weren’t questioned. In the examples of *Lights of New York* and *Scarlet Empress*, Baxter argues there is no evidence of outliers since

it’s difficult to identify any point at which ‘extremes’ begin, or discontinuities in the distribution of the kind I think are needed to assert, with any confidence, that you are dealing with ‘outliers.’

Baxter never defines what such a discontinuity would look like and so his argument is vague. (Arguably this is the semantic version of a slippery slope).

Figure 1 shows the kernel density and boxplot of *Lights of New York*. There is a 12.2 second gap between the five shots of longest duration and the sixth longest, presumably the sort of discontinuity Baxter refers to, and he does concede he might be prepared to accept five shot lengths as extreme values (though he does not say on what basis). From Figure 1 we can see there are in fact several such discontinuities, and that the kernel density is zero at several points in the upper tail (indicating the kernels do not overlap), particularly above 30 seconds (which corresponds to the 22 extreme outliers identified using this type of boxplot). However, a limitation of this boxplot is that it does not take into account the skew of the distribution and so over-identification of outliers is a problem.

**Figure 1** Kernel density and boxplot of shot lengths in *Lights of New York* (1928)

Figure 2 presents the same data using an adjusted boxplot that takes into account the skewed nature of the data. This method uses the medcouple, a robust measure of skewness, to identify outliers. The adjusted boxplot can be generated using the **adjbox()** command in the **R** package robustbase.
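For readers who want to see what the adjustment does, here is a minimal Python sketch of the fences of the adjusted boxplot. It is not the robustbase implementation: the function names are hypothetical, the medcouple is computed by brute force over all pairs (fine for a few hundred shots, slow for much more), and the quartile convention is an assumption that may differ slightly from the one used by **adjbox()**. The exponential factors are those given for the adjusted boxplot in the article cited below.

```python
import math
import statistics


def medcouple(data):
    """Naive O(n^2) medcouple: the median of a kernel evaluated on every
    pair (lo, hi) with lo <= median <= hi."""
    x = sorted(data)
    m = statistics.median(x)
    lower = [v for v in x if v <= m]
    upper = [v for v in x if v >= m]
    k = sum(1 for v in x if v == m)   # observations tied with the median
    first_tie = len(lower) - k        # index of the first tie in `lower`
    kernel = []
    for i, lo in enumerate(lower):
        for j, hi in enumerate(upper):
            if lo == hi:
                # Both values equal the median: use the sign-based kernel
                # for ties so that a 0/0 division never occurs.
                s = (i - first_tie + 1) + (j + 1) - 1
                kernel.append((s > k) - (s < k))
            else:
                kernel.append(((hi - m) - (m - lo)) / (hi - lo))
    return statistics.median(kernel)


def adjusted_fences(data):
    """Lower and upper fences of the skewness-adjusted boxplot.
    Values outside the fences are flagged as outliers."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple(data)
    if mc >= 0:
        lower = q1 - 1.5 * math.exp(-4 * mc) * iqr
        upper = q3 + 1.5 * math.exp(3 * mc) * iqr
    else:
        lower = q1 - 1.5 * math.exp(-3 * mc) * iqr
        upper = q3 + 1.5 * math.exp(4 * mc) * iqr
    return lower, upper


# Hypothetical right-skewed shot lengths in seconds.
shots = [1, 2, 3, 4, 5, 7, 10, 15, 25, 40, 60]
low, high = adjusted_fences(shots)
print(medcouple(shots), low, high)
print([s for s in shots if s < low or s > high])
```

When the medcouple is zero the fences reduce to the usual 1.5 × IQR rule. For right-skewed data the medcouple is positive, so the upper fence moves outwards and the lower fence moves inwards, which is why fewer long takes are flagged in the upper tail and why points in the lower tail can be flagged at all.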

The number of outliers in Figure 2 is much smaller than in the original boxplot: in the upper tail 10 shots greater than 55 seconds are identified as outliers (or 3% of the total). Nonetheless, there are still some values which are sufficiently removed from the rest of the data to be classed as outliers even when accounting for the asymmetry of the distribution. Whether or not Baxter would accept this definition would depend on the interpretation of his use of the term ‘discontinuity,’ which he does not define.

Surprisingly, this method identifies three outliers in the lower tail of the distribution (which I wasn’t expecting and will have to think about more).

**Figure 2** Kernel density and adjusted boxplot of shot lengths in *Lights of New York* (1928)

The following article describes the adjusted boxplot and its calculation:

**Hubert M and Vandervieren E** 2008 An adjusted boxplot for skewed distributions, *Computational Statistics and Data Analysis* 52 (12): 5186-5201. An ungated, earlier version of this paper can be accessed here.

Even if we accept Baxter’s argument that there are no outliers in *Lights of New York* it remains necessary to be aware of the problems caused by outliers in data sets and to check the distribution of shot lengths so that we are not fooled by non-robust statistics. Certainly more effort will have to be devoted to defining what is or is not an outlier (in either statistical or filmic terms) in research of this type. (But it is much easier when you remember which bit of software to use).

Finally, I wish to address a misrepresentation that has taken a hold at this early stage in the ‘cinemetrics conversation.’

Baxter writes

the use of either the ASL or median as *the* statistic for attempting to summarise ‘style’ doesn’t make much sense (as Salt observes) [original emphasis].

This argument is a straw-man.

I have never stated that the median shot length is *the* statistic for describing film style. I have argued that the median shot length *is better than* the mean shot length for describing film style, and should therefore be preferred for the following reasons:

- the median is conceptually simple and easy to calculate, and is certainly no more difficult than the mean.
- the median shot length has a clearly defined meaning and the difference between two median shot lengths is also meaningful, whereas the meaning of the mean and of the difference between two mean shot lengths is not clear in either case (and seems to change every time I raise an objection against them).
- the median shot length is not affected by a monotone transformation (the median of the log-transformed shot lengths is simply the logarithm of the median shot length), while the possibilities for confusing the arithmetic and geometric means are endless.
- the median locates the centre of a distribution irrespective of its shape, whereas this is not true of the mean.
- the median is not affected by outliers or extreme values (however you choose to define them), whereas this is not true of the mean.
- interpretations of film style based on the median shot length are consistent with graphical methods and (it turns out) with dominance statistics (Cliff’s *d*, HLΔ), while those based on the mean shot length are not.
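Two of the points in this list are easy to check numerically. The following Python sketch uses made-up shot lengths (they are not taken from any film) and an odd number of shots, so that the median is one of the observed values.

```python
import math
import statistics

# Hypothetical shot lengths in seconds (not data from a real film).
shots = [2.1, 3.4, 4.0, 5.1, 6.3, 7.8, 9.5]
# The same shots, but with the longest replaced by one very long take.
shots_with_long_take = shots[:-1] + [95.0]

median_before = statistics.median(shots)
median_after = statistics.median(shots_with_long_take)
mean_before = statistics.mean(shots)
mean_after = statistics.mean(shots_with_long_take)

# Monotone transformation: take logs first, then the median.
median_of_logs = statistics.median([math.log(s) for s in shots])
log_of_median = math.log(median_before)

# The arithmetic and geometric means, by contrast, are different numbers.
arithmetic_mean = mean_before
geometric_mean = math.exp(statistics.mean([math.log(s) for s in shots]))

print(median_before, median_after)    # unchanged by the long take
print(mean_before, mean_after)        # pulled upwards by the long take
print(median_of_logs, log_of_median)  # the same value
print(arithmetic_mean, geometric_mean)
```

A single long take more than triples the mean of this toy sample while leaving the median where it was, and taking logarithms before or after finding the median makes no difference.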

*But* I have always argued that it is important to use a range of statistical methods to get a full understanding of the nature of film style.

As far as I am aware I am the only person writing about film style to even consider the dispersion of shot lengths in a motion picture and the appropriate methods to use for this. I am also the only person to use a range of graphical methods (probability plots, boxplots, empirical cumulative distribution functions, kernel densities, order structure matrices, running Mann-Whitney Z statistics, rank-frequency plots) to describe film style. I am the only person in film studies to employ confidence intervals, statistical hypothesis tests, effect sizes, or even to describe the methodologies I use in studying film style. (Others working outside film studies in disciplines where quantitative methods are commonplace also use such tools as a matter of routine, and those within film studies would do well to learn from their example).

I am also the only person who has attempted to describe these methods so that others may try to analyse film style for themselves. I am the only person who has brought to the attention of researchers in film studies the availability of free learning resources and software for statistics. I am the only person to look outside film studies for empirical research on film style and to bring it to the attention of film scholars. I am the only person to address the issue of statistical literacy in film studies (here and here).

Baxter writes that

the accessibilty of computational power, and essential simplicity of important statistical ideas (however mathematically complex) is a hobby-horse of sorts.

I am glad to hear this, because it means that if someone else is prepared to devote some time and effort to explaining statistical concepts and methods to film scholars then I won’t have to do it on my own.

However, as Baxter presents the argument, I am interested only in the median shot length, while Barry Salt apparently does not have a narrow attachment to a particular statistic of film style and embraces a pluralistic approach. Yet I am not aware of any forum in which Salt has made any concession on his view that the mean shot length is the only appropriate statistic of film style. In fact, I am unaware of any other statistics of film style used by Salt besides the average shot length and the histogram (while his odd comments on the calculation of kernel density estimates indicate he may not properly understand other methods).

Baxter has his argument back to front here: you won’t find methodological ecumenism in the statistical analysis of film style in the work of Barry Salt.

## Statistical Resources: Free Statistics Lectures

The truly great thing about the internet is the amount of really good stuff you can get for free, and one of the best things if you want to learn something new is the availability of lectures from universities via media sites. YouTube has a large number of statistics lectures available for you to peruse: searching for “statistics lecture” returns 641 hits, and searching for specific topics in statistics and probability will return many more.

Here is a selection of introductory statistics lectures that are freely available on YouTube that you might want to try if you are interested in applying statistical methods in film studies and don’t have easy access to a statistician.

Possibly the best place to start is Daniel Judge’s Statistics Lecture from the Department of Mathematics at East Los Angeles College. This lecture is clearly delivered and starts with a focus on data. A good feature is that, unlike some others available, this lecture is broken up into bite-size chunks so it’s much easier to manage. Subsequent lectures in the series look at describing data numerically and graphically, probability theory, and the normal distribution.

Here’s a great introductory lecture that uses baseball to explain (amongst other things) the difference between parameters and statistics and samples and populations (which I have commented on elsewhere), and which also explains why a batting average isn’t an average. A common problem in the use of statistics in film studies is that statistical terms are used without any proper understanding of what they mean, and this lecture goes to great lengths to explain what is meant by *categorical data* or *relative frequency*.

Math Doctor Bob has a whole series of video lectures available covering a very wide range of topics in mathematics, including statistics and probability. I’ve found his lectures on matrix algebra very useful. This is probably not the place to start if you’re a beginner since the lectures cover specific demonstrations of individual topics and often assume some knowledge of maths but they are very clear and easy to follow. Here is a lecture on how to do a two-tailed hypothesis test.

Finally, here is part 1 of Hans Rosling’s BBC programme *The Joy of Stats* from the Open University which is worth taking some time to watch (even if you only want to know which part of England had the highest rate of bastardy in 1842). The other parts of the programme can be accessed at the OU’s stats playlist here.

## Analysing film style using dominance statistics

UPDATE: An article using the ideas introduced in this post has now been published as Comparing the Shot Length Distributions of Motion Pictures using Dominance Statistics, Empirical Studies of the Arts 32 (2) 2014: 257-273. DOI: 10.2190/EM.32.2.g. It can be found here.

Statistical comparisons of film style have been based on the average shot length (either the mean or the median), so that, for example, given the ASLs of two films the one with the greater average is said to be edited more slowly.

In his first contribution to the Cinemetrics conversation, Mike Baxter argued that in some circumstances neither the mean nor the median is a useful statistic of film style. In this post I look at how we might compare the shot length distributions of two films or two groups of films without beginning with an average shot length. The methods used are Cliff’s *d* statistic, which measures the stochastic dominance of one sample over another, and the Hodges-Lehmann median difference, which measures the average distance between observations in the two samples. Results produced by these methods are then compared to the interpretation of film style using average shot lengths, measures of dispersion, and graphical methods. This will also provide us with an opportunity to consider Baxter’s further claim that it makes little difference which average is used since either would lead to the same interpretation of film style.

### Cliff’s *d* statistic

Cliff (1993, 1996) introduced the stochastic difference

*d* = *P*(*X* >*Y*) – *P*(*X*<*Y*)

as a nonparametric method of measuring the extent to which two samples (*X* and *Y*) overlap. This means we find the probability that an observation in sample *X* is greater than an observation in sample *Y*, and from this we subtract the probability that an observation in *Y* is greater than an observation in *X*. Ties are not included in the calculation. Cliff’s *d* statistic can be calculated as a linear transformation of the probability of superiority:

*d* = 2*PS* – 1

where *PS* is equal to the Mann-Whitney *U* test statistic divided by the product of the sample sizes (*PS* = *U*/*nm*) (see Delaney & Vargha 2002). Since *PS* = *P*(*X* > *Y*) + 0.5*P*(*X* = *Y*), ties are accounted for. The value of *d* ranges from -1 (when every observation in *X* is less than every observation in *Y*) to 1 (when every observation in *X* is greater than every observation in *Y*); and stochastic equality occurs at 0 (when there is complete overlap between the distributions).
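The two routes to *d* described above can be checked against each other with a few lines of Python. This is only a sketch for small samples: it counts every pair directly rather than using a Mann-Whitney routine from a statistics package, and the function names and data are hypothetical.

```python
def cliffs_d(x, y):
    """d = P(X > Y) - P(X < Y), estimated from all n * m pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def probability_of_superiority(x, y):
    """PS = P(X > Y) + 0.5 * P(X = Y), i.e. U / (n * m) with ties
    counted as half."""
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


# Hypothetical shot lengths in seconds for two short sequences.
film_x = [1, 2, 3]
film_y = [2, 3, 4]

d_direct = cliffs_d(film_x, film_y)
d_from_ps = 2 * probability_of_superiority(film_x, film_y) - 1
print(d_direct, d_from_ps)
```

Of the nine pairs here, one has *x* greater than *y*, six have *x* less than *y*, and two are ties, so both routes give *d* = (1 – 6)/9. The agreement is not a coincidence: 2*PS* – 1 = 2*P*(*X* > *Y*) + *P*(*X* = *Y*) – 1, and since the three probabilities sum to 1 this reduces to *P*(*X* > *Y*) – *P*(*X* < *Y*).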

This statistic has several advantages for comparing two distributions:

- It is not based on any assumptions about the data
- it is robust against outliers and unequal variances
- it is invariant under monotonic transformation
- it provides a more direct answer to the sort of questions researchers often wish to ask of data: ‘if one’s primary interest is in a quantification of the statement “*X*s tend to be higher than *Y*s,” then [*d*] provides an unambiguous description of the extent to which this is so’ (Cliff 1996: 125).

The stochastic dominance of one sample over another can be visualised graphically since *d* measures the extent to which one population distribution lies to the right of another.

### Hodges-Lehmann median difference

Although we can use Cliff’s *d* to discover if the shots in one film tend to be shorter than the shots in another it cannot tell us how much shorter those shots tend to be. For this we need another statistic. The Hodges-Lehmann median difference (HLΔ) for two samples is the median of all the pairwise differences between every observation in *X* and every observation in *Y*:

HLΔ = med{*x*_{i} – *y*_{j}}

In other words, subtract the length of every shot in one film (*Y*) from the length of every shot in the other (*X*) and then find the median of the *n* × *m* differences. HLΔ is a measure of the average distance between observations in *X* and observations in *Y*.
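As a minimal sketch of this calculation in Python (the function name and the data are hypothetical, and no confidence interval is computed):

```python
import statistics


def hodges_lehmann_difference(x, y):
    """Median of all n * m pairwise differences x_i - y_j."""
    differences = [a - b for a in x for b in y]
    return statistics.median(differences)


# Hypothetical shot lengths in seconds for two short sequences.
film_x = [1, 2, 3]
film_y = [2, 3, 4]
print(hodges_lehmann_difference(film_x, film_y))
```

The nine differences are –3, –2, –2, –1, –1, –1, 0, 0 and 1, so the median difference is –1: on average a shot in the first sequence is one second shorter than a shot in the second. Swapping the two samples changes the sign of the result but not its size.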

### Comparing the style of two films

As a first example let’s take *Lights of New York* and *Scarlet Empress*, the films I used in my own contribution to the Cinemetrics conversation. Basing our interpretation on the median shot lengths we see that *Lights of New York* has a median of 5.1 seconds and that *Scarlet Empress* has a median of 6.5 seconds, indicating that the former is edited more quickly than the latter. In contrast, an interpretation based on the mean shot length implies that both films are cut equally quickly since each film has a mean shot length of 9.9 seconds.

To calculate *d* we first need to perform the Mann-Whitney *U* test, which gives us *U* = 88188, and then we derive the probability of superiority by dividing by the product of the sample sizes (338 and 601):

*PS* = *U*/*nm* = 88188/(338 × 601) = 0.4341.

From this we can calculate the stochastic dominance between the two distributions:

*d* = 2*PS* – 1 = (2 × 0.4341) – 1 = –0.1318.
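The arithmetic can be reproduced in a couple of lines of Python from the figures quoted above (the variable names are mine):

```python
# Figures quoted in the text: the Mann-Whitney U statistic and the
# two sample sizes (numbers of shots).
u_statistic = 88188
n, m = 338, 601

ps = u_statistic / (n * m)   # probability of superiority
d = 2 * ps - 1               # Cliff's d

print(round(ps, 4))
print(round(d, 4))
```

Carrying *PS* at full precision gives *d* = –0.1317 to four decimal places; the –0.1318 shown above comes from rounding *PS* to 0.4341 before doubling it. The difference has no bearing on the interpretation.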

Therefore, we conclude that shots in *Lights of New York* tend to be of shorter duration than those of *Scarlet Empress*. This can be clearly seen in Figure 1, which shows the empirical cumulative distribution functions of the two films.

**Figure 1** The empirical cumulative distributions of *Lights of New York* and *Scarlet Empress* (KS Test: *D* = 0.12, *p* < 0.01)

The function of *Scarlet Empress* tends to lie to the right of that of *Lights of New York*, indicating that it has shots of longer duration, except in the very upper tail, where *Lights of New York* has a few unusually long takes that account for only ~7% of the shots in this film. It is this handful of shots that pulls the mean away from the mass of the data, and if we remove the 24 longest shots from the distribution of *Lights of New York* we see that the mean shot length falls to 6.4 seconds. This is clearly a very influential group of outliers as just this 7% of the total number of shots leads to a 33% difference in the mean equivalent to a 3.5 second increase. It takes an act of wilful perversity to claim that there are no outliers present in this data, that the mean is not greatly influenced by those outliers, and that the mean shot length is an accurate description of the style of this film.

For these two films HLΔ = -1.0 (95% CI: -1.6, -0.4), which means that on average a shot in *Lights of New York* is 1 second shorter in duration than a shot in *Scarlet Empress*.

The interpretation of the difference in the style of these films based on Cliff’s *d* and HLΔ is consistent with that based on the median shot length but not with the conclusion derived from the mean shot length. The difference in these statistics indicates that far from leading to the same conclusion they lead to contrary and incompatible conclusions, and so Baxter’s argument that the choice of statistic is irrelevant does not hold in this case.

### Comparing the style of two groups of films

To compare the style of two groups of films we use the same methods described above and calculate the pairwise statistics for all the films in both samples. We can then take the median value of the *n* × *m* *d* statistics and of the *n* × *m* HLΔ statistics as estimates of the differences between the two groups.
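The pairwise procedure for two groups can be sketched in Python as follows. The data are invented purely to show the mechanics (two 'films' in one group and three in the other, each with three shots), the function names are hypothetical, and no confidence intervals are calculated.

```python
import statistics


def cliffs_d(x, y):
    """d = P(X > Y) - P(X < Y), estimated from all pairs of shots."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def hl_delta(x, y):
    """Median of all pairwise differences between shots in x and y."""
    return statistics.median([a - b for a in x for b in y])


def compare_groups(group_a, group_b):
    """Compute d and HL-delta for every pair of films (one from each
    group) and summarise each set of statistics by its median."""
    d_values = [cliffs_d(a, b) for a in group_a for b in group_b]
    hl_values = [hl_delta(a, b) for a in group_a for b in group_b]
    return statistics.median(d_values), statistics.median(hl_values), len(d_values)


# Hypothetical shot lengths in seconds; each inner list is one film.
group_a = [[2, 3, 4], [1, 2, 3]]
group_b = [[3, 4, 5], [2, 3, 4], [4, 5, 6]]

median_d, median_hl, comparisons = compare_groups(group_a, group_b)
print(comparisons, median_d, median_hl)
```

With 2 films in one group and 3 in the other there are 2 × 3 = 6 film-to-film comparisons; both medians are negative here, indicating that shots in the first group tend to be shorter than shots in the second.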

To illustrate this I use the example of the Laurel and Hardy short films I discussed in an earlier paper. In this study I compared the median shot lengths of a sample of silent films and a sample of sound films starring Laurel and Hardy produced between 1927 and 1933, and concluded that there was a statistically significant difference between the two samples of medians but that it was a small difference reflecting the continuity of a mode of production, of creative personnel, and of a style of comedy with the introduction of sound technology. The difference in the median shot lengths was estimated to be HLΔ = 0.5 seconds (95% CI: 0.1, 1.1) and *PS* = 0.2333. (I also compared statistics of the dispersion of shot lengths in these films but I won’t discuss these here).

If this analysis had been conducted using the mean shot length then I would have reached a different conclusion, with HLΔ = 1.5 seconds (95% CI: 0.8, 2.3) and *PS* = 0.1188. This result would appear to indicate that the introduction of sound technology had a large impact on the style of Laurel and Hardy films and would lead us to conclude there is no continuity from the silent to the sound era. Again, there is a difference in the interpretation of the style of these films indicated by the different statistics: the estimate of the impact of sound technology based on the means (1.5 seconds) is three times that based on the medians (0.5 seconds). Again, Baxter’s argument that the choice of statistic does not matter simply doesn’t hold water.

What conclusion do the dominance statistics lead to? As we have a sample of 12 silent films and a sample of 20 sound films we need to perform a total of 12 × 20 = 240 calculations. Table 1 presents the pairwise comparisons for Cliff’s *d*, while the pairwise HLΔ statistics are in Table 2.

The median of the pairwise Cliff’s *d* statistics is -0.0957 (95% CI: -0.1192, -0.0723). This indicates that shots in the silent films of Laurel and Hardy tend to be of shorter duration than those of their sound films, and that this effect is relatively small.

**Table 1** Pairwise Cliff’s *d* statistics for silent and sound films of Laurel and Hardy. (This table is very large so click on it to see it full size).

The median of the pairwise HLΔ statistics is 0.4 seconds (95% CI: 0.3, 0.5), which again indicates a significant if small difference between the samples, with the shots in the sound films tending to be of slightly longer duration on average than those of the silent films.

**Table 2** Pairwise HLΔ statistics for silent and sound films of Laurel and Hardy. (This table is very large so click on it to see it full size).

Both these results are consistent with my analysis based on the median shot length. Neither of these statistics is compatible with the interpretation based on the mean shot lengths.

A problem with applying Cliff’s d and HLΔ in this way is that as the sample sizes grow the number of pairwise comparisons becomes very large. For example, if we wanted to compare the style of two groups of films with 100 films in each sample we would have to perform 100 × 100 comparisons. That’s a total of 10,000 Mann-Whitney U tests, and while we are interested in film style I don’t think we’re *that* interested. It is here that the consistency of Cliff’s *d* and HLΔ with the median shot length is valuable. It is quick and easy to perform even a very large number of pairwise comparisons of median shot lengths simply by copying formulas across a range of cells in an Excel spreadsheet, for example. We can use the median shot length in the place of the dominance statistics thereby greatly speeding up the analytical process while allowing us to remain secure in our interpretation of the data. We cannot use the mean shot length in the same way since this method is not consistent with any of the others.

### Conclusion

Based on the above discussion we can arrive at the following conclusions:

- The claim that it does not matter which statistic of film style we use, since using either the mean or the median will lead to the same interpretation, is clearly not true: the choice of statistic will affect the size of any effect. In turn, this will have a direct impact on our conclusions about the nature of film style.
- We can analyse the style of films using dominance statistics, such as Cliff’s *d* and HLΔ, that do not require any average shot length. The meaning and interpretation of these statistics may correspond more closely to questions we wish to ask of film style than using average shot lengths (though we still need descriptive statistics and graphs to provide information about the shot length distribution).
- It may not be practical to use dominance statistics for comparing large samples of films due to the very large number of pairwise comparisons required. Mike Baxter indicated that an average shot length could be thought of as a ‘proxy statistic’ of film style, and the median shot length can certainly be used in this sense by virtue of its consistency with Cliff’s *d* and HLΔ.
- The mean shot length is not robust in the presence of outliers and leads to fundamentally flawed interpretations of film style. It is not consistent with either Cliff’s *d* or HLΔ, and cannot be used to answer the question ‘do the shots in film A tend to be longer than the shots in film B?’
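The last point can be illustrated with a small sketch in Python. The two sets of shot lengths are invented for the purpose and are not taken from any film: film B is given a single very long take so that the effect of an outlier on each statistic is easy to see.

```python
from itertools import product
from statistics import mean, median


def cliffs_d(x, y):
    """Cliff's d: proportion of pairs with x > y minus proportion with x < y."""
    greater = sum(1 for a, b in product(x, y) if a > b)
    less = sum(1 for a, b in product(x, y) if a < b)
    return (greater - less) / (len(x) * len(y))


# Invented shot lengths in seconds; film B contains one very long take.
film_a = [2.0, 3.0, 3.0, 4.0, 5.0]
film_b = [1.0, 2.0, 2.0, 3.0, 60.0]

print("mean A - mean B:    ", mean(film_a) - mean(film_b))
print("median A - median B:", median(film_a) - median(film_b))
print("Cliff's d (A vs B): ", cliffs_d(film_a, film_b))
```

With these numbers the difference in means is negative, suggesting that film B has the longer shots, while the difference in medians and Cliff’s *d* are both positive: in most pairings a shot from film A is longer than a shot from film B. The single long take is enough to reverse the conclusion drawn from the mean.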

### References

**Cliff N** 1993 Dominance statistics: ordinal analyses to answer ordinal questions, *Psychological Bulletin* 114 (3): 494-509.

**Cliff N** 1996 *Ordinal Methods for Behavioural Data Analysis*. Mahwah, NJ: Lawrence Erlbaum Associates Inc.

**Delaney HD and Vargha A** 2002 Comparing several robust tests of stochastic equality with ordinally scaled variables and small to moderate sample sizes, *Psychological Methods* 7 (4): 485-503.

## The Cinemetrics Conversation I

Over the past few months Yuri Tsivian at the Cinemetrics Database has been organizing myself and various other people interested in statistics into producing some short (and some long) pieces on this topic. (No mean task on his part I think you’ll agree). From this week they have started to appear on the Cinemetrics website, and you can access them here. I reproduce Yuri’s introduction to the area below so you can get an inkling of what has been going on.

This conversation brings together statistical scientists and scholars that study film. What gathers us are two things. First, we are driven by mutual curiosity about cinemetrics as a field. What can numbers tell us about films and how do films fit in with what we know about numbers? Another thing we hope to find out has to do with Cinemetrics as a site. What variables should Cinemetrics make available to its users and which statistical tools need to be added to Cinemetrics labs? We plan to tackle these questions in a series of notes posted here starting from now through spring 2013.

Let me start off by introducing the team. My name is Yuri Tsivian, I study film, teach it at the University of Chicago and, in tandem with computer scientist Gunars Civjans, run the site that hosts this conversation. Beside me are two film scholars, Barry Salt of London Film School who pioneered the discipline of film statistics in 1974 and whose personal database and multiple essays are found elsewhere on this website, and Nick Redfern whose own website features over 50 cinemetrics studies and reflections. On the other side are two academic statisticians, Mike Baxter of Nottingham Trent University who has been publishing in statistical archaeology and quantitative geography since late 1970s and whose more recent interest in film statistics resulted in 3 essays on the subject, and Vanja Dukic of the University of Colorado at Boulder who happened to be around when Cinemetrics was born in 2005 and to whose expertise this site owes its first statistical steps.

The way I would like this conversation to evolve is round by round. To give it a sense (or semblance) of direction I will start each round by posing a question about this or that aspect of statistical film studies which our four experts might use as a starting point. Here is an approximate plot which is quite likely to change as new questions arise in the course of the conversation. My first question (of which more later) is about the role of ASLs, medians and outliers. This subject may well lead us to questions about log-normality tests which will ring in the second round. We may go on from there to the 3rd question which would relate to whether parametric or non-parametric statistics works better for films. The 4th question might be about autocorrelation or other possible methods to establish cases in which shots tend to cluster, and if there is periodicity to this. We may then want to discuss the uses of descriptive, inferential and experimental statistics in film studies; I would also be interested in learning more about best ways to establish possible correlations between different variables of film style. We might then go on to the question of how to visualize data, for instance, whether old good bar plots work well enough to represent the shot scale profile of a motion picture. Again, all this is just a scheme which we may either flesh out or send the way of all flesh.

So head on over there to find out what is going on. There will soon be comments boxes appended to the essays so you can join in the process.

## Statistical Resources: How to Read a Paper

Statistics abound and you need to be able to understand them – even in film studies (see here and here). There are many textbooks that will tell you how to *do* statistics, but far less attention is paid to being able to understand statistics as a consumer and – somewhat bafflingly – as a user of statistical methods. There are many good statistical textbooks, but understanding the use of statistics in research rarely features in them. The result is that learning statistics is a lot like being taught how to write before you have been taught how to read. It would be much easier to do things the other way round.

Fortunately, there is a series of articles by Trisha Greenhalgh under the heading ‘How to Read a Paper’ published in the *British Medical Journal* in 1997 that do precisely this. Even better, they are freely available through Pubmed. If you are thinking of using statistics in research in film studies or if you come across statistics in the research you are reading then it would definitely help to have read these first.

The papers can be accessed at the links below:

**Greenhalgh T** 1997a Getting your bearings (deciding what the paper is about), *British Medical Journal* 315 (7102): 243-246.

**Greenhalgh T** 1997b Assessing the methodological quality of published papers, *British Medical Journal* 315 (7103): 305-308.

**Greenhalgh T** 1997c Statistics for the non-statistician: different types of data need different statistical tests, *British Medical Journal* 315 (7104): 364-366.

**Greenhalgh T** 1997d Statistics for the non-statistician II: “significant” relations and their pitfalls, *British Medical Journal* 315 (7105): 422-425.

It may be necessary to scroll down through the pdf to find the relevant section.

Although these articles are aimed at doctors dealing with medical research, the basic principles apply in all areas, and they are a good place to start if you want to be able to understand the use of statistics in research. In film studies, being able to read the paper will obviously be an advantage.

## Using box plots to analyse film style

Numerical descriptions of film style are valuable but it is often simpler and more informative to use graphical representations of shot length data to aid us in analysing film style. Following on from earlier posts on using kernel densities (here) and cumulative distribution functions (here) this post rounds out this short series by looking at box plots and vioplots. Potter (2006) provides a detailed survey of the methodology of constructing and interpreting box plots and a discussion of extensions and alternatives.

Box plots are an excellent method for conveying a large amount of information about a data set quickly and clearly, and do not require any prior assumptions about the distribution of the data. Analysing the box plots of shot lengths in motion pictures, we compare the centre and variation of the data, and identify the skew and the presence of outliers. They are also an efficient method of comparing multiple data sets, and placing the box plots for two or more films side-by-side allows us to compare the centre and variation of shot length distributions directly and intuitively.

The box plot provides a graphical representation of the five-number summary, which includes the minimum value, the lower quartile, the median, the upper quartile, and the maximum value of a data set. The core of the data is defined by the box, which covers the distance between the lower and upper quartiles (i.e. the IQR), and the horizontal line within the box represents the median value of the data. The inner fences are marked by error bars extending from the box, and data points beyond these limits are classed as outliers. An outlier is defined as greater than Q3 + (IQR × 1.5) and an extreme outlier as greater than Q3 + (IQR × 3). Typically, there are no outliers at the low-end of a shot length distribution, and the error bar descends to the value of the shortest shot in a film.
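The quantities that go into a box plot can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the shot lengths are invented, the function name is hypothetical, and the quartiles use the "inclusive" method of the standard library's `statistics.quantiles`, which is only one of several conventions and may differ slightly from what a given statistics package reports.

```python
from statistics import quantiles


def box_plot_summary(shots):
    """Five-number summary, fences, and outliers for a list of shot lengths."""
    data = sorted(shots)
    # Quartile convention is an assumption; other methods give other values.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    extreme_fence = q3 + 3.0 * iqr
    return {
        "min": data[0], "q1": q1, "median": med, "q3": q3, "max": data[-1],
        "iqr": iqr,
        # Error bars stop at the most extreme values inside the fences.
        "lower_whisker": min(v for v in data if v >= lower_fence),
        "upper_whisker": max(v for v in data if v <= upper_fence),
        "outliers": [v for v in data if v > upper_fence],
        "extreme_outliers": [v for v in data if v > extreme_fence],
    }


# Invented shot lengths in seconds (assumption, not data from any film).
shots = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 12.0, 30.0]
summary = box_plot_summary(shots)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For these invented values the upper fence falls at 9.5 seconds, so the 12-second and 30-second shots are plotted as outliers, the 30-second shot also counts as extreme, and the lower error bar simply descends to the shortest shot.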

To illustrate, Table 1 presents the descriptive statistics for the three main ITV news bulletins broadcast on 10 August 2011. There is nothing wrong with this information, and we can see immediately that these bulletins have similar styles. They have similar medians indicating they are cut equally quickly, whilst the lunchtime bulletin has slightly more variation of the middle 50 per cent than the other two bulletins. We can also see that the distributions of shot lengths in these bulletins are asymmetric and that the maximum values are much longer than other shots. However, we cannot tell if these maximums are isolated outliers or if there are a large number of such values.

**Table 1** Descriptive statistics of ITV news bulletins broadcast on 10 August 2011

Figure 1 presents the box plots of these bulletins, and gives us some of the detail we are looking for. We can see the same information we get in Table 1, but it is easier to make the comparisons across a single scale than to try to imagine the distribution from a set of numbers. We can also see that these bulletins share some other features – the error bars extend a similar distance from the upper quartile with shots in this range (10-18 seconds) associated with short interviews with members of the public, while the clusters of outliers that can be seen for each bulletin in the range 18-30 seconds are associated with the news kernel that begins each item and longer interviews as part of a news report. Longer takes occur when a reporter is speaking directly to camera, typically as part of a two-way interview. We can therefore see that similar events in the discourse structure of these news bulletins occupy a similar amount of screen time within the same bulletin and across the bulletins broadcast on the same day. You cannot tell that from the five-number summary. This is a crucial advantage of using graphical methods alongside numerical summaries – they can be used *analytically* as well as descriptively. You can learn more from Figure 1 than you can from Table 1, though it would be best to include both in a piece of research since knowing the actual values of the descriptive statistics is useful to the reader.

**Figure 1** Box plots of three ITV news bulletins

By using a box plot we can see some of the structure of the data obscured by the five-number summaries. However, one of the problems with box plots is that they flatten out the detail of the distribution in the box and between the box and the ends of the error bars. This can be remedied by combining box plots with a kernel density to produce a *vioplot*. This has the advantage of making all the information available from these two types of plots in a single figure. Figure 2 presents the vioplots of these bulletins.

**Figure 2** Vioplots of three ITV news bulletins

From Figure 2 we can see all the detail from the box plots AND we can see the density of shot lengths in those areas where the box plot provides no detail. For example, the similarities in the 10-18 second range are more apparent in Figure 2 than Figure 1. For an alternative way of combining box plots and kernel densities to describe these data sets see here.
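The density half of a vioplot is a kernel density estimate evaluated along the shot length axis. The sketch below shows the idea using only the Python standard library. It is an illustration rather than the method used to draw Figure 2: the shot lengths are invented, the Gaussian kernel and the bandwidth of one second are assumptions, and a plotting library would normally choose the bandwidth and draw the outline for you.

```python
from math import exp, pi, sqrt


def gaussian_kde(data, bandwidth):
    """Return a function giving the Gaussian kernel density estimate at x."""
    scale = 1.0 / (len(data) * bandwidth * sqrt(2.0 * pi))

    def density(x):
        return scale * sum(exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in data)

    return density


# Invented shot lengths in seconds; the bandwidth is an arbitrary choice.
shots = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 12.0, 30.0]
density = gaussian_kde(shots, bandwidth=1.0)

# Crude text rendering: one row per 2-second step, bar length ~ density.
for x in range(0, 32, 2):
    print(f"{x:>3}s " + "#" * round(density(x) * 100))
```

The rows where the bars are longest correspond to the widest part of the violin. Note that a Gaussian kernel places a little density below zero seconds, which is one reason to treat this as a sketch rather than a finished method for shot length data.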

It has become increasingly common for film scholars to cite average shot lengths, but this information is rarely useful to the reader. It is usually the wrong average, is unaccompanied by a measure of dispersion, and simply does not provide enough information for anyone to make a sensible judgement about the nature of a film’s style. If you do want to use statistics to make a point about film style then please include kernel densities, cumulative distribution functions, or box/vioplots so that we can see what you are talking about. This should be standard practice in research and publishing in film studies.

### References

**Potter K** 2006 Methods for presenting statistical information: the box plot, in H Hagen, A Kerren, and P Dannenmann (eds.) *Visualization of Large and Unstructured Data: Lecture Notes in Informatics GI-Edition* S-4: 97–106.