Bar chart or histogram?
Although it is becoming more and more common for film scholars to cite statistics of film style in their research, it is evident that there is very little actual understanding of statistics, and there is a pressing need for a good statistics textbook aimed at those working in film studies. The only attempt to provide some sort of instruction in statistical methodology has been undertaken by Warren Buckland in Studying Contemporary American Film: A Guide to Movie Analysis, which he co-authored with Thomas Elsaesser (Elsaesser and Buckland 2002 – to access this chapter freely see here). Unfortunately, Buckland gets the most basic elements of statistics wrong (see here for a demonstration of how wrong), and this week I am going to focus on two aspects: the importance of distinguishing between qualitative and quantitative variables, and the difference between bar charts and histograms.
In this chapter, Buckland discusses the frequency of different shot scales in Jurassic Park, and produces the graph in Figure 1.
Figure 1 The frequency of shot scales in Jurassic Park (Source: www.cinemetrics.lv/buckland.php) (BCU = big close-up, CU = close-up, MCU = medium close-up, MS = medium shot, MLS = medium long shot, LS = long shot, VLS = very long shot).
This image is accompanied by the following text:
Finally, in terms of shot scale, the distribution [of shot scales] confirms (sic) to what statisticians call a ‘normal distribution’, with high values in the middle (the mean) and progressively lower values on either side (…). The result of these normal distributions is that the standard deviation and skewness values are low.
This statement is simply incorrect – this distribution does not conform to a normal distribution, and no statistician in the world would say so. This problem arises because Buckland does not know how to distinguish between different types of variable and does not know the difference between a bar chart and a histogram. Linking the two terms together, Buckland treats them as synonyms:
the histograms, or bar charts, representing the number of each shot type in each film (the number of close-ups, long shots, etc.)
Actually, this error originates in Salt (1992: 142), which Buckland has simply copied.
This is wrong because bar charts and histograms are two different types of graph that represent different types of variables providing different types of information, and it is important to know the difference.
Types of variable
A variable is simply a measured attribute of interest that varies over time or subjects. The actual value of a variable is its data value, and the set of data values from each subject in the sample is the data we are going to analyse. So shot length is a variable of film style, the length of a shot is its data value, and the set of all the shot lengths is the data set.
It is important to know what type of variable we are dealing with, because this will determine the type of statistical analysis we will be able to apply. A variable may be qualitative or quantitative, and may be measured at one of four levels of data – nominal, ordinal, interval, or ratio. (Variables may also be discrete or continuous but I will not address this here).
Qualitative variables have values that are non-numeric and descriptive.
Qualitative variables may be either nominal – data can be sorted into mutually exclusive categories that characterise an element of a subject but to which no order may be assigned; or ordinal – in which the categories used have a logical ordering.
Camera movement is an example of a qualitative nominal variable in the cinema: we can sort camera movements in a film into different categories (pan, track, tilt, etc), but we cannot assign an order to these categories. In order to make data analysis easier, we might choose to code the categories of camera movement using numbers (e.g. pan = 1, track = 2, tilt = 3, etc) but these codes have no mathematical meaning. We can define the mode for this type of data as simply the most frequently occurring value, but the median and the mean do not exist. It is nonsense to speak of an ‘average’ camera movement in the same way as we speak of an ‘average’ shot length.
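To make the point about numeric codes concrete, here is a minimal sketch in Python. The list of camera movements and both codings are invented for illustration and do not come from any real film. The mode is the same however the categories are coded, whereas the "mean" of the codes changes with the coding, which is why it carries no meaning.

```python
from collections import Counter

# Hypothetical camera-movement labels for ten shots (illustrative data only).
movements = ["pan", "track", "pan", "tilt", "pan",
             "track", "tilt", "pan", "track", "pan"]


def mode_of(labels):
    """Return the most frequent label(s); ties are all returned, sorted."""
    counts = Counter(labels)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)


def mean_of_codes(labels, coding):
    """Arithmetic mean of the numeric codes (computable, but meaningless)."""
    codes = [coding[label] for label in labels]
    return sum(codes) / len(codes)


# Two equally arbitrary numeric codings of the same three categories.
coding_a = {"pan": 1, "track": 2, "tilt": 3}
coding_b = {"pan": 3, "track": 1, "tilt": 2}

modal_movement = mode_of(movements)          # does not depend on the coding
mean_a = mean_of_codes(movements, coding_a)  # depends entirely on the coding
mean_b = mean_of_codes(movements, coding_b)

print(modal_movement, mean_a, mean_b)
```

The two means differ even though the underlying shots are identical, so neither can be read as an "average camera movement".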
Shot scale is an example of a qualitative ordinal variable of film style, but this ordering is very weak. When it comes to the statistical analysis of shot scales, we first need to assign each shot to a category whose values are non-numeric (BCU, CU, MCU, etc) – and therefore qualitative. We can also assign an order to these categories: a big close-up is ‘nearer’ to the object than a close-up, a close-up is nearer than a medium close-up, etc. However, the differences between the categories are not meaningful – a big close-up is closer than a close-up, but ‘closer than’ is not otherwise defined. The mode may be expressed for this type of variable; and it may be possible to define a median for qualitative ordinal data but it may not always be appropriate to do so. For example, if we asked a hundred people to rate Inception (2010) on a scale of 1 to 10, where 1 = ‘did not enjoy at all’ and 10 = ‘enjoyed enormously,’ we could meaningfully state that the median rating is 7/10. In contrast, it makes sense to speak of medium shots as the modal class of shot scales in Jurassic Park, but not to say that medium shots are the ‘median shot scale’ even though shot scales can be logically ordered.
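As a rough illustration of the median of ordinal ratings, the sketch below uses Python's standard `statistics` module on ten invented ratings (not real audience data). With an even number of ratings the ordinary median averages the two middle values, which can produce a number that is not on the rating scale at all; `median_low` and `median_high` always return a value that was actually observed, which seems the safer choice for ordinal data.

```python
from statistics import median, median_high, median_low

# Hypothetical ratings of a film on a 1-10 scale from ten viewers.
ratings = [2, 4, 5, 6, 6, 7, 8, 8, 9, 10]

# The ordinary median interpolates between the two middle ratings (6 and 7).
interpolated = median(ratings)

# These two always return one of the observed ratings.
low = median_low(ratings)
high = median_high(ratings)

print(interpolated, low, high)
```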
The mean does not exist for qualitative ordinal variables, which is why Buckland’s use of the term ‘normal distribution’ – which is parameterized by the mean and the standard deviation – to describe shot scales in Jurassic Park is meaningless. The mean and the standard deviation in these circumstances do not exist.
Statistical analysis of nominal variables involves looking at the frequency with which an event occurs (e.g., how many panning shots are there in a film?), and I outlined some methods for categorical data analysis of film style in my post on hypothesis tests of proportions for film style (here). If the data is ordinal, and it is appropriate to do so, nonparametric methods such as the Median test, the Mann-Whitney U test, the Kruskal-Wallis test, etc. may be employed in data analysis. Parametric statistical methods cannot be applied to qualitative variables.
If we want to represent the data collected about a qualitative variable then a bar chart is the simplest method to employ. Pie charts can also be used and are particularly effective when emphasising differences in frequency by their use of area (see here, for example), but they are less useful if your data has a lot of detail or you want to compare two different groups (for which you will need two different pie charts).
Remember, you cannot think about a qualitative variable in terms of a probability distribution. To illustrate this, look at the two bar charts in Figures 2a and 2b. These bar charts present the same information – the normalised frequency of different shot scales in The Birds (1963) – but have been arranged differently: Figure 2a shows us the shot scales arranged from nearest (BCU) to most distant (VLS), while Figure 2b shows us the scales arranged from most distant to nearest. If we thought of these figures in terms of a continuous random variable – as Buckland does with Jurassic Park above – should we conclude that the shot scales are positively skewed as indicated by Figure 2a or negatively skewed as in Figure 2b? The answer is neither because the question is meaningless: we cannot think about a qualitative variable represented by a bar chart in these terms. If your conclusion changes according to how you order the data, then the design of your experiment is very probably flawed.
Figure 2a Normalised frequency of shot scales in The Birds (1963) arranged from nearest to most distant (Source: www.cinemetrics.lv/satltdb.php)
Figure 2b Normalised frequency of shot scales in The Birds (1963) arranged from most distant to nearest (Source: www.cinemetrics.lv/satltdb.php)
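The dependence of ‘skewness’ on the ordering of the bars can be checked numerically. The sketch below uses invented shot-scale counts (they are not the figures for The Birds) and codes the bars 1, 2, 3, … from left to right, which is itself an arbitrary assumption. Reversing the order of the bars flips the sign of the coefficient while leaving the data untouched.

```python
scales = ["BCU", "CU", "MCU", "MS", "MLS", "LS", "VLS"]
counts = [40, 150, 120, 80, 50, 40, 20]  # hypothetical counts, nearest to most distant


def coded_skewness(bar_counts):
    """Moment coefficient of skewness when bars are coded 1, 2, 3, ... left to right."""
    n = sum(bar_counts)
    positions = range(1, len(bar_counts) + 1)
    mean = sum(p * c for p, c in zip(positions, bar_counts)) / n
    m2 = sum(c * (p - mean) ** 2 for p, c in zip(positions, bar_counts)) / n
    m3 = sum(c * (p - mean) ** 3 for p, c in zip(positions, bar_counts)) / n
    return m3 / m2 ** 1.5


near_to_far = coded_skewness(counts)        # bars ordered BCU ... VLS
far_to_near = coded_skewness(counts[::-1])  # same data, bars ordered VLS ... BCU

print(round(near_to_far, 3), round(far_to_near, 3))
```

The two values have the same magnitude and opposite signs, so the number describes the layout of the chart rather than the film.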
What can we say about shot scales in The Birds? Well, we can see that the most frequently occurring scale is the close-up, followed by the medium close-up, whereas more distant shot scales are much less frequent. We can therefore conclude that this film is characterised by shot scales that bring the viewer close to the action on-screen, particularly when Melanie Daniels is being attacked. Note that these conclusions do not depend on how the data is arranged in either Figure 2a or Figure 2b – they depend solely on the data themselves and the intrinsic order of the data. We could also represent this information as a proportion or a percentage if we so wished without changing the conclusions.
There are some simple rules for presenting bar charts:
- The gaps between the categories in a bar chart are important: they emphasise the fact that the categories used are mutually exclusive and do not form a continuum. Note that in Salt (1992: 143) the bars in the charts for shot scales do touch.
- Make sure the scale used is meaningful and clearly labelled, and does not mislead the reader in interpreting the chart by overemphasizing differences.
- Colour and shading can be useful, but can also be misleading and irritating.
- If the data do not have any logical ordering, arrange the categories by frequency, from largest to smallest, to make the differences easier to interpret. It may also be easier to rotate the chart to put the category labels on the vertical axis and the numerical values on the horizontal.
- Use bars to represent values clearly, rather than pictures of different sizes. They are easier to understand, and far less irritating.
- NEVER add a third dimension to the bars on your chart – the extra dimension adds no new information and is potentially misleading. To see how bad this can be, look at charts three and four in Charles O’Brien’s paper on Sous les toits de Paris (1931) here.
For some really good examples of how not to present data in bar charts and pie charts, see Gary Klass’s Just Plain Data Analysis website here. This website is especially useful as it also gives tips and examples on how to use Excel to draw charts.
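For readers who prefer a script to a spreadsheet, here is a minimal text-only sketch of several of the rules above: bars start at zero, categories without a logical order are sorted by frequency, and the labels sit on the vertical axis. The function name `text_bar_chart` and the camera-movement counts are hypothetical, chosen only for illustration.

```python
def text_bar_chart(counts, width=40, sort=True):
    """Return the lines of a horizontal bar chart drawn with '#' characters.

    Bar lengths are proportional to the counts and start at zero, so the
    chart does not exaggerate differences. Set sort=False for ordinal
    categories whose logical order should be preserved.
    """
    items = list(counts.items())
    if sort:
        items.sort(key=lambda kv: kv[1], reverse=True)
    largest = max(n for _, n in items)
    label_width = max(len(label) for label, _ in items)
    lines = []
    for label, n in items:
        bar = "#" * round(width * n / largest)
        lines.append(f"{label:<{label_width}} | {bar} {n}")
    return lines


# Hypothetical counts of camera movements in a film (nominal data).
camera_movements = {"tilt": 12, "pan": 48, "track": 30, "crane": 6}
chart = text_bar_chart(camera_movements)
print("\n".join(chart))
```

Each bar is printed on its own line, so the gaps between categories are built in.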
Quantitative variables have values that are numeric, and quantify an element of a population.
Quantitative variables may be measured at the interval or ratio levels. With interval data, the distances between data values are meaningful but there is no natural zero. With ratio data the distances between data values are meaningful, and there is a common origin. We can calculate the mode, the median, and the mean of interval and ratio data.
Quantitative variables have order so they can also be treated as ordinal variables, although this does lose some of the information from the data set. Nonparametric methods applied to quantitative variables may involve transforming the data into an ordinal form by ranking methods, but this is advantageous because it means that nonparametric methods may be applied when the requirements for parametric methods are either unknown or are not met.
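The ranking step that many nonparametric methods rely on can be sketched as follows. The shot lengths are invented for illustration, and tied values are given the average of the ranks they span, which is a common convention but not the only one.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


# Hypothetical shot lengths in seconds.
shot_lengths = [2.4, 10.1, 2.4, 47.6, 5.0, 0.5]
ranks = average_ranks(shot_lengths)
print(ranks)
```

Notice what is lost: the 47.6 second shot simply becomes rank 6, one step above the 10.1 second shot, so the ranks keep the order of the data but discard the distances between values.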
Shot length is a quantitative variable measured at the ratio level – the difference between a shot that is 2 seconds long and one that is 3 seconds long is the same as the difference between a shot that is 5 seconds long and one that is 6 seconds long; and a shot that is 4 seconds long is twice as long as one that is 2 seconds long.
Like bar charts, a histogram is produced by sorting data into categories (called bins). However, unlike a bar chart, the values on the x-axis form a continuum: the point at which one bin ends is the point at which the next bin begins. For this reason, neighbouring bars in a histogram must touch. In a bar chart, frequency is expressed as the height of the bar; whereas in a histogram it is expressed as the area of the bar.
A histogram is a simple nonparametric method of density estimation, and depends only on two choices: the origin x₀ (the point at which the first bin begins) and the width of the bins. From a histogram we can identify the shape of a distribution (uni-, bi-, or multimodal, symmetrical, skewed, leptokurtic, or platykurtic); the range of the data; and the presence of outliers.
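A minimal sketch of how a histogram is built from an origin x₀ and a bin width is given below. The function name, the ten shot lengths, and the particular values of `x0` and `width` are assumptions made for illustration. Each bar's height is a density, so that its area (height × width) equals the share of the data in that bin and the areas sum to one. Shifting the origin alone, with the same data and the same bin width, changes the counts.

```python
def histogram(values, x0, width):
    """Sort values into bins [x0 + k*width, x0 + (k+1)*width).

    Returns (left_edges, counts, densities), where density = count / (n * width),
    so that the area of each bar is the proportion of the data in that bin.
    """
    if any(v < x0 for v in values):
        raise ValueError("x0 must not exceed the smallest value")
    n_bins = int(max((v - x0) // width for v in values)) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - x0) // width)] += 1
    n = len(values)
    left_edges = [x0 + k * width for k in range(n_bins)]
    densities = [c / (n * width) for c in counts]
    return left_edges, counts, densities


# Hypothetical shot lengths in seconds.
shot_lengths = [0.5, 1.2, 1.9, 2.0, 2.7, 3.1, 3.3, 4.8, 5.5, 9.6]

edges, counts, densities = histogram(shot_lengths, x0=0.0, width=2.0)
total_area = sum(d * 2.0 for d in densities)

# Same data and bin width, but with the origin moved to -1.0 seconds.
_, shifted_counts, _ = histogram(shot_lengths, x0=-1.0, width=2.0)

print(edges, counts, total_area)
print(shifted_counts)
```

With the origin at zero, the bin from 6.0s to 8.0s is empty, which is a gap in the data rather than a gap between categories.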
Unlike a bar chart, where the gaps between the bars stress the absence of a continuum on the x-axis, the gaps in a histogram have a different meaning. Because the x-axis is a continuum, a gap in the data indicates that there were no data values in this bin. Figure 3 is a histogram of the distribution of shot lengths in Busy Bodies (1933), with x₀ = 0 seconds, and a bin width of 2 seconds. The values on the x-axis are the mid-points of the bins, so the first bin covers shots of length 0.0s to 2.0s, the next bin covers 2.0s to 4.0s, the next bin covers 4.0s to 6.0s, and so on.
Figure 3 Histogram of shot lengths in Busy Bodies (1933)
From Figure 3 we can see that (1) the distribution of shot lengths in Busy Bodies is unimodal and positively skewed; (2) the range of the data is from 0.0 seconds to 48.0 seconds; and (3) there are outliers in the upper tail of the distribution. We can see that short shots occur much more frequently than long shots. There are gaps in the distribution, indicating that there are some bins that contain no shots.
However, histograms do have some limitations:
- The shape of the histogram depends on the choice of x₀ and the bin-width, and making the wrong choice can lead to flawed interpretations. Too many bins and you cannot see the structure of the data properly due to the presence of too much information; too few bins and you cannot see the structure at all. There are various methods for choosing the ideal bin-width, but none is definitive.
- There is a lack of precision in describing the range: the actual range of this data is from 0.5s to 47.6s, but the histogram cannot give us this level of precision without using too many bins. Information is lost in the process of binning the data.
- You cannot put the shot length distributions of two films on to the same histogram, and so it becomes necessary to produce two histograms and compare them side by side. This is the same problem as comparing two pie charts side by side, and is equally undesirable.
The limitations of the histogram may be overcome by employing kernel density estimation. See here for an overview.
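To give a flavour of the idea, here is a minimal sketch of a kernel density estimate using a Gaussian kernel. The function name `kernel_density`, the shot lengths, and the bandwidth of 1.0 second are all assumptions made for illustration; choosing the bandwidth raises much the same questions as choosing a bin width. Note also that a plain Gaussian kernel places some density below zero seconds, which a more careful analysis of shot lengths would need to handle.

```python
import math


def kernel_density(values, bandwidth):
    """Return a function f(x): the average of Gaussian kernels centred on each value."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical shot lengths in seconds.
shot_lengths = [0.5, 1.2, 1.9, 2.0, 2.7, 3.1, 3.3, 4.8, 5.5, 9.6]
density = kernel_density(shot_lengths, bandwidth=1.0)

# Evaluate the curve on a grid from -10s to 30s and check that it integrates to about 1.
step = 0.1
grid = [i * step for i in range(-100, 300)]
area = sum(density(x) * step for x in grid)

print(round(density(2.0), 4), round(area, 4))
```

Because the result is a smooth curve with no origin to choose, the estimates for two films can be drawn on the same axes and compared directly.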
In Studying Contemporary American Film: A Guide to Movie Analysis, Elsaesser and Buckland have undertaken to demonstrate the methodologies of film analysis to students and to encourage them to apply them to films themselves. Unfortunately, the section on statistics is fundamentally flawed due to the authors’ lack of understanding of elementary statistics. It is not a textbook that students should be encouraged to read, as it will leave them with an erroneous understanding of statistical methodology. It also does not say much for the standard of research in film studies.
Students should be taught to properly identify the type of variable they are dealing with, because this will determine the statistical methods they subsequently employ. They should know the difference between qualitative and quantitative variables, and be able to identify which elements of film style are which. They should be able to distinguish between the different levels of data they will encounter. They should know the difference between a bar chart and a histogram, when it is appropriate to use either, and how to produce and interpret each. They should also know the specific statistical meaning of terms such as ‘mean,’ ‘standard deviation,’ ‘skew,’ and ‘normal,’ and when it is appropriate to use them.
Elsaesser T, and Buckland W 2002 Studying Contemporary American Film: A Guide to Movie Analysis. London: Arnold.
Salt B 1992 Film Style and Technology: History and Analysis. London: Starwood.
Posted on January 13, 2011.