Some notes on cinemetrics
Over the past couple of weeks various issues have been raised regarding my posts on cinemetrics. Hopefully, today’s post will go some way to addressing the points that have been raised, whilst also bringing to your attention some things you may not previously have considered.
Statistics and the internet
I don’t use expensive software such as SAS or SPSS because there is no need – between MS Excel and open source statistical software you can do pretty much any statistical analysis you want to.
(If you do have access to expensive statistical software, then don’t feel bad – make the most of it. We don’t accept arguments from authority, and your analysis won’t carry more weight just because you’re the underdog).
Learning about statistics
Before using whatever software you have to hand, it is best to understand something about statistics. There are lots of good introductory books on statistics that you can find in any half-decent bookshop or library, and Google books will give you a chance to browse these before you commit your hard-earned cash to a purchase (or just use Google books and keep your money for something more exciting). Don’t forget that many non-statisticians have to learn something about means and medians (sociologists, doctors, engineers, etc.), and as a result there are lots of introductory texts aimed at non-specialists that are often good places to start.
Perhaps the best resource available freely on the internet is Gerard E. Dallal’s The Little Handbook of Statistical Practice, which provides a comprehensive introduction to statistics while at the same time being clear and simple to understand. It’s aimed at bioscience researchers and so all the examples are drawn from this discipline, but they are easy to grasp quickly.
Most universities have some sort of introductory materials for their students online, and you can access most of these for free without any problem. Good ones include Glasgow – which has a glossary of statistical terms; Leicester – which goes through examples of how to do statistical tests and is aimed at biologists; and Vassar – which has an introduction and lots of free online calculators you can use.
NIST also provides a very comprehensive introduction to statistics. This is a bit more technical than the others I have listed above, and it does assume that you have access to some reasonably powerful software – but it does cover just about everything. And it’s free!
Finally, don’t forget that if you want to know about a particular topic in statistics you can just search using Google. If you want to know about Normal Probability Plots then search for it, and you will find a host of websites devoted to this subject.
Statistics is not difficult – but you should understand it before using it.
I mentioned above that Vassar’s stats pages have online calculators, and there are many such calculators on the internet that you can use for free.
Index of online calculators: this site has calculators for descriptive statistics, the 2-sample Kolmogorov-Smirnov test, chi-square, Fisher’s Exact Test, and ANOVA. It is very easy to use, and there are explanations of how each test works and how to interpret the results. You can also use it to draw graphs (although once drawn you can’t do much with them, as they come out as image or PDF files).
GraphPad: GraphPad produce statistical software that is very easy to use and you can download demos and use those (they are cheaper than most stats software, but still pricey). They also have an online introduction to statistics, and a range of stats calculators you can use for free.
Statspages: an index of online calculators, although some of the links don’t always work. Handy if you’re looking for a particular test.
Daniel Soper Statistical Calculators: an index of 45 calculators for computing a whole range of things, though you will probably need to know what you’re doing before you try them. Once you do, very easy to use and tastefully presented in black and green.
SOCR: this is a website with a whole range of statistical tests, but I find it fussy and irritating. It also seems to take a long time to load. I avoid it, but you might find it useful. Again, it’s free.
A very useful resource is Free Statistics, which will direct you to websites that will teach you about stats, to sources of data, and to free software that you can download. Some of this software requires a reasonably sophisticated knowledge of programming, whilst others are very simple, but the range is impressive and there is something for everyone.
I use PAST, which is a free statistical software package aimed at palaeontologists and provides a wide range of tests. Obviously some of these are not needed for Cinemetrics (you won’t need to study cladistics if you’re doing film studies), but it is incredibly easy to use, and there is an online manual that explains everything. The only downside is that it is a pain to enter data into PAST, and so I usually enter data into Excel and then paste it into PAST before running the analysis.
The best thing to do is to get some data and a little understanding and then play with the software. The best way to learn is to try.
Getting data out of cinemetrics
Of course, if you want to use any of the above you need to have some data. The Cinemetrics database has lots, but you need to get it into some useful form before you can do anything with it. Here is a simple process for getting the data from those graphs into your software.
Cinemetrics has two parts: the data and the software. When you change the look of the graph on a film’s page to view the cutting swing, the two parts work in unison. You may have noticed that a lot of red text appears when you change the graph and then disappears. That is the data, and if you can separate it from the software that draws the graph you can see it all. To do this, save the page to a directory on your computer and then reopen it when you are not connected to the internet (if you’re on a network you will probably have to set your browser to ‘Work Offline,’ which is usually under the File menu – the precise details will depend on how your computer is set up). Tell the page to redraw the graph (set the height to 300 and click Redraw). You will see the red data text appear, but because the page cannot connect to the software it needs to draw the graph, it gets stuck and the data stays on the screen. Open your spreadsheet software and make its window small enough that you can see the data from the webpage on the same screen (see Figure 1). You will find that with only a little practice you can enter the shot lengths very quickly (or you can save what you’re doing and go and have a rest if there are lots of shots). You now have the data in a form that is easier to manipulate.
Figure 1 Entering Cinemetrics data for The Lady Lies (1929) into MS Excel
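Typing the shot lengths in by hand works fine, but if you are comfortable with a little scripting you can automate the step. The sketch below is in Python – my own choice of tool, not one used in this post – and it assumes, purely for illustration, that you have copied the red data text from the saved webpage into a plain text file called shots.txt; the file name and layout are hypothetical.

```python
import re

def read_shot_lengths(path):
    """Pull every number out of a saved text file of Cinemetrics data.

    Assumes (hypothetically) that the red data text has been copied
    from the saved webpage into a plain text file.
    """
    with open(path) as f:
        text = f.read()
    # Match integers and decimals, e.g. shot lengths in seconds
    return [float(x) for x in re.findall(r"\d+(?:\.\d+)?", text)]

# Example: lengths = read_shot_lengths("shots.txt")
```

From here the list of shot lengths can be pasted into a spreadsheet or fed straight into a statistics package.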
I use Excel 2007 as my main software package. All versions of Excel have a good range of statistical tests, although they are easier to use in the latest versions. The thing to remember about Excel is that it is simple: on the one hand it is easy to use; on the other, it is not very intelligent and won’t necessarily do things in the easiest way possible. For example, most people don’t know that Excel comes with a statistical analysis package built in – why would they? Excel doesn’t tell them. To install it you need to load the Analysis ToolPak from the Add-ins menu (where this is will depend on the version of Excel you are using, but search the help file for ‘add-ins’ and you’ll find it). Once installed, the ToolPak gives you an automated means of accessing the statistical commands that you normally have to type into the spreadsheet. (Actually using the commands in the spreadsheet is often quicker, easier, and gives you a little more control, but this depends on how comfortable you are with the software and with statistics. To find out which statistical functions are built into Excel, search the help file for ‘statistical functions.’) The Analysis ToolPak gives you access to descriptive statistics, ANOVA, z-tests, t-tests, F-tests, correlation, regression, and many other functions, but they are all parametric tests (they depend on the parameters of a distribution and make certain assumptions about the nature of that distribution). If you want nonparametric tests (which make fewer assumptions and don’t rely on parameters), you can find these in PAST. Alternatively, you can set up your own spreadsheet to do things like the Mann-Whitney U test or Kruskal-Wallis ANOVA once you’ve grasped how these tests work and the idiotic way Excel sometimes requires you to do things (see Figure 2).
Figure 2 My spreadsheet for Kruskal-Wallis ANOVA of the films of Terence Davies
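For comparison, the same Kruskal-Wallis test takes only a couple of lines in free software outside a spreadsheet. This sketch uses Python with scipy – an assumption on my part, not a tool discussed above – and the shot lengths are invented for illustration, not data from Davies’s films:

```python
from scipy import stats

# Hypothetical shot lengths (seconds) from three films -- illustrative only
film_a = [3.2, 5.1, 2.8, 7.4, 4.0, 6.3, 3.9]
film_b = [8.5, 12.1, 9.7, 15.2, 11.0, 10.4]
film_c = [4.1, 3.8, 5.0, 4.6, 3.5, 4.9, 5.2]

# Kruskal-Wallis ANOVA: do the samples come from the same distribution?
h, p = stats.kruskal(film_a, film_b, film_c)
print(f"H = {h:.3f}, p = {p:.4f}")
if p < 0.05:
    print("At least one film's shot-length distribution differs")
```

The H statistic is referred to a chi-square distribution; a small p-value suggests that at least one film’s shot-length distribution differs from the others.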
Normal Probability Plots
Above I mentioned two types of statistical tests: parametric and nonparametric. It is important to choose the right test to get the most out of your data, and picking the wrong approach may lead you to the wrong conclusion. What distinguishes these two types of tests are the assumptions you can make about the data:
- Parametric tests assume that the data is distributed according to an underlying probability distribution (of which there are several, but I’ll only mention a couple here), that samples have equal variances and/or independent observations, that the data is at least interval or ratio data, and so on. The precise assumptions needed will depend on which test you are using. If the assumptions about the data hold, then parametric tests are more powerful than nonparametric tests.
- Nonparametric tests require fewer assumptions about the nature of the data and do not depend on an underlying probability distribution. They are often referred to as ‘distribution free.’ There is usually a nonparametric equivalent that can be used when a parametric test is inappropriate (for example, the Mann-Whitney U test is the nonparametric equivalent of the t-test for independent samples).
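To make that last pairing concrete, here is a sketch running both tests on the same two samples (Python with scipy – my choice of tool, not one used in the post; the data is invented):

```python
from scipy import stats

# Two hypothetical independent samples of shot lengths (seconds)
sample_1 = [2.1, 3.4, 2.8, 5.0, 3.1, 4.2, 2.6, 3.8]
sample_2 = [4.9, 6.2, 5.5, 7.1, 6.8, 5.0, 6.0, 7.5]

# Parametric: independent-samples t-test (assumes normal data)
t, p_t = stats.ttest_ind(sample_1, sample_2)

# Nonparametric equivalent: Mann-Whitney U (no distributional assumption)
u, p_u = stats.mannwhitneyu(sample_1, sample_2, alternative="two-sided")

print(f"t-test p = {p_t:.4f}; Mann-Whitney p = {p_u:.4f}")
```

On heavily skewed data the two tests can disagree, which is exactly why the choice between them matters.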
Typically, the distribution of shot lengths in a motion picture is positively skewed with a number of outlying data points: as such, it does not follow a normal distribution. HOWEVER, we could still use parametric tests if the data is normally distributed after a transformation has been applied. Usually, such a transformation involves using logarithms. Once the data has been transformed to its logarithm, we can then run tests to see if the data now follows a normal distribution: if it does, then we say that the data is lognormally distributed. (A random variable is lognormally distributed if its logarithm is normally distributed).
How, then, do we test data to see if it comes from an underlying normal distribution? Well, there are several tests that can be applied: the 1-sample Kolmogorov-Smirnov test, Shapiro-Wilk*, Cramér-von Mises, Anderson-Darling, Pearson’s chi-square*, Jarque-Bera*, and Lilliefors tests can all be used. (Tests marked * can be found in the PAST software I mentioned above). These tests can be used in varying circumstances – it depends on what you are trying to do.
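Several of these tests are also available in free software beyond PAST. As a sketch – again in Python with scipy, my own choice, and with invented shot lengths – here is the Shapiro-Wilk test applied to the raw data and to its common logarithm:

```python
import math
from scipy import stats

# Hypothetical positively skewed shot lengths (seconds)
shots = [1.2, 1.8, 2.0, 2.5, 3.1, 3.3, 4.0, 4.8, 5.5, 6.9,
         8.2, 9.7, 12.4, 15.8, 22.3, 31.0]

logs = [math.log10(s) for s in shots]

# Shapiro-Wilk: the null hypothesis is that the data is normal
w_raw, p_raw = stats.shapiro(shots)
w_log, p_log = stats.shapiro(logs)
print(f"raw: W = {w_raw:.4f}, p = {p_raw:.4f}")
print(f"log: W = {w_log:.4f}, p = {p_log:.4f}")
```

Because the null hypothesis is normality, a large p-value for the logged data is consistent with the shot lengths being lognormally distributed.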
A simple method which provides both a visual and a numerical measure of normality is to use normal probability plots and the probability plot correlation coefficient (PPCC), which I have described elsewhere. (Both PAST and Excel will produce normal probability plots, and PAST also calculates the PPCC). By comparing the observed PPCC of your data with the critical value of the PPCC for your chosen significance level and sample size, you can see whether the data is normally distributed: if the observed PPCC is greater than the critical value, then the data is normally distributed; if it is less than the critical value, then it is not. For example, The Immigrant (1917) has 159 shots (n = 159): for a sample of this size the critical value of the PPCC is 0.9923. For the untransformed data the observed value of the PPCC is 0.8420 – clearly not normally distributed; and for the data transformed to its common logarithm (log10), the PPCC is 0.9715 – so not lognormally distributed either.
Now, 0.9715 is not much less than 0.9923, so maybe there is not a big difference. BUT, in statistics numbers are never just numbers – they have meaning within a specific context. In interpreting the result of the PPCC we need to remember that it comes from a distribution of critical values that is ASYMPTOTIC – that is, it approaches a limit (in this case 1.0) as the sample size grows. It will never actually reach 1, because you can always have a bigger sample, and so the value of the PPCC will get ever closer to 1, requiring ever more decimal places. Look at Figure 3: this graph plots the critical values (the solid lines) and the observed values (the dotted lines) of the PPCC for His New Job (1915) and Verboten! (1952) using log-transformed data. For His New Job, the observed value is greater than the critical value (the dotted line is to the right of the solid line) and so this data is lognormally distributed. For Verboten!, the opposite is true, and the data is not lognormally distributed. The sample size (number of shots) is 502 with a critical value of ~0.9971, but the observed value is only 0.9575, which equates to the critical value for a sample size of only 25. There is only a small numerical difference between 0.9971 and 0.9575, but looking at Figure 3 we can see that this is actually quite a large difference in the context of the asymptotic distribution of the PPCC.
Figure 3 The distribution of the probability plot correlation coefficient for sample sizes n = 5 to n = 1000.
What, then, does this mean for The Immigrant? The observed value of the PPCC corresponds to a sample size of ~41, which is nearly four times smaller than the sample size used (n = 159) – a much larger difference than the numbers (if taken at their face value) would appear to imply.
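If you want to compute the PPCC for your own data, free tools will do it for you. The sketch below uses Python’s scipy (my own choice, not a package used above), whose probability-plot routine returns the PPCC directly as the correlation r; the shot lengths are invented:

```python
import math
from scipy import stats

# Hypothetical shot lengths (seconds)
shots = [1.5, 2.2, 2.9, 3.6, 4.4, 5.3, 6.5, 8.0, 10.1, 13.0, 17.5, 25.0]
logs = [math.log10(s) for s in shots]

# probplot returns ((theoretical quantiles, ordered data),
#                   (slope, intercept, r)); r is the PPCC
(_, _), (_, _, r_raw) = stats.probplot(shots, dist="norm")
(_, _), (_, _, r_log) = stats.probplot(logs, dist="norm")

print(f"PPCC (raw) = {r_raw:.4f}")
print(f"PPCC (log) = {r_log:.4f}")
```

scipy does not supply the PPCC critical values, so you still need to look the critical value up for your sample size (NIST publishes a table) before drawing a conclusion.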
Why is this relevant? In statistics we are estimating outcomes – we rarely know the complete data for any situation, and if we are using the Cinemetrics tool then some error in the data will always be present (you can only press that space bar so quickly in response to observing a cut). If we rely on parametric statistics for assessing shot length distributions when we know the data is not normally or lognormally distributed, then we run the risk of saying that there is a difference between two data sets (i.e. the shot lengths of two films) when in fact there isn’t (a Type I error – a false positive), or saying that no difference exists when in fact it does (a Type II error – a false negative). Using nonparametric tests is a way around this problem – but it will not eliminate the possibility of making an error completely.
I have looked at the PPCC for normal and lognormal distributions of 40 films from the Cinemetrics database, and, while these films cannot be considered a representative sample of the database, half (20) are not lognormally distributed. Some miss their critical value by only a small margin, but others miss by quite some distance: Man with a Movie Camera has a lognormal PPCC of 0.9639, which would be the critical value for a sample of size 30 – but the film has 1729 shots! Of the six reels of this film, only reel 5 is lognormally distributed. Verboten! is worth a mention here – for most films the PPCC test for untransformed data usually produces a value between 0.7000 and 0.9000, while for this film it is only 0.5495, and this deserves closer attention.
Of course, all this assumes that you want to use frequentist statistics. You could adopt a Bayesian approach …