# Getting data out of cinemetrics

The Cinemetrics database has a lot of shot length data that you can access freely, but it is not necessarily in the form in which you can get the most out of it. Although some summary statistics are given (mean, standard deviation, minimum, maximum) are given, we may want other statistics (e.g., the median and the median absolute deviation, the lower and upper quartiles, the geometric mean and multiplicative standard deviation) from the data. You may also want to produce some different graphical representations of shot length data (box plots, empirical distribution functions) than the polynomial graph that Cinemetrics automatically produces. In order to calculate these statistics we need to have the value of each shot length.

Various scholars have calculated mean shot lengths by counting the number of shots in a film, and then dividing the running time by this figure. This is useless because you only get the mean shot length (which is not a helpful statistic when shot length data is typically skewed), and you do not get any measure of the dispersion around the mean. It is also a lot of effort for very little reward, as there is very little you can do with this one figure. Having the length of each shot is always preferable, and reports of statistics of film style done using the count and divide method should be viewed with suspicion.

But never fear – the Cinemetrics database has the length of each shot within it. You just need to get at that data. I have described this procedure in an earlier post (here), but it will be useful to repeat it. First, you need get some data, and for the purposes of this demonstration we will use Charlie Chaplin’s *His New Job* (1915). (The data for this film at the Cinemetrics database is here). Next you need to disconnect the data in the web page from the software that draws the graph, and you can do this by working offline: in your browser, go to the File menu and click ‘Work offline’ (you may need to disconnect from the internet, and if you are on a network this may present its own problems). Now tell the web page to redraw the graph (e.g. set the height to ‘300 pix’ and click Redraw) – a lot of red text should appear, but will not be able to disappear (see Figure 1). This is all the shot length data for the film.

**Figure 1** Shot length data becomes visible for *His New Job* (1915) by separating the data from the Cinemetrics software

Now you can see your data, what should you do with it? Well you could type it into a spreadsheet, which isn’t too bad for His New Job as it only has a total of 175 shots. With practice and patience, you can quickly and accurately enter a lot of data in a spreadsheet. This is, however, a laborious and time-consuming process, and while you can save your work and come back to it later, it soon becomes tedious if a film has a lot of shots or if you are working with data fro a lot of films.

What is needed is a quicker method, and if you are using Microsoft Excel then the following procedure can be used. (If you are using a different spreadsheet, then what follows may still be easy to do but you will need to work out the precise details according to how the spreadsheet and its functions are set up). I’m using Excel 2007 here, but earlier versions will have the same functions that are used here.

The first step is to get the shot length data on to the screen as outlined above.

The next step is to select the red text, and copy and paste it into Excel. How you paste the data is important: right-click on the cell in which you want to paste the first line of the data and select Paste Special, and then in the dialogue box that opens select Text, and click OK. What you should be left with is a spreadsheet with all the text from the Cinemetrics web-page, in which every line of text occupies a single cell in the spreadsheet (see Figure 2).

**Figure 2** Text pasted into an Excel spreadsheet.

Now the shot length data is in our spreadsheet, but is not in a usable form due the presence of the text in the same cell. For example, cell B4 in Figure 2 contains “Shot Length: 4.8 sec.”, and we only want the numerical part of this cell (the “4.8”). Using the Replace function we can eliminate the text and be left with just the numbers. In Excel 2007, the Replace function is located on the far-right of the Home tab, or you can open the dialogue box by pressing CTRL+H. (Earlier versions of Excel will have a similar function located in one of the drop-down menus). Once you have opened the Replace function dialogue box, enter “Shot length: ” into the Find what box and leave the Replace with box empty (see Figure 3). Click Replace All.

**Figure 3** Using the Replace function to eliminate text and leave behind data

Repeat this process by entering ” sec.” into the Find what box, leave the Replace with box empty, and click Replace All. You should now be left with a spreadsheet in which the shot length data remains without any text. This makes it possible for Excel to recognise the numbers as numbers, and is much quicker than typing all the data out.

But we need to complete a final step before the shot length data is in its final form, and this relies upon using the VLOOKUP function in Excel. If we are trying to save time, we don’t want to waste effort deleting all the rows between the shot length values, and using VLOOKUP we don’t have to.

First, number all the cells from 1 to whatever so that the data for each shot length falls in a cell that is a multiple of 4 (assuming advanced data entry into the Cinemetrics database – for simple data entry the shot length data won’t land in every fourth cell and you will need to adjust the numbering accordingly, either way the same principle applies). See Figure 4 for an example. This can be done very quickly using Excel’s Autofill function.

**Figure 4** Numbering data cells

Now we need to tell Excel to look up the value of every fourth cell, using VLOOKUP (which means Vertical Look-up). (If you’re not sure how the VLOOKUP command works, check the helpfile in Excel). To do this, create a number list of multiples of four starting at 4, again using Excel’s Autofill function. This list needs to include every multiple of four that corresponds to a shot length in the data, and as there are 175 shots we need to go to 700 (i.e. 4 multiplied by 175). In the cell next to the number 4 in this new list type in the VLOOKUP command:

=VLOOKUP(E4, A:B, 2, False).

This tells Excel to find the value in cell E4 (in this case “4”) in the array comprised of columns A and B (where our numbered data is) and to return the value in the second column (i.e. column B) wherever there is an exact match (the “False” command). This returns the shot length of the first shot and is “4.8.” Drag to fill the formula in to next cell for each multiple of four, and what you end up with is the shot length data in an easy to use form (see Figure 5). (Actually, you need to select this data, Copy, and Paste Special and past the data into a new column as Values, otherwise it will copy and paste the formulas).

**Figure 5** Shot length data in a form ready to use

Once you’ve set up your spreadsheet with the VLOOKUP commands you will not have to repeat this step each time you load new data into Excel, but you will have to do the find and and replace steps each time.

Now you are ready to anlayse the shot length data!

It is a simple method, and once you’ve tried it you will soon find that it is very easy to get a lot of data very quickly, thereby speeding up your research.

Posted on January 28, 2010, in Cinemetrics, Film Analysis, Film Studies, Film Style and tagged Cinemetrics, Film Analysis, Film Studies, Film Style. Bookmark the permalink. 1 Comment.

No need to break in, the door is open. If you people can wait for a couple of days Gunars and I have decided to add a button “access raw data.” The intellectual policy of Cinemetrics is anti-sectarian, we don’t “support” one approach or method at the expense of any other. Neither do stake out a claim for data submitted by others or ourselves. Cinemetrics does not stick to a methodological doctrine. What it was made for is to fascilitate film students to test their hypotheses about films, filmmakers and film history.

Gunars will also add “Median” to the list of cinemetrics statistics, plus “other statistics,” with more maths Nick Redfern finds wanting. The argument Nick makes above (and in the “Measures of Central Tendency” essay) is strong enough to make one aware of the advatanage of more resisitant statistics than the mean on which ASL is based. Strong enough to add, but not strong enough to remove. To the extent I can follow Nick’s argument, radical as it is, there is no evidence that puts Average Shot Length out of court. The method you use, wise statisticians remind us, depends on the context. Indeed, many things really change depending on what films and filmmakers we look at. There will be lots of outliers when we look at Breathless, Citizen Cane or films by Ophuls, but befor we make any sweeping methodological conclusions based on this I would look at films by Ozu, for instance, in which as Matt Hauske has shown display less outlying cases. Let’s hold our horses for a while.

Another reason why one would not want to jettison ASL replacing it with Median is not to throw out the baby with the water. Despite the arithmetical mean being less resistent to outliers, ASL has been proven an eye-opening tool to explain many things on different levels — from the large-scale tendency of “editing in the twenty-first century” (a chapter in the brand new edition of Salt’s Film Style and Technology), from “the way hollywood tells it” (Bordwell’s book), from the editing vagaries at the time of films’ passage to sound disclosed in Charles O’Brien’s book — to close studies of this or that film by Vertov and Ozu. Yes, there may be better tools around, and we are looking for ones, because that’s what tools always do — they get better and better. But the one that we have been using so far has the one and only power film scholars need statistics for: the explanatory power and the power to demonstrate what has been hereto only intuitively clear. In this I disagree with Nick Refer’s mantra he has repeated on more that one occasion — that film statistics ougtht to be first of all good statistics. As someone who cared for movies more that I do for methods, I would modify this wisdom ading one word to remind that, for us, statistics is a means to an end. Film statistics ougtht to be good enough statistics. We need to be wary not to fetishize tools.

Another reason for preserving and restoring our faith in ASC is to preserve the continuity and to avoid rupture in films studies. Remember, the arithmetical average was the only tool available to film scholars before the advent of computers. A lot of data have been (and is still being) amassed by scholars that use only a shot counter and a watch to calculate an average. We will add a new parameter, the Median, but we will not remove ASL, for a) it has shown itself reliable in film studies and b) it makes Cinemetrics back-compatible with pre-cinemetrics measurements.

Let me use a metaphor for this. ASL and Median tell us the same story in two different ways. It is like a face having two views — in profile and full face. There was, at one point in the history of film studies, only one way of showing the face, like in the art of Egypt, in profile. Another one has now become available — the median, the full face view. Rather than replacing one with the other, Cinemetrics will offer both — much as police mug shots do.