Getting data out of cinemetrics
The Cinemetrics database has a lot of shot length data that you can access freely, but it is not necessarily in the form in which you can get the most out of it. Although some summary statistics are given (mean, standard deviation, minimum, maximum) are given, we may want other statistics (e.g., the median and the median absolute deviation, the lower and upper quartiles, the geometric mean and multiplicative standard deviation) from the data. You may also want to produce some different graphical representations of shot length data (box plots, empirical distribution functions) than the polynomial graph that Cinemetrics automatically produces. In order to calculate these statistics we need to have the value of each shot length.
Various scholars have calculated mean shot lengths by counting the number of shots in a film, and then dividing the running time by this figure. This is useless because you only get the mean shot length (which is not a helpful statistic when shot length data is typically skewed), and you do not get any measure of the dispersion around the mean. It is also a lot of effort for very little reward, as there is very little you can do with this one figure. Having the length of each shot is always preferable, and reports of statistics of film style done using the count and divide method should be viewed with suspicion.
But never fear – the Cinemetrics database has the length of each shot within it. You just need to get at that data. I have described this procedure in an earlier post (here), but it will be useful to repeat it. First, you need get some data, and for the purposes of this demonstration we will use Charlie Chaplin’s His New Job (1915). (The data for this film at the Cinemetrics database is here). Next you need to disconnect the data in the web page from the software that draws the graph, and you can do this by working offline: in your browser, go to the File menu and click ‘Work offline’ (you may need to disconnect from the internet, and if you are on a network this may present its own problems). Now tell the web page to redraw the graph (e.g. set the height to ‘300 pix’ and click Redraw) – a lot of red text should appear, but will not be able to disappear (see Figure 1). This is all the shot length data for the film.
Figure 1 Shot length data becomes visible for His New Job (1915) by separating the data from the Cinemetrics software
Now you can see your data, what should you do with it? Well you could type it into a spreadsheet, which isn’t too bad for His New Job as it only has a total of 175 shots. With practice and patience, you can quickly and accurately enter a lot of data in a spreadsheet. This is, however, a laborious and time-consuming process, and while you can save your work and come back to it later, it soon becomes tedious if a film has a lot of shots or if you are working with data fro a lot of films.
What is needed is a quicker method, and if you are using Microsoft Excel then the following procedure can be used. (If you are using a different spreadsheet, then what follows may still be easy to do but you will need to work out the precise details according to how the spreadsheet and its functions are set up). I’m using Excel 2007 here, but earlier versions will have the same functions that are used here.
The first step is to get the shot length data on to the screen as outlined above.
The next step is to select the red text, and copy and paste it into Excel. How you paste the data is important: right-click on the cell in which you want to paste the first line of the data and select Paste Special, and then in the dialogue box that opens select Text, and click OK. What you should be left with is a spreadsheet with all the text from the Cinemetrics web-page, in which every line of text occupies a single cell in the spreadsheet (see Figure 2).
Figure 2 Text pasted into an Excel spreadsheet.
Now the shot length data is in our spreadsheet, but is not in a usable form due the presence of the text in the same cell. For example, cell B4 in Figure 2 contains “Shot Length: 4.8 sec.”, and we only want the numerical part of this cell (the “4.8”). Using the Replace function we can eliminate the text and be left with just the numbers. In Excel 2007, the Replace function is located on the far-right of the Home tab, or you can open the dialogue box by pressing CTRL+H. (Earlier versions of Excel will have a similar function located in one of the drop-down menus). Once you have opened the Replace function dialogue box, enter “Shot length: ” into the Find what box and leave the Replace with box empty (see Figure 3). Click Replace All.
Figure 3 Using the Replace function to eliminate text and leave behind data
Repeat this process by entering ” sec.” into the Find what box, leave the Replace with box empty, and click Replace All. You should now be left with a spreadsheet in which the shot length data remains without any text. This makes it possible for Excel to recognise the numbers as numbers, and is much quicker than typing all the data out.
But we need to complete a final step before the shot length data is in its final form, and this relies upon using the VLOOKUP function in Excel. If we are trying to save time, we don’t want to waste effort deleting all the rows between the shot length values, and using VLOOKUP we don’t have to.
First, number all the cells from 1 to whatever so that the data for each shot length falls in a cell that is a multiple of 4 (assuming advanced data entry into the Cinemetrics database – for simple data entry the shot length data won’t land in every fourth cell and you will need to adjust the numbering accordingly, either way the same principle applies). See Figure 4 for an example. This can be done very quickly using Excel’s Autofill function.
Figure 4 Numbering data cells
Now we need to tell Excel to look up the value of every fourth cell, using VLOOKUP (which means Vertical Look-up). (If you’re not sure how the VLOOKUP command works, check the helpfile in Excel). To do this, create a number list of multiples of four starting at 4, again using Excel’s Autofill function. This list needs to include every multiple of four that corresponds to a shot length in the data, and as there are 175 shots we need to go to 700 (i.e. 4 multiplied by 175). In the cell next to the number 4 in this new list type in the VLOOKUP command:
=VLOOKUP(E4, A:B, 2, False).
This tells Excel to find the value in cell E4 (in this case “4”) in the array comprised of columns A and B (where our numbered data is) and to return the value in the second column (i.e. column B) wherever there is an exact match (the “False” command). This returns the shot length of the first shot and is “4.8.” Drag to fill the formula in to next cell for each multiple of four, and what you end up with is the shot length data in an easy to use form (see Figure 5). (Actually, you need to select this data, Copy, and Paste Special and past the data into a new column as Values, otherwise it will copy and paste the formulas).
Figure 5 Shot length data in a form ready to use
Once you’ve set up your spreadsheet with the VLOOKUP commands you will not have to repeat this step each time you load new data into Excel, but you will have to do the find and and replace steps each time.
Now you are ready to anlayse the shot length data!
It is a simple method, and once you’ve tried it you will soon find that it is very easy to get a lot of data very quickly, thereby speeding up your research.