
Using kernel densities to analyse film style

1. Introduction

Since a film typically comprises several hundred (if not thousands of) shots, describing its style clearly and concisely can be challenging. This is further complicated by the fact that editing patterns change over the course of a film. Numerical summaries are useful but limited in the amount of information they can convey about the style of a film: two films may have the same median shot length or interquartile range and yet have very different editing patterns. Such summaries describe the whole of a data set well but are less effective when it comes to accounting for changes in style over time. These problems may be overcome by using graphical as well as numerical summaries to communicate large amounts of information quickly and simply. Graphs also fulfil an analytical role, providing insights into a data set and revealing its structure. A good graph not only allows the reader to see what the writer wishes to convey about a data set, but also enables the researcher to discover what is important in the first place.

It should be common practice in the statistical analysis of film style to include graphical summaries of film style (though this is rarely the case), and there are several different types of simple graphs that can be used. These include cumulative distribution functions, box-plots, vioplots, and time-ordered displays such as run charts and order structure matrices. In this post I describe two different uses of kernel density estimation as graphical methods for analysing film style. The next section introduces the basics of kernel density estimation. Section three discusses the use of kernel densities to describe and compare shot length distributions, while section four applies kernel densities to the point process of two RKO musicals to describe and compare how cutting rates change over time.

2. Kernel Density Estimation

The kernel density is a nonparametric estimate of the probability density function of a data set, and shows us the range of the data, the presence of any outliers, the symmetry of the distribution (or lack thereof), the shape of the peak, and the modality of the data (Silverman 1986; Sheather 2004). A kernel density thus performs the same functions as a histogram but is able to overcome some of the limitations of the latter. Since no assumptions are required about the functional form of the data, kernel densities are a useful graphical method for exploratory data analysis (Behrens & Yu 2003). The purpose of exploratory data analysis is to reveal interesting and potentially inexplicable patterns in data so that we can answer the general question ‘what is going on here?’ Kernel densities allow us to do this by describing the relative likelihood that a shot in a film will take on a particular value, or by allowing us to see how the density of shots in a film changes over time.

The kernel density is estimated by summing the kernel functions superimposed on the data at every value on the x-axis. This means that we fit a symmetrical function (the kernel) over each individual data point and then add together the values of the kernels, so that the contribution of a data point xi to the density at x depends on how far it lies from x. The kernel density estimator is

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where n is the sample size, h is a smoothing parameter called the bandwidth, and K is the kernel function. There are several choices for K (Gaussian, Epanechnikov, triangular, etc.) though the choice of kernel is relatively unimportant, and it is the choice of the bandwidth that determines the shape of the density since this value controls the width of the kernel. If the bandwidth is too narrow the estimate will contain lots of spikes and the noise of the data will obscure its structure. Conversely, if the bandwidth is too wide the estimate will be over-smoothed and this will again obscure the structure of the data. The kernel density estimate is an improvement on the use of histograms to represent the density of a data set since the estimate is smooth and does not depend on the end-points of the bins, although a shared limitation is the dependence on the choice of the bandwidth. Another advantage of the kernel density is that two or more densities can be overlaid on the same chart for ease of comparison whereas this is not possible with a histogram.
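The estimator described above can be sketched in a few lines of Python. This is a minimal illustration rather than the method used to produce the figures in this post: it assumes a Gaussian kernel, the shot lengths are invented values, and the two bandwidths are arbitrary choices picked only to show under- and over-smoothing.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_density(x, data, h):
    """Density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Hypothetical shot lengths in seconds (illustrative values, not from any film).
shot_lengths = [1.2, 2.5, 2.5, 3.1, 4.0, 6.8, 15.3]

# Evaluate the estimate on a grid from 0 to 20 seconds in half-second steps.
grid = [i * 0.5 for i in range(41)]

# A narrow bandwidth gives a spiky estimate; a wide one flattens the structure.
narrow = [kernel_density(x, shot_lengths, h=0.3) for x in grid]
wide = [kernel_density(x, shot_lengths, h=5.0) for x in grid]
```

Plotting `narrow` and `wide` against `grid` shows the trade-off described above: the narrow estimate has a sharp spike wherever shots share a value, while the wide estimate smooths the cluster of short shots and the single long take into one broad hump.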

Figure 1 illustrates this process for Deduce, You Say (Chuck Jones, 1956), in which the density shows how the shot lengths of this film are distributed. Beneath the density we see a 1-D scatter plot in which each line indicates the length of a shot in this film (xi), with several shots having identical values. The Gaussian kernels fitted over each data point are shown in red and the density at any point on the x-axis is equal to the sum of the kernel functions at that point. The closer the data points are to one another the more the individual kernels overlap and the greater the sum of the kernels – and therefore the greater the density – at that point.

All widely available statistical software packages produce kernel density estimates for a data set. An online module for calculating kernel densities can be found here.

3. Describing and comparing shot length distributions

A shot length distribution is a description of the data set created for a film by recording the length of each shot in seconds. Analysing the distribution of shot lengths in a motion picture allows us to answer questions such as ‘is this film edited quickly or slowly?’ and ‘does this film use a narrow or a broad range of different shot lengths?’ Comparing the shot length distributions of two or more films allows us to determine if they have similar styles: is film A edited more quickly than film B, and does it exhibit more or less variation in its use of shot lengths? A kernel density estimate provides a simple method for answering these questions.

From the kernel density of Deduce, You Say in Figure 1 we see the distribution of shot lengths is asymmetrical, with the majority of shots less than 10 seconds long. There is a small cluster of shots around 15 seconds in length, and there are three outliers greater than 20 seconds. From just a cursory glance at Figure 1 we can thus obtain a lot of information very quickly that can then guide our subsequent analysis. For example, we might ask what events are associated with the longer takes in this film.

Figure 1 The kernel density estimate of shot lengths in Deduce, You Say (Chuck Jones, 1956) showing the kernel functions fitted to each data point (N = 58, Bandwidth = 1.356)

Suppose we wanted to compare the shot length distributions of two films. Figure 2 shows the kernel density estimates of the Laurel and Hardy shorts Early to Bed (1928) and Perfect Day (1929). It is immediately clear that, though both distributions are positively skewed, the shot length distributions of these two films are very different. The density of shot lengths for Early to Bed covers a narrow range of shot lengths while that for Perfect Day is spread out over a wide range of shot lengths. The high density at ~2 seconds for Early to Bed shows that the majority of shots in this film are concentrated at the lower end of the distribution with few shots longer than 10 seconds, while the lower peak for Perfect Day shows there is no similar concentration of shots of shorter duration and the shot lengths are spread out across a wide range (from 20 to 50.2 seconds) in the upper tail of the distribution. We can conclude that Early to Bed is edited more quickly than Perfect Day and that its shot lengths exhibit less variation; and though we could have come to these same conclusions using numerical summaries alone, the comparison is clearer and more intuitive when represented visually.

Figure 2 Kernel density estimates of shot lengths in Early to Bed (1928) and Perfect Day (1929)

4. Time series analysis using kernel densities

Film form evolves over time and we can use kernel density estimation to describe the cutting rate of a film. Rather than focussing on the length of a shot (L) as the time elapsed between two cuts, we are interested in the timing of the cuts (C) themselves. There is a one-to-one correspondence between cuts and shot lengths, and the time at which the jth cut occurs is equal to the sum of the lengths of the prior shots:

$$C_j = \sum_{i=1}^{j} L_i$$

Figure 3 shows the one-to-one nature of this relationship clearly.

Figure 3 The one-to-one relationship between shot lengths (Li) and the timing of a cut (Cj)
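The correspondence between shot lengths and cut times amounts to a running total, which can be sketched as follows. The shot lengths are invented for illustration, and the sketch assumes the film starts at time zero so that the jth cut closes the jth shot.

```python
from itertools import accumulate

# Hypothetical shot lengths L_1..L_4 in seconds (illustrative values only).
shot_lengths = [4.0, 2.5, 6.0, 3.5]

# C_j is the running total of the shot lengths up to and including shot j.
cut_times = list(accumulate(shot_lengths))

# Differencing the cut times (from a start time of 0) recovers the shot lengths,
# which is what makes the relationship one-to-one.
recovered = [c - p for c, p in zip(cut_times, [0.0] + cut_times[:-1])]
```

Because each list can be rebuilt from the other, nothing is lost by moving from shot lengths to cut times; only the point of view changes.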

Analysis of the cutting rate requires us to think of the editing of a film as a simple point process (Jacobsen 2006). A point process is a stochastic process whose realizations comprise a set of point events in time, which for a motion picture is simply the set of times at which the cuts occur. We apply the same method used above to the point process to produce a density estimate of the time series. Just as the density in the above examples is greatest where shot lengths are closest together in value, here the density is greatest where one cut quickly follows another, and therefore where the shots are shortest at that point in the film. Conversely, low densities indicate shots of longer duration, as consecutive cuts will be distant from one another on the x-axis. This is similar to the use of peri-stimulus time histograms and kernel methods in neurophysiology to visualize the firing rate and timing of neuronal spike discharges (see Shimazaki & Shinomoto 2010).

Using kernel density estimation to understand the cutting rate of a film as a point process is advantageous since it requires no assumptions about the nature of the process. Salt (1974) suggested using Poisson distributions as a model of editing as a point process described by the rate parameter λ, but this method is unrealistic since homogeneous Poisson point processes are useful only for applications involving temporal uniformity (Streit 2010: 1). For a motion picture the probability of a cut occurring at any point in time is not independent of previous cuts, and the time series will often be non-stationary over the course of a film while also demonstrating acceleration and deceleration of the cutting rate, because different types of sequences are characterised by different editing regimes. We expect to see clusters of long and short takes in a motion picture and so the assumption of a Poisson process will not be appropriate, while the presence of any trends will mean that the process does not satisfy stationarity. Modelling the cutting rate as an inhomogeneous Poisson point process by allowing λ to vary as a function of time may solve some – though not necessarily all – of these problems.

To illustrate the use of kernel densities in time series analysis we compare the editing of two films that feature Fred Astaire and Ginger Rogers: Top Hat (1935) and Shall We Dance (1937). In order to make a direct comparison between the evolution of the cutting rates, the running time of each film was normalised to a unit length by dividing each shot length by the total running time. In this case we treat slow transitions (e.g. fades, dissolves, etc.) as cuts, with the cut between two shots marked at the approximate midpoint of the transition. Figure 4 shows the resulting densities.
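The normalisation and density estimate described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for Figure 4: the kernel is assumed to be Gaussian, the bandwidth of 0.05 is an arbitrary choice, and the shot lengths describe an invented film with a rapidly cut opening followed by a few long takes.

```python
import math
from itertools import accumulate

def normalised_cut_times(shot_lengths):
    """Cut times rescaled so that the film runs from 0 to 1."""
    total = sum(shot_lengths)
    return [c / total for c in accumulate(shot_lengths)]

def cut_density(t, cuts, h):
    """Gaussian kernel density of the cut times, evaluated at time t."""
    norm = len(cuts) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((t - c) / h) ** 2) for c in cuts) / norm

# Hypothetical film: twenty 2-second shots followed by four 20-second takes.
shot_lengths = [2.0] * 20 + [20.0] * 4
cuts = normalised_cut_times(shot_lengths)

h = 0.05  # assumed bandwidth, chosen only for illustration
fast = cut_density(0.15, cuts, h)  # inside the rapidly cut opening
slow = cut_density(0.75, cuts, h)  # inside the run of long takes
```

Here `fast` is much larger than `slow`, which is the pattern read off the plots: high densities mark stretches where cuts follow one another quickly, and low densities mark long takes.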

From the plot in Figure 4 for Top Hat we see the density for this film comprises a series of peaks and troughs, but that there is no overall trend. The low densities in this graph are associated with the musical numbers, while the high densities occur with scenes based around the rapid dialogue between Astaire and Rogers. (See here for alternative time series analyses of Top Hat that use different methods but arrive at the same conclusions as those below).

The first musical number is ‘No Strings (I’m Fancy Free)’, which begins at ~0.07. Astaire is then interrupted when Rogers storms upstairs to complain about the racket, and we have a scene between the two in which both the dialogue and the editing are rapid. This occurs at the peak at ~0.11 to ~0.13, and is then followed by a reprise of ‘No Strings,’ which is again shot in long takes. The next section of the film follows on the next day as Astaire takes on the role of a London cabby and drives Rogers across town; as before, this dialogue scene is quickly edited, resulting in a high density of shots at ~0.19. This sequence finishes with ‘Isn’t This a Lovely Day (to be Caught in the Rain),’ which accounts for the low density of shots at ~0.21 to ~0.27 since this number again comprises long takes. The rapid cutting rate during dialogue scenes is repeated when Rogers mistakes Astaire for a married man at the hotel, and is again followed by the low density of a slow cutting rate for the scenes between Astaire and Edward Everett Horton at the theatre and the number ‘Top Hat, White Tie and Tails’ at ~0.4. After this number the action moves to Italy and there is much less variation in the density of shots in the first part of these scenes, which are focussed on dialogue and narrative. There is no big musical number until ‘Cheek to Cheek’ and this sequence accounts for the low density seen at ~0.66, being made up of just 13 shots that run to 435.7 seconds. The density increases again as we move back to narrative and dialogue until we get to the sequence in which Horton explains to the policeman the mix-up over who is married and who is not, and ‘The Piccolino’, which begins at ~0.89 and runs until ~0.96.

The density plot of the point process for Shall We Dance differs from that of Top Hat showing a trend over the running time of the film from higher to lower densities of shots, indicating the cutting rate in this film slows over the course of the film. Nonetheless we see the same pattern of troughs and peaks, and as in Top Hat these are associated with musicals and comedy scenes, respectively.

This film features numerous short dancing excerpts in its early scenes, but there is no large scale musical number until well into the picture. In fact, these early scenes are mostly about stopping Astaire dancing (e.g. when Horton keeps turning off the record), and the dialogue scenes that establish the confusion over Astaire’s married status as the ship departs France. These scenes are based around a similar narrative device to that used in Top Hat and are again edited quickly. The first big number in the film is ‘Slap that Bass’ and coincides with the low density section of the film beginning at ~0.17, indicating that this part of the film is edited more slowly than the first section. The cutting rate slowly increases until ~0.37, and this section includes the ‘Walking the Dog’ and ‘I’ve Got Beginner’s Luck’ numbers but is mostly made up of dialogue scenes between Astaire and Rogers. After this point the film exhibits a trend from higher to lower densities and there are a number of smaller cycles present between 0.37 and 0.64. This section includes the number ‘They All Laughed (at Christopher Columbus)’ and the subsequent dance routine, which begins at ~0.48 and includes the trough at ~0.54.
The low density section beginning at 0.64 is the scene between Astaire and Rogers in which they try to avoid reporters in the park, and comprises a number of lengthy dialogue shots and the film’s most famous number, ‘Let’s Call the Whole Thing Off.’ The editing then picks up during the dialogue scenes until we reach the next drop in the density at ~0.74, which coincides with the scenes on the ferry to Manhattan as Astaire sings ‘They Can’t Take That Away From Me.’ The next low density section begins at ~0.9, and is the big production at the end of the film, with the distant framing and static camera complementing the long takes in showing off the ‘Hoctor’s Ballet’ sequence, which then gives way to a more rapidly cut section featuring numerous cut-aways from the dancers to Rogers arriving at the theatre with the court order for Astaire, only to discover him on stage with dancers wearing masks of her face. The cutting rate then slows once more as Rogers insinuates herself into the ‘Shall We Dance’ routine and the film reaches its finale.

Figure 4 Kernel density estimates of the point processes for two RKO musicals with normalised running times

Comparing the two plots we note some of the low density periods coincide with one another. This is most clearly the case at around 0.2 and 0.64 in both films. The major numbers that end the films also occur at similar points in the narratives. This indicates that a musical number occurs at approximately the same points in both films even though the two films have different running times (Top Hat: 5819.9s, Shall We Dance: 6371.4s). This raises some interesting questions regarding the structure of other musicals featuring Astaire and Rogers. Is there always a musical number about a fifth of the way into an RKO musical featuring this pair? Is there always a major number about two-thirds of the way through the picture? And does the finale always occupy the last 10 per cent of the picture? Answers to these questions will have to wait until I finish transcribing all the films Astaire and Rogers made for RKO in the 1930s.

5. Conclusion

Kernel density estimation is a simple method for analysing the style of motion pictures, and the wide availability of statistical packages makes the use of kernel densities easy to incorporate into empirical research. Since it requires no prior assumptions about the distribution of the data, this method is appropriate for exploratory data analysis. In this paper we demonstrated how this method may be used to describe and compare the shot length distributions of motion pictures and for the time series analysis of film style.


Behrens JT and Yu C-H 2003 Exploratory data analysis, in JA Schinka and WF Velicer (eds.) Handbook of Psychology: Volume 2 – Research methods in Psychology. Hoboken, NJ: John Wiley & Sons: 33-64.

Jacobsen M 2006 Point Process Theory and Applications: Marked Point and Piecewise Deterministic Processes. New York: Birkhauser.

Salt B 1974 Statistical style analysis of motion pictures, Film Quarterly 28 (1): 13-22.

Sheather SJ 2004 Density estimation, Statistical Science 19 (4): 588-597.

Shimazaki H and Shinomoto S 2010 Kernel bandwidth optimization in spike rate estimation, Journal of Computational Neuroscience 29 (1-2): 171-182.

Silverman B 1986 Density Estimation for Statistics and Data Analysis. London: Chapman & Hall.

Streit RL 2010 Poisson Point Processes: Imaging, Tracking, and Sensing. Dordrecht: Springer.

Time Series Analysis of Top Hat (1935)

The editor Millie Moore (Johnny Got His Gun, Go Tell The Spartans) said:

… one of the most important jobs of the picture editor is to control the tempo and pace of the story (Yewdall 2007: 156).

The ebb and flow of pace and tempo determines the dramatic form of a film, and it is through editing (along with camera motion and sound energy) that the viewer’s attention is structured. Dorai and Venkatesh (2001) observed that in Hollywood narrative cinema, large changes of pace occur at the boundaries of story segments (e.g. transitions between scenes), while smaller changes in pace are identified with local narrative events of high dramatic import. Similarly, Cutting et al. (2011) noted that within each quarter and possibly each act of a Hollywood film there is a pattern of general shortening and then lengthening of shots reflecting a fluctuating intensification of continuity. Different emotional states are associated with different editing styles (Kang 2002). In the television schedule, adverts are edited more quickly than the programmes around them in order to attract the viewer’s attention and to improve product recall (Young 2007).

It would seem natural that the methods of time series analysis could help us to describe the evolution of the tempo and pace over the course of a film and thereby to understand how and why this element of film style changes.

However, there are a number of problems:

  • Time is not an independent variable: typically we apply time series methods to understand how some variable (e.g. stock prices, animal populations, etc) changes as a function of time, but here the variable of interest is time itself (i.e. the amount of time between two edits). This does not make time series analysis impossible, but it does require careful interpretation of the results: for example, spectral analysis will be event-based rather than time-based, and will show the number of events per cycle rather than the duration of the cycle in some unit of time. Treating this data as a ‘standard’ time series may lead to incorrect interpretation of the style of a film.
  • Shot length data is typically positively skewed with a number of outliers: many common methods of time series analysis (e.g. running means, autocorrelation functions) assume that the data is normally distributed, but this is not the case for the shot lengths in a motion picture; and failing to take this into account can lead to flawed conclusions and erroneous estimations of parameters for time series models.
  • Shot length data may exhibit nonlinear characteristics: many time series methods assume that the data is linear, but we may find that the style of a film exhibits conditional heteroscedasticity (e.g. the variance of shot lengths in a rapidly edited action sequence will be lower than in slower dialogue sequences), that any cycles present may be asymmetric, or that there are abrupt changepoints in style as one scene ends and another begins. Other nonlinear features may also be apparent.

These problems can be overcome by using ordinal or rank-based methods that make fewer assumptions about the distribution of the data and allow us to conduct exploratory data analysis before deciding on how to model the evolution of style in a film. Crucially, we need not be concerned that time is the variable of interest as these methods require only that the data is ordered – which in this case means the order in which the shots occurred (shot 1 is the first shot, shot 2 is the second, …). Two methods are illustrated here: running Mann-Whitney Z statistics and the order structure matrix. The data set used here is for Top Hat (1935), and can be accessed here as an Excel file: Nick Redfern – Top Hat.

The running Mann-Whitney Z statistic

The Mann-Whitney U test is a nonparametric test of the null hypothesis that two random variables are stochastically equal. For an introduction to the Mann-Whitney U test see here. Steve Mauget (2003, 2011) has applied the Mann-Whitney U test to time series analysis of climate data by using moving windows to sample the ranks of the data in order to identify regimes of high and low ranking data points. This method can be used to identify trends in the time-ordered data, to identify any intermittent cyclical regimes, and to identify changepoints in the series as the style of a film evolves. This method is akin to using a moving average, but instead of looking at the level in successive windows, we are looking at the ranks of the data.

The first step in generating a time series is to rank the N shots in a film from the smallest to the largest, with tied values assigned the average of the ranks they would have been assigned if there were no ties: if x2 and x3 have the same value they are assigned an average rank of (2+3)/2 = 2.5. The ranks are then sampled using a window of size n1, and the sum of the ranks of the shots in this window (R1) calculated. The values of n1 and R1 are used to calculate a U statistic by

$$U = R_1 - \frac{n_1 (n_1 + 1)}{2}$$

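and, if the sample is sufficiently large (n1 ≥ 10), then this can be transformed to a Z statistic by

$$Z = \frac{U - n_1 n_2 / 2}{\sqrt{n_1 n_2 (n_1 + n_2 + 1) / 12}}$$

where n2 = N − n1 is the number of shots outside the window.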

If we plot the set of Z statistics produced by applying this method to Top Hat we get the time series in Figure 1, which was constructed using a sampling window of 20 shots. The significance of the Z statistic can be determined with reference to a standard normal distribution. Thus if α = 0.05, the critical z-value is ±1.96; and so when Z ≥ 1.96 we will identify a significant cluster of high-ranking shots (i.e. long takes) and when Z ≤ −1.96 we will identify a significant cluster of low ranking shots (i.e. short takes).

The series in Figure 1 contains a lot of redundant information because consecutive windows overlap the same shots (i.e. if n1 = 20 then nineteen of the shots in window 1 will also appear in window 2), and so the windows we are interested in are the most-significant non-overlapping windows.
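The ranking, windowing, and transformation steps described above can be sketched in Python. This is a minimal sketch under assumptions that are mine rather than the source's: each window is compared against the n2 = N − n1 shots outside it, the usual large-sample normal approximation is used with no correction for ties, and the data are invented. It does not attempt the further step of selecting the most-significant non-overlapping windows.

```python
import math

def average_ranks(values):
    """Rank values from smallest (1) to largest (N); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def running_mw_z(values, window):
    """Z statistic for every window of `window` consecutive shots."""
    ranks = average_ranks(values)
    n_total = len(values)
    n1 = window
    n2 = n_total - n1  # assumption: the shots outside the window
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n_total + 1) / 12.0)  # no tie correction
    z = []
    for start in range(n_total - n1 + 1):
        r1 = sum(ranks[start:start + n1])
        u = r1 - n1 * (n1 + 1) / 2.0
        z.append((u - mean_u) / sd_u)
    return z

# Hypothetical film: thirty 2-second shots followed by thirty 10-second shots.
shot_lengths = [2.0] * 30 + [10.0] * 30
z_series = running_mw_z(shot_lengths, window=10)
```

For this invented data the first windows, which contain only short shots, fall below −1.96, and the last windows, which contain only long shots, rise above +1.96, so the series passes from a significant cluster of short takes to a significant cluster of long takes.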

Figure 1 Running Mann-Whitney Z statistics for Top Hat (1935) using a 20 shot window, with significance at Z = ±1.96

From Figure 1, we can see that Top Hat has a number of peaks and troughs corresponding to clusters of longer and shorter shots.

The first peak (A) includes the meeting between Jerry and Horace at the beginning of the film that sets up the story and the first musical number ‘No Strings (I’m Fancy Free).’ Jerry’s performance of this number is interrupted by Dale, whom he has woken with his dancing, and there is a sequence of a more rapidly edited shot-reverse shot pattern that occurs at the first trough (1). After Dale returns to her room, Jerry decides to cover the floor with sand and dances to a reprise of ‘No Strings,’ and this can be seen in the second peak (B). The following morning, Jerry takes the place of a hansom cab driver and escorts Dale to the stables, and this is a second quickly edited dialogue scene that occurs at 2. The peak at C occurs with the second musical number, ‘Isn’t This a Lovely Day (to be Caught in the Rain).’ The troughs at 3 and 4 coincide with Dale mistaking Jerry for Horace in the hotel lobby, and a subsequent sequence which cross-cuts between Dale and Jerry in different hotel rooms after the former has slapped the latter. The peak Z statistic occurs at D, which is the sequence at the theatre in which Jerry and Horace talk in the dressing room before Jerry goes on stage to perform ‘Top Hat, White Tie, and Tails.’ This first half of the film takes place in London, and the peaks and troughs are associated with particular aspects of the musical comedy: the peaks (i.e. the clusters of higher ranked, and therefore longer, shots) are associated with the musical numbers, while the troughs are associated with the comedy story line of the mix up in the romance of Dale and Jerry.

As the action moves to Italy, we get troughs at 5, which is the sequence in which Dale and Madge chat next to the canal, and 6, which is another sequence cross-cut between locations as Dale and Jerry speak on the phone. The peak at E occurs at the third musical number, ‘Cheek to Cheek,’ and its extended dance sequence. So far we have the same pattern that we saw in the London sequences: comedy is quick and musical slow. However, the peak at F is not associated with a musical number and spans three scenes. This peak includes the end of the sequence in which Jerry and Horace talk (after Madge has given her husband a black eye), the long static takes in which Dale accepts Beddini’s proposal (as the melody for ‘Cheek to Cheek’ is played in the background), and the beginning of the next sequence as Jerry and Horace are asked by the hotel to vacate the bridal suite and they go on to talk to Madge. This peak (F) is the only case of the narrative being characterised by a cluster of long shots in the film.

The next trough (7) is the sequence in which Jerry meets Beddini in the bedroom of the bridal suite before he takes Dale out on the canal. The trough at 8 is a cluster of short shots in which the narrative of mistaken identity is resolved between Beddini, Horace, and Madge (but not Dale and Jerry), and is followed by the final peak G, which is the last of the musical numbers, ‘The Piccolino,’ and the carnival dance sequence.

Overall, we can see from the editing structure revealed by using the running Mann-Whitney Z statistic that Top Hat is characterised by alternating clusters of longer and shorter takes, in which the former are typically associated with the musical parts of the film and the latter with the comedy-romance narrative.

This method can also be used to compare different films side by side, and in a few weeks I’ll post a paper using this method to analyse the time series of 15 BBC News bulletins that places this data into a single frame of reference so similarities and differences can be identified.

The order structure matrix

The same information we obtained from the running Mann-Whitney Z statistic can be seen in the order structure matrix for Top Hat in Figure 2, based on whether a shot is greater than or less than the shot that comes after it (Bandt 2005). To construct the matrix we assign a value of 1 when xs ≥ xt and a value of 0 when xs < xt. To make this easier to visualise we assign a colour to each value (1 = black, 0 = white) and plot the matrix in a grid. The dark patches in Figure 2 correspond to the peaks in Figure 1 and exhibit clustering of longer shots in the film, while the light patches correspond to the troughs of Figure 1 and show where the clusters of shorter shots are to be found. Although this plot looks complicated, once you get used to the method and are familiar with the events of the film you can simply read the changes in cutting style from left to right.
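The construction of the matrix can be sketched in Python as follows. This is an illustration only: the shot lengths are invented, every pair of shots (s, t) is compared, and the choice to index rows by s and columns by t is an assumption, since the orientation is not stated above.

```python
def order_structure_matrix(values):
    """Return M with M[s][t] = 1 when values[s] >= values[t], otherwise 0."""
    return [[1 if xs >= xt else 0 for xt in values] for xs in values]

# Hypothetical shot lengths in seconds (illustrative values only).
shot_lengths = [2.0, 8.0, 3.0]
matrix = order_structure_matrix(shot_lengths)

# The long second shot is at least as long as every other shot, so its row is
# all 1s; the short first shot only matches itself, so its row is mostly 0s.
for row in matrix:
    print(row)
```

With a real film the matrix has one row and one column per shot, and runs of long takes show up as bands dominated by 1s. One way to draw it with black for 1 and white for 0 would be matplotlib's `imshow` with the `gray_r` colour map, though any grid plot will do.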

Figure 2 Order structure matrix for Top Hat (1935)

Figure 2 was produced by first calculating the matrix in Microsoft Excel; and then cutting and pasting the resulting array of 1s and 0s into the latest version of PAST (which you can download for free here), selecting the whole spreadsheet, and then choosing MATRIX from the PLOT menu.

Alternatively, you can produce Figure 2 by applying the filled.contour command in R to the matrix (see here for an explanation).

This method has a particular limitation: it is only really effective with large data sets, and it can be quite difficult to make out distinct patterns even when there are as many as 250 shots in a film. If, however, you have 500 or more shots, then it is an excellent place to start your exploration of the shot length data for a film.


Bandt C 2005 Ordinal time series analysis, Ecological Modelling 182: 229-238. [There is an online version of this paper that can be downloaded for free, but there is no URL associated with it. Search for the title and you’ll find it].

Cutting JE, Brunick KL, and DeLong JE 2011 How act structure sculpts shot lengths and shot transitions in Hollywood film, Projections 5 (1): 1-16.

Dorai C and Venkatesh S 2001 Bridging the semantic gap in content management systems: computational media aesthetics, in Proceedings 2001 International Conference on Computational Semiotics in Games and New Media. 10-12 September 2001, Amsterdam: 94-99.

Kang H-B 2002 Analysis of scene context related with emotional events, in Proceedings of the 10th ACM International Conference on Multimedia. 1-6 December 2002, Juan les Pins, France: 311-314.

Mauget SA 2003 Intra- to multidecadal climate variability over the continental United States: 1932–99, Journal of Climate 16: 3905–3916.

Mauget SA 2011 Time series analysis based on running Mann-Whitney Z statistics, Journal of Time Series Analysis 32 (1): 47–53.

Yewdall DL 2007 Practical Art of Motion Picture Sound, third edition. Burlington, MA: Focal Press.

Young C 2007 Fast editing speed and commercial performance, Admap 483: 30-33.