Otthein Herzog & Philipp Kehl (Bremen)
Automatic Movie Trailer Generation based on Semantic Video Patterns
It is common practice within the entertainment industry to summarize the content of a film in a short movie trailer while at the same time conveying a sample of its overall atmosphere. Automatic video abstracting systems face a similar task. However, most existing approaches lack the power to analyze and interpret the semantic content of video data and to assess the dramatic and narrative importance of individual samples of footage.
This observation made us wonder whether, and to what extent, it is feasible to automate trailer production as an example of video abstracting with a strong emphasis on semantic information. Inspired by the results of  and the commercial software muvee , a similar approach as in  was followed. To narrow down the problem, we decided to focus on trailers for action movies, since they tend to rely more on spectacular effects and visual action, which are easier to analyze and process than dialogue or emotions. Based on a manual analysis of different trailers, nineteen basic categories of shots were identified that are crucial for most action trailers. They are:
- main character speaking in close-up,
- main character not speaking in close-up,
- main character speaking,
- main character not speaking,
- any character speaking in close-up,
- any character not speaking in close-up,
- any character speaking,
- any character not speaking,
- long punch line spoken by any character,
- short punch line spoken by any character,
- shot featuring an explosion,
- shot featuring fire,
- shot featuring a gunshot,
- fast action,
- slow action,
- spectacular action,
- character shouting,
- character screaming, and
- display of setting.
In the next step, a number of different audio and video processing methods were implemented to extract significant information with respect to these categories from the data of a given movie. The video analysis comprises the detection of shots, human faces, written text, and general movement within a shot, as well as the visual identification of explosions.
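Shot detection is the first of the video analysis steps named above. The abstract does not say which method the system used, so the following is only a minimal sketch of one common technique, hard-cut detection by comparing grey-level histograms of consecutive frames. The function names, the bin count, and the threshold are all assumptions made for illustration, and frames are modeled as flat lists of pixel values from 0 to 255.

```python
from typing import List, Sequence, Tuple


def histogram(frame: Sequence[int], bins: int = 8) -> List[float]:
    """Normalized grey-level histogram of one frame (pixel values 0..255)."""
    counts = [0] * bins
    for value in frame:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def detect_cuts(frames: Sequence[Sequence[int]],
                threshold: float = 0.5,
                bins: int = 8) -> List[int]:
    """Return the indices of frames that start a new shot.

    A cut is assumed wherever the histogram distance to the previous
    frame exceeds the (hypothetical) threshold.
    """
    cuts: List[int] = []
    previous = None
    for index, frame in enumerate(frames):
        current = histogram(frame, bins)
        if previous is not None:
            # Half the L1 distance lies in [0, 1] for normalized histograms.
            distance = 0.5 * sum(abs(a - b) for a, b in zip(previous, current))
            if distance > threshold:
                cuts.append(index)
        previous = current
    return cuts


def cuts_to_shots(cuts: Sequence[int], frame_count: int) -> List[Tuple[int, int]]:
    """Convert cut indices into inclusive (start, end) frame ranges."""
    starts = [0] + list(cuts)
    ends = [c - 1 for c in cuts] + [frame_count - 1]
    return list(zip(starts, ends))
```

Representing each shot as an inclusive frame range matches the way the analysis results are described as being stored. A real detector would also need to handle gradual transitions such as fades and dissolves, which this sketch ignores.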
The audio analysis comprises the detection of single sound events (including screams and gunshots), significant changes in sound volume, the identification of specific spoken text (for example punch lines and remarkable quotes), and the general detection of music. All this information was stored as frame ranges of the analyzed movie in an XML file. In the next step, formal descriptions of the nineteen shot categories were developed, using the analyzed features as attributes. For example, the category Setting was described as having slow movement, a low general sound volume, no visible text, a duration of more than two seconds, no visible faces of characters, and no audible speech.
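The description of the Setting category can be read as a conjunction of conditions over the stored frame ranges. The sketch below illustrates that reading under stated assumptions: the XML layout, the feature names, and the frame rate attribute are invented for this example and are not the system's actual file format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (start_frame, end_frame)

# Hypothetical annotation file: one element per detected feature range.
ANNOTATIONS = """
<movie fps="25">
  <feature name="face" start="0" end="40"/>
  <feature name="speech" start="10" end="35"/>
  <feature name="slow_movement" start="100" end="220"/>
  <feature name="low_volume" start="90" end="230"/>
  <feature name="text" start="300" end="320"/>
</movie>
"""


def load_features(xml_text: str) -> Tuple[float, Dict[str, List[Range]]]:
    """Parse the annotation XML into a frame rate and ranges per feature."""
    root = ET.fromstring(xml_text)
    fps = float(root.get("fps"))
    features: Dict[str, List[Range]] = {}
    for node in root.iter("feature"):
        frame_range = (int(node.get("start")), int(node.get("end")))
        features.setdefault(node.get("name"), []).append(frame_range)
    return fps, features


def overlaps(ranges: List[Range], start: int, end: int) -> bool:
    """True if any range shares at least one frame with [start, end]."""
    return any(s <= end and start <= e for s, e in ranges)


def covers(ranges: List[Range], start: int, end: int) -> bool:
    """True if a single range contains the whole of [start, end]."""
    return any(s <= start and end <= e for s, e in ranges)


def is_setting(shot: Range, fps: float, features: Dict[str, List[Range]]) -> bool:
    """Apply the Setting description: slow, quiet, longer than two seconds,
    and free of text, faces, and speech."""
    start, end = shot
    duration_seconds = (end - start + 1) / fps
    return (
        duration_seconds > 2.0
        and covers(features.get("slow_movement", []), start, end)
        and covers(features.get("low_volume", []), start, end)
        and not overlaps(features.get("text", []), start, end)
        and not overlaps(features.get("face", []), start, end)
        and not overlaps(features.get("speech", []), start, end)
    )
```

With the sample annotations, the shot covering frames 100 to 220 satisfies every condition, while the shot covering frames 0 to 40 is rejected because it is too short and contains a face and speech. The other categories could presumably be expressed in the same way, each as its own combination of feature conditions.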
The system was now able to identify each shot category as frame ranges within the analyzed movie. In a final step, these shots (combined with automatically generated animations and a new, footage-unrelated soundtrack of music and sound effects) were assembled into a short, trailer-like movie clip, based on their individual dramatic function within a typical Hollywood action trailer.
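One way to picture the assembly step is as filling the slots of a trailer template with classified shots. The abstract does not specify the template or the selection strategy, so the slot order and the category labels below are purely hypothetical, and the sketch simply takes the earliest unused shot of each requested category.

```python
from typing import Dict, List, Sequence, Tuple

Shot = Tuple[int, int]  # inclusive (start_frame, end_frame)

# Hypothetical slot order for illustration only.
TEMPLATE = [
    "setting",
    "main_character_speaking",
    "fast_action",
    "explosion",
    "short_punch_line",
]


def assemble_trailer(template: Sequence[str],
                     shots_by_category: Dict[str, List[Shot]]
                     ) -> List[Tuple[str, Shot]]:
    """Fill each template slot with the next unused shot of its category.

    Slots whose category has no remaining shots are skipped, so the
    result may be shorter than the template.
    """
    # Copy the pools so the caller's lists are not modified.
    pools = {category: list(shots) for category, shots in shots_by_category.items()}
    timeline: List[Tuple[str, Shot]] = []
    for category in template:
        pool = pools.get(category)
        if pool:
            timeline.append((category, pool.pop(0)))
    return timeline
```

For example, given shots for only the setting, fast action, and explosion categories, the function returns those three shots in template order and silently skips the two unfilled slots. The generated animations and the new soundtrack mentioned above are outside the scope of this sketch.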
This system demonstrated that quite convincing results can be achieved in automatic trailer generation for action movies, a predetermined genre with specific properties. A user test of our automatically produced trailers showed that they were well accepted and in many ways comparable to professionally composed trailers.