Data on movies is very useful from a statistical learning perspective. Instructors of statistics & machine learning programs use movie data instead of dryer & more esoteric data sets to explain key concepts. Since movies are universally understood, teaching statistics becomes easier since the domain is not that hard to understand.
Here are 10 great datasets on movies. Please feel free to add any I may have missed out.
GroupLens Research has collected and made available rating data sets from the MovieLens web site ( http://movielens.org ). The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.
This data set contains a list of over 10000 films including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, etc. The data is stored in relational form across several files. The central file (MAIN) is a list of movies, each with a unique identifier. These identifiers may change in successive versions. The actors (CAST) for those movies are listed with their roles in a distinct file. More information about individual actors (ACTORS) is in a third file. All directors in MAIN are listed in a fourth file (PEOPLE), with a number of important producers, writers, and cinematographers. A fifth file (REMAKES) links movies that were copied to a substantial extent from each other. The sixth file (STUDIOS) provides some information about studios shown in MAIN.
IMDB makes their raw data available. Unfortunately, the data is divided into many text files and the format of each file differs slightly. To create one data file containing all the desired information Wickham wrote a script in the ruby to extract the relevant information and store in a database. This data was then exported into csv for easy import into many programs. The following text files were downloaded and used:
- business.list. : Total budget
- Genres.list. Genres that a movie belongs to (eg. comedy and action)
- movies.list. Master list of all movie titles with year of production.
- mpaa-ratings-reasons.list. MPAA ratings.
- ratings.list. IMDB fan ratings.
- running-times.list. Movie length in minutes.
The Movie dataset contains weekend and daily per theater box office receipt data as well as total U.S. gross receipts for a set of 49 movies. Dates are provided for all time series values. The diverse list of movies was selected, not at random, but to spark student interest and to provide a range of box office values. The values provide a rich dataset to use for applications such as simple graphical analysis, a variety of time series and causal forecasting models, curve-fitting, and rate of change analysis. A series of assignment questions is included and the accompanying Instructor’s Manual provides representative solutions.
LinkedMDB publishes linked open data using the D2R Server . The project aims at publishing the first open semantic web database for movies, including a large number of interlinks to several datasets on the open data cloud and references to related web pages.
The data is stored here
The OMDb API is a free web service to obtain movie information, all content and images on the site are contributed and maintained by our users.
Freebase’s database of movie title & all its metadata includes more than 340,000 titles.
This page is a distribution site for movie-review data for use in sentiment-analysis experiments. Available are collections of movie-review documents labeled with respect to their overall sentiment polarity (positive or negative) or subjective rating (e.g., “two and a half stars”) and sentences labeled with respect to their subjectivity status (subjective or objective) or polarity .
This dataset contains faceted metadata describing contemporary American films, along with relevant judgements by actual human users. This dataset can be used for developing personalized faceted search interfaces, among other projects requiring rich, structured metadata. This dataset was used for the experiments presented at WWW 08 .
This dataset is a mashup of three seperate datasets: