Thursday, March 6, 2008

Loading NetFlix Prize Data into MySQL

I may be a year late, but I decided to enter into the NetFlix Prize contest the other day. I've already got MySQL (Lampp) running on my box (Fedora 8) so I decided to load the data into my DB to do some basic analytics. There's some good info out there, but nothing that encompassed all the steps I took, so I'm going to repeat my process here.

1. Enter the Contest and download the dataset. Nothing too complicated here, you just need a unique team name and an email address.

2. Get Ruby where it needs to be. For this I used yum to install both the ruby and ruby-devel packages.

2a. Then you need to download and install Ruby/MySQL. Here are the basic steps from http://lafcadio.rubyforge.org/manual/ch02s02.html

wget http://tmtm.org/downloads/mysql/ruby/mysql-ruby-2.7.tar.gz
tar zxvf mysql-ruby-2.7.tar.gz
cd mysql-ruby-2.7
ruby extconf.rb --with-mysql-config
make
ruby test.rb [hostname] [username] [dbpassword]
sudo make install


2b. After that is all set and you can connect to the MySQL w/ the test.rb script, download the ruby DBI from here: http://www.kitebird.com/articles/ruby-dbi.html#TOC_3


wget http://rubyforge.org/frs/download.php/12368/dbi-0.1.1.tar.gz
tar zxvf dbi-0.1.1.tar.gz
ruby setup.rb config --with=dbi,dbd_mysql
ruby setup.rb setup
sudo ruby setup.rb install


In the config command, I chose to only include the MySQL packages of the DBI. I think it's generally better to only include the packages you want as opposed to always including everything. If you don't specify which DB packages you want, the installer will assume you want everything. If you don't have all of the required development packages already on your machine (ie oracle and sybase) you'll get into errors when you run the setup command.

3. Now find a ruby script to load the data. I used this one from juretta:



#!/usr/bin/ruby
#
# Created by Stefan Saasen on 2007-04-24.
# Copyright (c) 2007. All rights reserved.
require 'dbi'

# Configuration
DB_DATABASE = 'netflix'
DB_USER = 'user'
DB_PASSWORD = '******'

# Directory with netflix dataset
NETFLIX_DOWNLOAD_DIR = File.join("/sata_ext/net_flix")

NETFLIX_MOVIE_TITLES = File.join(NETFLIX_DOWNLOAD_DIR, "movie_titles.txt")
NETFLIX_TRAINING_SET = File.join(NETFLIX_DOWNLOAD_DIR, "training_set")

# Profile...
start = Time.now

# Connect to the MySQL database
DBI.connect("DBI:Mysql:#{DB_DATABASE}", DB_USER, DB_PASSWORD) do |dbh|
# create the necessary tables
stmts = DATA.read.split(";").delete_if{|stmt| stmt.strip.empty?}
stmts.each do |stmt|
# Execute the current SQL statement
dbh.do(stmt)
end

# 1. Insert movie titles into movie_titles table using simple INSERT STATEMENTS
sql = "INSERT INTO movie_titles (id, year_of_release, title) VALUES(?,?,?)"
# It would be even more efficient to use INSERT statements that
# use VALUES syntax which can insert multiple rows.
# e.g. INSERT INTO movie_titles (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
dbh.prepare(sql) do |sth|
File.open(NETFLIX_MOVIE_TITLES) do |f|
f.each do |line|
# Execute the prepared INSERT Statement
id, year_of_release, title = line.split(",")
sth.execute id.to_i, year_of_release.strip, title.strip
end
end
end

# 2. Insert rating data from training_set using the fast LOAD DATA INFILE syntax
# The filename contains the movie id!
sql = "LOAD DATA INFILE ? INTO TABLE ratings FIELDS TERMINATED BY ',' "+
"IGNORE 1 LINES (customer_id, rating, date) SET movie_id = ?;"
dbh.prepare(sql) do |sth|
Dir[NETFLIX_TRAINING_SET + "/*.txt"].each do |trs_file|
puts "Importing #{trs_file}..."
if trs_file =~ /([0-9]+)\.txt$/
movie_id = $1.dup.to_i
else
raise "Missing movie_id (file: #{trs_file})"
end
# Execute the prepared statement using the .txt file and the movie_id
sth.execute File.expand_path(trs_file), movie_id
end
end

# Add index for certain columns
#puts "Creating index on the ratings table - may take a while"
#["CREATE INDEX mid_idx ON ratings (...)", "CREATE INDEX cid_idx ON ..."].each{|stmt| dbh.do(stmt)}
end
puts "Import successfully finished!\n"
puts "The netflix data import took #{sprintf("%0.2f", (Time.now - start)/60.0)} minutes"

__END__
DROP DATABASE IF EXISTS netflix;
CREATE DATABASE netflix /*!40100 DEFAULT CHARACTER SET latin1 */;
USE netflix;

DROP TABLE IF EXISTS `movie_titles`;
CREATE TABLE `movie_titles` (
`id` int(11) NOT NULL,
`year_of_release` YEAR(4) default NULL,
`title` VARCHAR(250),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `ratings`;
CREATE TABLE `ratings` (
`id` int(11) NOT NULL auto_increment,
`movie_id` int(11) NOT NULL,
`customer_id` int(11) NOT NULL,
`date` date default NULL,
`rating` int(1),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


The last part there are the SQL commands to create the tables needed for this script.

I loaded all of the data onto my machine (Dual Intel 2.13 GHz w/ 1GB RAM) in about 13 minutes.

2 comments:

The Gastronomator! said...

Hi, I am trying to get my hands on the data set but now that the contest is over, it's hard. Would you be willing to help? Thanks!

Ronald Bradford said...

I likewise wanted to play with the data, and see it's closed. Do you still have the raw data files?

I wrote about the data source here?

Seeking public data for benchmarks