Thursday, March 6, 2008

Accent mode in Emacs 22

For a long time I used the emacs iso-accents-mode for my editing in Spanish. Looks like this has finally been removed from the standard emacs distros, so I had to figure out whatever they replaced it with. Here's what I (eventually) found: To get the same functionality as the old iso-accents-mode, you need to change the input mode. The official doc can be found here, but here are the basic commands.

M-x list-input-methods


C-x RETURN C-\ method RETURN
This is how to select the input method.

C-\
Toggles the input mode on/off. Pretty useful.

The mode I use is latin-1-prefix. It should be familiar to the iso-accents-mode followers.

Loading NetFlix Prize Data into MySQL

I may be a year late, but I decided to enter into the NetFlix Prize contest the other day. I've already got MySQL (Lampp) running on my box (Fedora 8) so I decided to load the data into my DB to do some basic analytics. There's some good info out there, but nothing that encompassed all the steps I took, so I'm going to repeat my process here.

1. Enter the Contest and download the dataset. Nothing too complicated here, you just need a unique team name and an email address.

2. Get Ruby where it needs to be. For this I used yum to install both the ruby and ruby-devel packages.

2a. Then you need to download and install Ruby/MySQL. Here are the basic steps from http://lafcadio.rubyforge.org/manual/ch02s02.html

wget http://tmtm.org/downloads/mysql/ruby/mysql-ruby-2.7.tar.gz
tar zxvf mysql-ruby-2.7.tar.gz
cd mysql-ruby-2.7
ruby extconf.rb --with-mysql-config
make
ruby test.rb [hostname] [username] [dbpassword]
sudo make install


2b. After that is all set and you can connect to the MySQL w/ the test.rb script, download the ruby DBI from here: http://www.kitebird.com/articles/ruby-dbi.html#TOC_3


wget http://rubyforge.org/frs/download.php/12368/dbi-0.1.1.tar.gz
tar zxvf dbi-0.1.1.tar.gz
ruby setup.rb config --with=dbi,dbd_mysql
ruby setup.rb setup
sudo ruby setup.rb install


In the config command, I chose to only include the MySQL packages of the DBI. I think it's generally better to only include the packages you want as opposed to always including everything. If you don't specify which DB packages you want, the installer will assume you want everything. If you don't have all of the required development packages already on your machine (ie oracle and sybase) you'll get into errors when you run the setup command.

3. Now find a ruby script to load the data. I used this one from juretta:



#!/usr/bin/ruby
#
# Created by Stefan Saasen on 2007-04-24.
# Copyright (c) 2007. All rights reserved.
require 'dbi'

# Configuration
DB_DATABASE = 'netflix'
DB_USER = 'user'
DB_PASSWORD = '******'

# Directory with netflix dataset
NETFLIX_DOWNLOAD_DIR = File.join("/sata_ext/net_flix")

NETFLIX_MOVIE_TITLES = File.join(NETFLIX_DOWNLOAD_DIR, "movie_titles.txt")
NETFLIX_TRAINING_SET = File.join(NETFLIX_DOWNLOAD_DIR, "training_set")

# Profile...
start = Time.now

# Connect to the MySQL database
DBI.connect("DBI:Mysql:#{DB_DATABASE}", DB_USER, DB_PASSWORD) do |dbh|
# create the necessary tables
stmts = DATA.read.split(";").delete_if{|stmt| stmt.strip.empty?}
stmts.each do |stmt|
# Execute the current SQL statement
dbh.do(stmt)
end

# 1. Insert movie titles into movie_titles table using simple INSERT STATEMENTS
sql = "INSERT INTO movie_titles (id, year_of_release, title) VALUES(?,?,?)"
# It would be even more efficient to use INSERT statements that
# use VALUES syntax which can insert multiple rows.
# e.g. INSERT INTO movie_titles (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
dbh.prepare(sql) do |sth|
File.open(NETFLIX_MOVIE_TITLES) do |f|
f.each do |line|
# Execute the prepared INSERT Statement
id, year_of_release, title = line.split(",")
sth.execute id.to_i, year_of_release.strip, title.strip
end
end
end

# 2. Insert rating data from training_set using the fast LOAD DATA INFILE syntax
# The filename contains the movie id!
sql = "LOAD DATA INFILE ? INTO TABLE ratings FIELDS TERMINATED BY ',' "+
"IGNORE 1 LINES (customer_id, rating, date) SET movie_id = ?;"
dbh.prepare(sql) do |sth|
Dir[NETFLIX_TRAINING_SET + "/*.txt"].each do |trs_file|
puts "Importing #{trs_file}..."
if trs_file =~ /([0-9]+)\.txt$/
movie_id = $1.dup.to_i
else
raise "Missing movie_id (file: #{trs_file})"
end
# Execute the prepared statement using the .txt file and the movie_id
sth.execute File.expand_path(trs_file), movie_id
end
end

# Add index for certain columns
#puts "Creating index on the ratings table - may take a while"
#["CREATE INDEX mid_idx ON ratings (...)", "CREATE INDEX cid_idx ON ..."].each{|stmt| dbh.do(stmt)}
end
puts "Import successfully finished!\n"
puts "The netflix data import took #{sprintf("%0.2f", (Time.now - start)/60.0)} minutes"

__END__
DROP DATABASE IF EXISTS netflix;
CREATE DATABASE netflix /*!40100 DEFAULT CHARACTER SET latin1 */;
USE netflix;

DROP TABLE IF EXISTS `movie_titles`;
CREATE TABLE `movie_titles` (
`id` int(11) NOT NULL,
`year_of_release` YEAR(4) default NULL,
`title` VARCHAR(250),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

DROP TABLE IF EXISTS `ratings`;
CREATE TABLE `ratings` (
`id` int(11) NOT NULL auto_increment,
`movie_id` int(11) NOT NULL,
`customer_id` int(11) NOT NULL,
`date` date default NULL,
`rating` int(1),
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;


The last part there are the SQL commands to create the tables needed for this script.

I loaded all of the data onto my machine (Dual Intel 2.13 GHz w/ 1GB RAM) in about 13 minutes.