# CSCE 470 :: Information Storage and Retrieval :: Texas A&M University :: Fall 2017


# Homework 3 and 4 United Forever:  Recommenders and Classification!

### 200 points [10% of your final grade]

### Due: November 16, 2017

*Goals of this homework:* Put your knowledge of recommenders and classifiers to work. 

*Submission Instructions (ecampus):* To submit your homework, rename this notebook as  `lastname_firstinitial_hw#.ipynb`. For example, my homework submission would be: `caverlee_j_hw3.ipynb`. Submit this notebook via **ecampus**. Your IPython notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so).

# Part 1: Recommending Movies

For this first part, we're going to use part of the Movielens 100k dataset. Prior to the Netflix Prize, the Movielens data was **the** most important collection of movie ratings.

First off, we need to load the data (see the data files in the "Resources" tab, including u.user, u.item, and ua.base). Here, we provide you with some helper code to load the data using [Pandas](http://pandas.pydata.org/). Pandas is a nice package for Python data analytics.

In [1]:
import pandas as pd

# Load the user data
users_df = pd.read_csv('u.user', sep='|', names=['UserId', 'Age', 'Gender', 'Occupation', 'ZipCode'])

# Load the movies data: we will only use movie id and title for this homework
movies_df = pd.read_csv('u.item', sep='|', names=['MovieId', 'Title'], usecols=range(2))

# Load the ratings data: ignore the timestamps
ratings_df = pd.read_csv('ua.base', sep='\t', names=['UserId', 'MovieId', 'Rating'],usecols=range(3))

# Working on three different data frames is a pain
# Let us create a single dataset by "joining" these three data frames
movie_ratings_df = pd.merge(movies_df, ratings_df)
movielens_df = pd.merge(movie_ratings_df, users_df)

movielens_df.head()

   MovieId              Title  UserId  Rating  Age  Gender  Occupation  ZipCode
0        1   Toy Story (1995)       1       5   24       M  technician    85711
1        2   GoldenEye (1995)       1       3   24       M  technician    85711
2        3  Four Rooms (1995)       1       4   24       M  technician    85711
3        4  Get Shorty (1995)       1       3   24       M  technician    85711
4        5     Copycat (1995)       1       3   24       M  technician    85711


## Part 1a. Let's Explore the Data [20 points]

Before we get to the actual task of building our recommender, let's familiarize ourselves with the Movielens data.

Pandas is really nice, since it lets us compute simple aggregates. For example, we can find the top-10 movies with the most ratings like so:

In [2]:
print movielens_df.groupby('Title').size().sort_values(ascending=False)[:10]

Title
Star Wars (1977)                  495
Fargo (1996)                      443
Return of the Jedi (1983)         439
Contact (1997)                    412
English Patient, The (1996)       400
Liar Liar (1997)                  398
Toy Story (1995)                  392
Scream (1996)                     386
Independence Day (ID4) (1996)     384
Raiders of the Lost Ark (1981)    379
dtype: int64


### Top-10 movies
OK, can you find the top-10 highest-rated movies? 

In [3]:
# your code here
grouped = movie_ratings_df.groupby(['Title', 'Rating'])

arrays = []

for (title, rating), group in grouped:
    arrays.append({'Title':title, 'Ratings':group.shape[0], 'Rating':rating * group.shape[0]})

output_df = pd.DataFrame(arrays).groupby('Title').sum()
output_df['Final'] = output_df['Rating'] / output_df['Ratings']

print output_df.sort_values('Final', ascending=False)[:10]

                                            Rating  Ratings  Final
Title                                                             
Little City (1998)                               5        1    5.0
Aiqing wansui (1994)                             5        1    5.0
Someone Else's America (1995)                    5        1    5.0
They Made Me a Criminal (1939)                   5        1    5.0
Prefontaine (1997)                              15        3    5.0
Great Day in Harlem, A (1994)                    5        1    5.0
Star Kid (1997)                                 10        2    5.0
Marlene Dietrich: Shadow and Light (1996)        5        1    5.0
Saint of Fort Washington, The (1993)            10        2    5.0
Santa with Muscles (1996)                       10        2    5.0
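
Every title in the output above has at most three ratings, so a perfect 5.0 average says little about how well liked the movie is. A minimal sketch of one possible refinement follows, run on a small made-up frame rather than the MovieLens data. The `min_ratings` cutoff is a hypothetical parameter that the assignment does not prescribe.

```python
import pandas as pd

# Toy stand-in for movie_ratings_df; the titles and ratings are made up.
toy_df = pd.DataFrame({
    'Title':  ['A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'Rating': [4, 5, 4, 5, 3, 4, 5],
})

def top_rated(df, n=10, min_ratings=1):
    """Return the n titles with the highest mean rating.

    min_ratings is a hypothetical cutoff (not part of the assignment) that
    drops titles with too few ratings for the average to be trusted.
    """
    stats = df.groupby('Title')['Rating'].agg(['mean', 'count'])
    stats = stats[stats['count'] >= min_ratings]
    return stats.sort_values('mean', ascending=False).head(n)

top_all = top_rated(toy_df)                      # 'B' leads on a single rating
top_filtered = top_rated(toy_df, min_ratings=3)  # 'B' is filtered out
```

Whether a cutoff is appropriate, and what value to use, is a judgment call; the question as posed only asks for the highest average.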


### Most polarizing movies
Some movies draw a mixed reaction from fans -- where some people love them and some people hate them. Let's look for such *polarizing* movies that have lots of high ratings and lots of low ratings. 

For this part, let's define a **polarizing movie** as one meeting both of the following conditions:

- The count of ratings that are 2, 3, or 4 < the count of ratings that are 1 or 5
- |The count of 1 ratings - the count of 5 ratings| < 0.3 * Max(count of 1 ratings, count of 5 ratings)

For example, a movie with ratings like:
- 1 star = 100 ratings
- 2 stars = 10 ratings 
- 3 stars = 10 ratings
- 4 stars = 10 ratings
- 5 stars = 80

meets both of our conditions, since 10 + 10 + 10 < 100 + 80 (condition 1) and |100-80| < 0.3 * Max(100,80) (condition 2).
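
The two conditions can be written as a small predicate. This is only a sketch: the function name `is_polarizing` and the list layout (index 0 holds the 1-star count, index 4 the 5-star count) are assumptions made for illustration.

```python
def is_polarizing(counts):
    """counts[i] is the number of (i + 1)-star ratings, for i in 0..4."""
    ones, fives = counts[0], counts[4]
    middle = counts[1] + counts[2] + counts[3]
    # Condition 1: fewer 2/3/4-star ratings than 1/5-star ratings combined.
    cond1 = middle < ones + fives
    # Condition 2: the 1-star and 5-star counts are within 30% of each other.
    cond2 = abs(ones - fives) < 0.3 * max(ones, fives)
    return cond1 and cond2

example = is_polarizing([100, 10, 10, 10, 80])  # the worked example above
lopsided = is_polarizing([0, 0, 0, 0, 1])       # one 5-star rating only
```

A movie with a single 5-star rating passes condition 1 but fails condition 2, because |0 - 1| = 1 is not less than 0.3 * 1.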



In [4]:
## your code here
import numpy as np

arrays = []
for (title, rating), group in grouped:
    arrays.append({'Title':title, 'Ratings':group.shape[0], 'Rating':rating})
    
output = pd.DataFrame(arrays).groupby(['Title', 'Rating', 'Ratings'])

index_title = ''
first = True

ratings = [0,0,0,0,0]

polar_array = []

for key, item in output:
    if first:
        index_title = key[0]
        first = False
    elif index_title != key[0]:
        index_title = key[0]
        non_polar = ratings[1] + ratings[2] + ratings[3]
        if non_polar < ratings[0] or non_polar < ratings[4]:
            if (ratings[0] - ratings[4]) < 0.3 * np.maximum(ratings[0], ratings[4]):
                polar_array.append({'Title':key[0], 'Ratings':ratings})
        ratings = [0,0,0,0,0]
    ratings[key[1] - 1] = item['Ratings'].values[0]
polar_df = pd.DataFrame(polar_array)

print polar_df

                  Ratings                                           Title
0         [0, 0, 0, 0, 1]                                  Air Bud (1997)
1         [2, 0, 0, 0, 4]                           Apartment, The (1960)
2         [3, 0, 0, 4, 6]                    Bram Stoker's Dracula (1992)
3     [1, 0, 22, 73, 128]                                   Casino (1995)
4      [0, 7, 24, 56, 97]                             Citizen Ruth (1996)
5       [1, 1, 9, 29, 64]                                 Clueless (1995)
6      [0, 5, 33, 50, 89]              Dracula: Dead and Loving It (1995)
7         [2, 0, 1, 0, 2]               E.T. the Extra-Terrestrial (1982)
8   [6, 19, 29, 119, 179]                  Godfather: Part II, The (1974)
9         [0, 0, 0, 0, 1]                      Great Dictator, The (1940)
10        [1, 0, 0, 0, 2]                       Leave It to Beaver (1997)
11        [0, 0, 1, 0, 2]                                Liar Liar (1997)
12        [0, 0, 0, 0, 1]             

## Part 1b: Find the Baseline ratings [30 points]

Now let's find some estimated baseline ratings. Recall that the baseline rating for a user x on item i = the overall average rating + item bias for i + user bias for x. 

For this part, you should find the baseline ratings for several of our user/movie pairs.
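
As a minimal sketch of the formula, the function below treats each bias as the difference between a simple average and the overall average. That definition of the biases is an assumption, and the ratings are made up; the column names mirror `ratings_df`.

```python
import pandas as pd

# Made-up ratings; column names mirror ratings_df.
toy_ratings = pd.DataFrame({
    'UserId':  [1, 1, 2, 2, 3],
    'MovieId': [10, 20, 10, 30, 20],
    'Rating':  [5, 3, 4, 2, 1],
})

def baseline_rating(df, user_id, movie_id):
    """Overall mean + user bias + item bias, with biases as mean deviations."""
    mu = df['Rating'].mean()
    user_bias = df.loc[df['UserId'] == user_id, 'Rating'].mean() - mu
    item_bias = df.loc[df['MovieId'] == movie_id, 'Rating'].mean() - mu
    return mu + user_bias + item_bias

estimate = baseline_rating(toy_ratings, user_id=2, movie_id=20)
```

Here the overall mean is 3, user 2 averages 3 (bias 0), and movie 20 averages 2 (bias -1), so the estimate is 2. Nothing in the formula keeps an estimate inside the 1 to 5 range, so a generous user paired with a well-liked movie can exceed 5.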

In [5]:
ratings = [0.0,0.0,0.0,0.0,0.0]
num_ratings = 0

for key, item in output:
    rating_index = item['Rating'].values[0]
    current_num_ratings = item['Ratings'].values[0]
    ratings[rating_index - 1] += current_num_ratings * rating_index
    num_ratings += current_num_ratings

# Overall average rating across all ratings; computed once and reused below
average_overall_rating = sum(ratings) / num_ratings

# Group the ratings by UserId so one user's ratings can be pulled with get_group
movielens_by_user = ratings_df.groupby(['UserId'])

# Group the ratings by MovieId so one movie's ratings can be pulled with get_group
movielens_by_movie = ratings_df.groupby(['MovieId'])

Baseline rating for user 1 for movie 155:

In [6]:
## your code here

all_user_ratings = movielens_by_user.get_group(1)['Rating'].apply(lambda x : float(x))

sum_user_rating = sum(all_user_ratings)

num_user_rating = all_user_ratings.size

average_user_rating = sum_user_rating / num_user_rating

all_movie_ratings = movielens_by_movie.get_group(155)['Rating'].apply(lambda x : float(x))

sum_movie_rating = sum(all_movie_ratings)

num_movie_rating = all_movie_ratings.size

average_movie_rating = sum_movie_rating / num_movie_rating

print average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)

3.17297656087


Baseline rating for user 6 for movie 492:

In [7]:
## your code here

all_user_ratings = movielens_by_user.get_group(6)['Rating'].apply(lambda x : float(x))

sum_user_rating = sum(all_user_ratings)

num_user_rating = all_user_ratings.size

average_user_rating = sum_user_rating / num_user_rating

all_movie_ratings = movielens_by_movie.get_group(492)['Rating'].apply(lambda x : float(x))

sum_movie_rating = sum(all_movie_ratings)

num_movie_rating = all_movie_ratings.size

average_movie_rating = sum_movie_rating / num_movie_rating

print average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)

3.90046675197


Baseline rating for user 21 for movie 164:

In [8]:
## your code here

all_user_ratings = movielens_by_user.get_group(21)['Rating'].apply(lambda x : float(x))

sum_user_rating = sum(all_user_ratings)

num_user_rating = all_user_ratings.size

average_user_rating = sum_user_rating / num_user_rating

all_movie_ratings = movielens_by_movie.get_group(164)['Rating'].apply(lambda x : float(x))

sum_movie_rating = sum(all_movie_ratings)

num_movie_rating = all_movie_ratings.size

average_movie_rating = sum_movie_rating / num_movie_rating

print average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)

2.7054416276


## Part 1c. Please help me make a recommendation decision! [50 points]
Suppose you're trying to recommend a movie to my friend Ellen (User 24). You are trying to decide between two movies:

- Clueless (367); or
- To Kill a Mockingbird (427) 

To build your recommender, you have many possibilities, including:

1. Baseline estimate rating b_xi 
2. User-user collaborative filtering 
3. Item-item collaborative filtering
4. Latent factor model
5. Some other awesome methods ...

First off, please make your best guess using the baseline rating estimate approach. Your output should look like:

movie 367, rating: 2

movie 427, rating: 3

In [9]:
# your code for your baseline recommendation

# Compute average rating for user 24
all_user_ratings = movielens_by_user.get_group(24)['Rating'].apply(lambda x : float(x))

sum_user_rating = sum(all_user_ratings)

num_user_rating = all_user_ratings.size

average_user_rating = sum_user_rating / num_user_rating

# Compute average rating for movie 367
all_movie_ratings_one = movielens_by_movie.get_group(367)['Rating'].apply(lambda x : float(x))

sum_movie_rating = sum(all_movie_ratings_one)

num_movie_rating = all_movie_ratings_one.size

average_movie_rating = sum_movie_rating / num_movie_rating

be_movie_one = average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)

# Compute average rating for movie 427
all_movie_ratings_two = movielens_by_movie.get_group(427)['Rating'].apply(lambda x : float(x))

sum_movie_rating = sum(all_movie_ratings_two)

num_movie_rating = all_movie_ratings_two.size

average_movie_rating = sum_movie_rating / num_movie_rating

be_movie_two = average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)

print "movie {}, rating: {}".format(367, be_movie_one)
print "movie {}, rating: {}".format(427, be_movie_two)

movie 367, rating: 4.29535968632
movie 427, rating: 5.1442330352


Now, update your baseline approach by incorporating item-item collaborative filtering. You have many design choices here (e.g., number of neighbors k, etc.). Do your best to make a good recommendation:

In [10]:
# your code here for augmenting baseline with item-item CF
from scipy.spatial.distance import cosine

# all_user_ratings_df is needed so we can pull each set of information individually.
all_user_ratings_df = movielens_by_user.get_group(24)

# all_movie_ratings_one_df has the information by UserId for 367 and all_movie_ratings_two_df has them for 427
all_movie_ratings_one_df = movielens_by_movie.get_group(367)
all_movie_ratings_two_df = movielens_by_movie.get_group(427)

index = 0
sum_cosine_similarity_and_rating_one = 0.0
sum_cosine_similarity_one = 0.0
sum_cosine_similarity_and_rating_two = 0.0
sum_cosine_similarity_two = 0.0

for movie in all_user_ratings_df['MovieId']:
    
    # Get the df of common users who rated the movie we're looking at.
    current_movie_df = movielens_by_movie.get_group(movie).loc[movielens_by_movie.get_group(movie)['UserId'].isin(all_movie_ratings_one_df['UserId']) & movielens_by_movie.get_group(movie)['UserId'].isin(all_movie_ratings_two_df['UserId'])]
    
    # Get the df of common users who rated movie one (movie 367).
    current_movie_one_df = all_movie_ratings_one_df.loc[all_movie_ratings_one_df['UserId'].isin(movielens_by_movie.get_group(movie)['UserId']) & all_movie_ratings_one_df['UserId'].isin(all_movie_ratings_two_df['UserId'])]
    
    # Get the df of common users who rated movie two (movie 427).
    current_movie_two_df = all_movie_ratings_two_df.loc[all_movie_ratings_two_df['UserId'].isin(movielens_by_movie.get_group(movie)['UserId']) & all_movie_ratings_two_df['UserId'].isin(all_movie_ratings_one_df['UserId'])]
    
    # Compute two cosine similarities
    cosine_one = 1 - cosine(current_movie_df['Rating'].values, current_movie_one_df['Rating'].values)
    cosine_two = 1 - cosine(current_movie_df['Rating'].values, current_movie_two_df['Rating'].values)
    
    # Get the rating of this movie by user 24.
    current_rating = all_user_ratings_df['Rating'].reset_index(drop = True)[index]
    
    # Save our sums.
    sum_cosine_similarity_and_rating_one += cosine_one * float(current_rating)
    sum_cosine_similarity_one += cosine_one
    sum_cosine_similarity_and_rating_two += cosine_two * float(current_rating)
    sum_cosine_similarity_two += cosine_two
    
    index += 1
    
final_rating_one = sum_cosine_similarity_and_rating_one / sum_cosine_similarity_one
final_rating_two = sum_cosine_similarity_and_rating_two / sum_cosine_similarity_two

print "movie {}, baseline rating: {} cf rating: {}".format(367, be_movie_one, final_rating_one)
print "movie {}, baseline rating: {} cf rating: {}".format(427, be_movie_two, final_rating_two)

movie 367, baseline rating: 4.29535968632 cf rating: 4.34697765759
movie 427, baseline rating: 5.1442330352 cf rating: 4.35014425153
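
For comparison, here is a compact sketch of item-item collaborative filtering on a made-up matrix: the predicted rating is the similarity-weighted average of the user's own ratings on the `k` most similar items. Treating unrated cells as 0 inside the cosine is a simplification, and `k=2` is an arbitrary choice; neither comes from the assignment.

```python
import numpy as np

# Made-up user x item rating matrix; 0 marks "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity of two rating columns; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a.dot(b) / denom) if denom else 0.0

def predict_item_item(R, user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k items
    most similar to `item` (k is a free design choice)."""
    rated = [j for j in range(R.shape[1]) if j != item and R[user, j] > 0]
    sims = [(cosine_sim(R[:, item], R[:, j]), j) for j in rated]
    top = sorted(sims, reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return None  # no usable neighbours
    return sum(s * R[user, j] for s, j in top) / total

prediction = predict_item_item(R, user=0, item=2)
```

User 0 rated item 3 (the item most similar to item 2) with a 1, so the prediction lands near the low end of the scale. A fuller version would predict deviations from the baseline estimate rather than raw ratings.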


### BONUS: 
Can you use a latent factor model to create a new recommendation method? You can try using something like numpy.linalg.svd(...). [here's an example](http://www.frankcleary.com/svd/) and [here's another one](https://alyssaq.github.io/2015/20150426-simple-movie-recommender-using-svd/).

In [11]:
# your code here
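
One way to approach the bonus with `numpy.linalg.svd` is a truncated SVD of a mean-filled rating matrix. The sketch below uses made-up data; filling missing cells with the user's mean and keeping `rank=2` factors are both assumptions, and other choices are equally defensible.

```python
import numpy as np

# Made-up user x item ratings; 0 marks "not rated".
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def svd_predict(R, rank=2):
    """Fill missing entries with each user's mean, centre the rows, and
    rebuild the matrix from its top `rank` singular triplets."""
    mask = R > 0
    user_means = R.sum(axis=1) / mask.sum(axis=1)
    filled = np.where(mask, R, user_means[:, None])
    centred = filled - user_means[:, None]
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]).dot(Vt[:rank, :])
    return approx + user_means[:, None]

predicted = svd_predict(R, rank=2)
```

The cells of `predicted` where `R` is 0 are the model's guesses for unseen movies. This sketch assumes every user has rated at least one item; a user with no ratings would cause a division by zero in `user_means`.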

# Part 2: Classification with Yelp review data

For this part, given a Yelp review, your task is to implement a classifier to predict if the business category of this review is "food-relevant" or not, **only based on the review text**. The data is from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge).

## Build the training data

First, you will need to download this data file as your training data: [training_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMdzBVTndwenoxQlk) 

The training data file includes 40,000 Yelp reviews. Each line is a JSON-encoded review, and **you should only focus on the "text" field**. Tokenize the review text with the regular expression `\W+`, for example `wordlist = re.split('\W+', text)`. Do NOT remove stop words. **Do casefolding but no stemming**.
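
A minimal tokenizer following these rules might look like the sketch below. Dropping the empty strings that `re.split` produces when the text starts or ends with punctuation is an assumption; the instructions do not say either way.

```python
import re

def tokenize(text):
    """Casefold and split on runs of non-word characters. No stemming,
    stop words kept. Empty strings at the edges are dropped, which is an
    assumption the assignment text does not spell out."""
    return [tok for tok in re.split(r'\W+', text.lower()) if tok]

tokens = tokenize("The pizza was GREAT, and the staff were great too!")
```

Without the filter, the trailing `!` would leave an empty string as the last token and it would be counted as a term.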

The label (class) information of each review is in the "label" field. It is **either "Food-relevant" or "Food-irrelevant"**.

## Testing data

We provide 100 Yelp reviews here: [testing_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMbXdyTkhrZDN4Wms). The testing data file has the same format as the training data file. Again, you can get the label information in the "label" field. Only use it when you evaluate your classifiers.

## Build your Rocchio classifier [60 points]

In this part, your job is to implement a Rocchio classifier for "food-relevant vs. food-irrelevant". You need to aggregate all the reviews of each class, and find the center. **Use the normalized raw term frequency**.
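
As a rough sketch of the idea, each review becomes a term-frequency vector, each class centroid is the term-by-term average of its reviews' vectors, and a new review is assigned to the nearest centroid. The sketch reads "normalized raw term frequency" as counts divided by review length and uses Euclidean distance; both are assumptions, and the training reviews are made up.

```python
import re
from collections import Counter, defaultdict
from math import sqrt

def tf_vector(text):
    """Raw term counts divided by the review length."""
    tokens = [t for t in re.split(r'\W+', text.lower()) if t]
    counts = Counter(tokens)
    total = float(len(tokens)) or 1.0
    return {term: n / total for term, n in counts.items()}

def train_centroids(texts, labels):
    """Average the per-review vectors of each class, term by term."""
    sums = defaultdict(Counter)
    sizes = Counter(labels)
    for text, label in zip(texts, labels):
        sums[label].update(tf_vector(text))
    return {label: {t: v / sizes[label] for t, v in vec.items()}
            for label, vec in sums.items()}

def distance(vec, centroid):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(vec) | set(centroid)
    return sqrt(sum((vec.get(t, 0.0) - centroid.get(t, 0.0)) ** 2 for t in terms))

def classify(text, centroids):
    """Return the label of the closest centroid."""
    vec = tf_vector(text)
    return min(centroids, key=lambda label: distance(vec, centroids[label]))

# Made-up training reviews, only to exercise the code.
texts = ["great pizza and pasta", "tasty burger and fries",
         "oil change was quick", "friendly dentist and staff"]
labels = ["Food-relevant", "Food-relevant", "Food-irrelevant", "Food-irrelevant"]
centroids = train_centroids(texts, labels)
guess = classify("the pizza was tasty", centroids)
```

The important point is that a centroid is a vector with one weight per term, so two classes can be told apart by which terms they use, not only by how long or repetitive their reviews are.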


### What to report

* For the entire testing dataset, report the overall accuracy.
* For the class "Food-relevant", report the precision and recall.
* For the class "Food-irrelevant", report the precision and recall.

We will also grade on the quality of your code. So make sure that your code is clear and readable.
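
The three metrics can be computed from paired lists of actual and predicted labels. The helper name `evaluate` and the toy labels below are illustrative only.

```python
def evaluate(actual, predicted, positive):
    """Accuracy over all items, plus precision and recall for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / float(len(pairs))
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

actual = ["Food-relevant", "Food-relevant", "Food-irrelevant", "Food-irrelevant"]
predicted = ["Food-relevant", "Food-irrelevant", "Food-irrelevant", "Food-irrelevant"]
accuracy, precision, recall = evaluate(actual, predicted, "Food-relevant")
```

Calling the function once per class gives the per-class precision and recall; the accuracy is the same in both calls.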

In [12]:
# Build the Rocchio classifier
# Insert as many cells as you want
import pandas as pd
import re

# Load the data
training_df = pd.read_json('training_data.json', lines=True)

# Isolate just the DataFrames for each class.
food_relevant_df = training_df.loc[training_df['label'] == "Food-relevant"]
food_irrelevant_df = training_df.loc[training_df['label'] == "Food-irrelevant"]

# I just want to mess with the text series.
food_relevant_text_series = food_relevant_df['text']
food_irrelevant_text_series = food_irrelevant_df['text']

# Initialise some arrays to hold the means. In hindsight, this could just be dataframes too. 
food_relevant_centroids = []
food_irrelevant_centroids = []

# Calculate centroids per class
for text in food_relevant_text_series:
    food_relevant_tf_series = pd.Series(re.split('\W+', text.lower().strip())).value_counts(normalize=True)
    food_relevant_tf_df = pd.DataFrame({'term':food_relevant_tf_series.index, 'frequency':food_relevant_tf_series.values})
    food_relevant_centroids.append(food_relevant_tf_df.mean())
food_relevant_centroid = pd.DataFrame({'centroids':food_relevant_centroids}).mean()[0]
    
for text in food_irrelevant_text_series:
    food_irrelevant_tf_series = pd.Series(re.split('\W+', text.lower().strip())).value_counts(normalize=True)
    food_irrelevant_tf_df = pd.DataFrame({'term':food_irrelevant_tf_series.index, 'frequency':food_irrelevant_tf_series.values})
    food_irrelevant_centroids.append(food_irrelevant_tf_df.mean())
food_irrelevant_centroid = pd.DataFrame({'centroids':food_irrelevant_centroids}).mean()[0]

print "Food relevant centroid: {}\nFood irrelevant centroid: {}".format(food_relevant_centroid, food_irrelevant_centroid)

Food relevant centroid: 0.0232983744754
Food irrelevant centroid: 0.021168878992


In [13]:
## Apply your classifier on the test data. Report the results.
# Insert as many cells as you want
import numpy as np

# Load the data
testing_df = pd.read_json('testing_data.json', lines=True)

# We can skip isolation and just get the text fields.
text_series = testing_df['text']

# Set index to 0 initially
index = 0

# For calculating precision / recall later, last one is just so it's more readable.
food_relevant_true_positives = 0.0
food_relevant_false_positives = 0.0
food_irrelevant_true_positives = 0.0
food_irrelevant_false_positives = 0.0
total_items = text_series.size

# Calculate centroids, make predictions.
for text in text_series:
    tf_series = pd.Series(re.split('\W+', text.lower().strip())).value_counts(normalize=True)
    centroid = pd.DataFrame({'term':tf_series.index, 'frequency':tf_series.values}).mean()[0]
#    print "Computed centroid: {}".format(centroid)
    
    actual_label = testing_df['label'][index]
    if np.absolute(food_relevant_centroid - centroid) < np.absolute(food_irrelevant_centroid - centroid):
        if "Food-relevant" == actual_label:
            food_relevant_true_positives += 1
        else:
            food_relevant_false_positives += 1
#        print "Prediction: Food-relevant"
#        print "Actual: {}".format(actual_label)
    else:
        if "Food-irrelevant" == actual_label:
            food_irrelevant_true_positives += 1
        else:
            food_irrelevant_false_positives += 1
#        print "Prediction: Food-irrelevant"
#        print "Actual: {}".format(actual_label)
    index += 1
    
# Print overall accuracy, then precision and recall per class.
overall_accuracy = (food_relevant_true_positives + food_irrelevant_true_positives)/total_items
food_relevant_precision = (food_relevant_true_positives)/(food_relevant_true_positives + food_relevant_false_positives)
food_relevant_recall = (food_relevant_true_positives)/(food_relevant_true_positives + food_irrelevant_false_positives)
food_irrelevant_precision = (food_irrelevant_true_positives)/(food_irrelevant_true_positives + food_irrelevant_false_positives)
food_irrelevant_recall = (food_irrelevant_true_positives)/(food_irrelevant_true_positives + food_relevant_false_positives)

print "Overall Accuracy: {}".format(overall_accuracy)
print "Food-relevant Precision: {}".format(food_relevant_precision)
print "Food-relevant Recall: {}".format(food_relevant_recall)
print "Food-irrelevant Precision: {}".format(food_irrelevant_precision)
print "Food-irrelevant Recall: {}".format(food_irrelevant_recall)

Overall Accuracy: 0.45
Food-relevant Precision: 0.709677419355
Food-relevant Recall: 0.323529411765
Food-irrelevant Precision: 0.333333333333
Food-irrelevant Recall: 0.71875


## Improve your Rocchio classifier [40 points]

OK, can you improve the quality of your classifier? Your goal here is to experiment with alternative weighting schemes, stopwords, etc. Whatever you like. See if you can improve the quality of your classifier.
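
One alternative weighting scheme to try is TF-IDF, which shrinks the weight of terms that appear in most reviews. The sketch below uses idf(t) = log(N / df(t)) on made-up reviews; the exact idf variant is a design choice, not something the assignment fixes.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return [t for t in re.split(r'\W+', text.lower()) if t]

def idf_table(texts):
    """idf(t) = log(N / df(t)), where df(t) counts reviews containing t."""
    n_docs = len(texts)
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return {term: math.log(n_docs / float(count)) for term, count in df.items()}

def tfidf_vector(text, idf):
    """Length-normalised term frequency scaled by idf; unseen terms get 0."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = float(len(tokens)) or 1.0
    return {t: (n / total) * idf.get(t, 0.0) for t, n in counts.items()}

# Made-up reviews, only to exercise the code.
texts = ["the pizza was great", "the dentist was great", "the pizza and the pasta"]
idf = idf_table(texts)
vec = tfidf_vector("the pizza was great", idf)
```

Because "the" occurs in every review, its weight drops to 0, which has an effect similar to stop-word removal without needing a hand-built list.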

In [14]:
# Do whatever magic you need to improve your rocchio classifier
import pandas as pd
import re

# Load the data
training_df = pd.read_json('training_data.json', lines=True)

#print (training_df['votes'])[0]['useful']

# Isolate just the DataFrames for each class.
food_relevant_df = training_df.loc[training_df['label'] == "Food-relevant"].reset_index()
food_irrelevant_df = training_df.loc[training_df['label'] == "Food-irrelevant"].reset_index()

# I just want to mess with the text series.
food_relevant_text_series = food_relevant_df['text']
food_irrelevant_text_series = food_irrelevant_df['text']

# Initialise some arrays to hold the means. In hindsight, this could just be dataframes too. 
food_relevant_centroids = []
food_irrelevant_centroids = []

# Calculate centroids per class
index = 0
for text in food_relevant_text_series:
    food_relevant_tf_series = pd.Series(re.split('\W+', text.lower().strip())).value_counts(normalize=True)
    food_relevant_tf_df = pd.DataFrame({'term':food_relevant_tf_series.index, 'frequency':food_relevant_tf_series.values})
    food_relevant_centroids.append(food_relevant_tf_df.mean() * (1 + ((food_relevant_df['votes'])[index]['useful'])/ 10.0))
    index += 1
food_relevant_centroid = pd.DataFrame({'centroids':food_relevant_centroids}).mean()[0]
    
index = 0
for text in food_irrelevant_text_series:
    food_irrelevant_tf_series = pd.Series(re.split('\W+', text.lower().strip())).value_counts(normalize=True)
    food_irrelevant_tf_df = pd.DataFrame({'term':food_irrelevant_tf_series.index, 'frequency':food_irrelevant_tf_series.values})
    food_irrelevant_centroids.append(food_irrelevant_tf_df.mean() * (1 + ((food_irrelevant_df['votes'])[index]['useful'])/ 10.0))
    index += 1
food_irrelevant_centroid = pd.DataFrame({'centroids':food_irrelevant_centroids}).mean()[0]

print "Food relevant centroid: {}\nFood irrelevant centroid: {}".format(food_relevant_centroid, food_irrelevant_centroid)


Food relevant centroid: 0.0251792676427
Food irrelevant centroid: 0.0229806644653


In [15]:
# Apply your classifier on the test data. Report the results.
import numpy as np

# Load the data
testing_df = pd.read_json('testing_data.json', lines=True)

# We can skip isolation and just get the text fields.
text_series = testing_df['text']

# Set index to 0 initially
index = 0

# For calculating precision / recall later, last one is just so it's more readable.
food_relevant_true_positives = 0.0
food_relevant_false_positives = 0.0
food_irrelevant_true_positives = 0.0
food_irrelevant_false_positives = 0.0
total_items = text_series.size

# Calculate centroids, make predictions.
for text in text_series:
    tf_series = pd.Series(re.split('\W+', text.lower().strip())).value_counts(normalize=True)
    centroid = pd.DataFrame({'term':tf_series.index, 'frequency':tf_series.values}).mean()[0] * (1 + ((testing_df['votes'])[index]['useful'])/ 10.0)
#    print "Computed centroid: {}".format(centroid)
    
    actual_label = testing_df['label'][index]
    if np.absolute(food_relevant_centroid - centroid) < np.absolute(food_irrelevant_centroid - centroid):
        if "Food-relevant" == actual_label:
            food_relevant_true_positives += 1
        else:
            food_relevant_false_positives += 1
#        print "Prediction: Food-relevant"
#        print "Actual: {}".format(actual_label)
    else:
        if "Food-irrelevant" == actual_label:
            food_irrelevant_true_positives += 1
        else:
            food_irrelevant_false_positives += 1
#        print "Prediction: Food-irrelevant"
#        print "Actual: {}".format(actual_label)
    index += 1
    
# Print overall accuracy, then precision and recall per class.
overall_accuracy = (food_relevant_true_positives + food_irrelevant_true_positives)/total_items
food_relevant_precision = (food_relevant_true_positives)/(food_relevant_true_positives + food_relevant_false_positives)
food_relevant_recall = (food_relevant_true_positives)/(food_relevant_true_positives + food_irrelevant_false_positives)
food_irrelevant_precision = (food_irrelevant_true_positives)/(food_irrelevant_true_positives + food_irrelevant_false_positives)
food_irrelevant_recall = (food_irrelevant_true_positives)/(food_irrelevant_true_positives + food_relevant_false_positives)

print "Overall Accuracy: {}".format(overall_accuracy)
print "Food-relevant Precision: {}".format(food_relevant_precision)
print "Food-relevant Recall: {}".format(food_relevant_recall)
print "Food-irrelevant Precision: {}".format(food_irrelevant_precision)
print "Food-irrelevant Recall: {}".format(food_irrelevant_recall)

Overall Accuracy: 0.45
Food-relevant Precision: 0.724137931034
Food-relevant Recall: 0.308823529412
Food-irrelevant Precision: 0.338028169014
Food-irrelevant Recall: 0.75


**Explain your strategies.** What did you do? Did it work? Why? Give us your best analysis of the results.

I saw only a minimal, statistically insignificant change after adding a small scalar multiplier based on how many useful votes a review received. Overall accuracy stayed at 0.45, while Food-relevant precision rose from 0.710 to 0.724 and Food-relevant recall fell from 0.324 to 0.309.

It may be possible to increase accuracy by checking whether reviews contain certain words and scaling the centroid value according to whether they do.

I did not expect more than nominal accuracy changes, since the weighting scheme I used does little to differentiate the two classes.

...

### BONUS:

Instead of Rocchio, implement any other classifier you like. How did it work out for you?

In [16]:
# your code here
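
One candidate for the bonus is a multinomial Naive Bayes classifier. The sketch below uses add-one smoothing and made-up training reviews; the function names are illustrative and the smoothing constant is an assumption.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return [t for t in re.split(r'\W+', text.lower()) if t]

def train_nb(texts, labels):
    """Collect class counts, per-class term counts, and the vocabulary."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        term_counts[label].update(tokenize(text))
    vocab = set(t for counts in term_counts.values() for t in counts)
    return class_counts, term_counts, vocab

def classify_nb(text, model):
    """Pick the class with the highest log posterior, add-one smoothed."""
    class_counts, term_counts, vocab = model
    n_docs = float(sum(class_counts.values()))
    best_label, best_score = None, float('-inf')
    for label in sorted(class_counts):
        total = sum(term_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / n_docs)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((term_counts[label][tok] + 1.0) / total)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training reviews, only to exercise the code.
texts = ["great pizza and pasta", "tasty burger and fries",
         "oil change was quick", "friendly dentist and staff"]
labels = ["Food-relevant", "Food-relevant", "Food-irrelevant", "Food-irrelevant"]
model = train_nb(texts, labels)
guess = classify_nb("the pizza was tasty", model)
```

Working in log space avoids numerical underflow when a review contains many tokens.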