{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# CSCE 470 :: Information Storage and Retrieval :: Texas A&M University :: Fall 2017\n",
    "\n",
    "\n",
    "# Homework 3 and 4 United Forever: Recommenders and Classification!\n",
    "\n",
    "### 200 points [10% of your final grade]\n",
    "\n",
    "### Due: November 16, 2017\n",
    "\n",
    "*Goals of this homework:* Put your knowledge of recommenders and classifiers to work.\n",
    "\n",
    "*Submission Instructions (ecampus):* To submit your homework, rename this notebook as `lastname_firstinitial_hw#.ipynb`. For example, my homework submission would be: `caverlee_j_hw3.ipynb`. Submit this notebook via **ecampus**. Your IPython notebook should be completely self-contained, with the results visible in the notebook. We should not have to run any code from the command line, nor should we have to run your code within the notebook (though we reserve the right to do so)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 1: Recommending Movies\n",
    "\n",
    "For this first part, we're going to use part of the MovieLens 100K dataset. Prior to the Netflix Prize, the MovieLens data was **the** most important collection of movie ratings.\n",
    "\n",
    "First off, we need to load the data (see the data files in the \"Resources\" tab, including u.user, u.item, and ua.base). Here, we provide you with some helper code to load the data using [Pandas](http://pandas.pydata.org/). Pandas is a nice package for Python data analytics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>MovieId</th>\n",
       "      <th>Title</th>\n",
       "      <th>UserId</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Age</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Occupation</th>\n",
       "      <th>ZipCode</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>24</td>\n",
       "      <td>M</td>\n",
       "      <td>technician</td>\n",
       "      <td>85711</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>GoldenEye (1995)</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>24</td>\n",
       "      <td>M</td>\n",
       "      <td>technician</td>\n",
       "      <td>85711</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Four Rooms (1995)</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>24</td>\n",
       "      <td>M</td>\n",
       "      <td>technician</td>\n",
       "      <td>85711</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Get Shorty (1995)</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>24</td>\n",
       "      <td>M</td>\n",
       "      <td>technician</td>\n",
       "      <td>85711</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Copycat (1995)</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>24</td>\n",
       "      <td>M</td>\n",
       "      <td>technician</td>\n",
       "      <td>85711</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   MovieId              Title  UserId  Rating  Age Gender  Occupation ZipCode\n",
       "0        1   Toy Story (1995)       1       5   24      M  technician   85711\n",
       "1        2   GoldenEye (1995)       1       3   24      M  technician   85711\n",
       "2        3  Four Rooms (1995)       1       4   24      M  technician   85711\n",
       "3        4  Get Shorty (1995)       1       3   24      M  technician   85711\n",
       "4        5     Copycat (1995)       1       3   24      M  technician   85711"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Load the user data\n",
    "users_df = pd.read_csv('u.user', sep='|', names=['UserId', 'Age', 'Gender', 'Occupation', 'ZipCode'])\n",
    "\n",
    "# Load the movies data: we will only use movie id and title for this homework\n",
    "movies_df = pd.read_csv('u.item', sep='|', names=['MovieId', 'Title'], usecols=range(2))\n",
    "\n",
    "# Load the ratings data: ignore the timestamps\n",
    "ratings_df = pd.read_csv('ua.base', sep='\\t', names=['UserId', 'MovieId', 'Rating'], usecols=range(3))\n",
    "\n",
    "# Working on three different data frames is a pain\n",
    "# Let us create a single dataset by \"joining\" these three data frames\n",
    "movie_ratings_df = pd.merge(movies_df, ratings_df)\n",
    "movielens_df = pd.merge(movie_ratings_df, users_df)\n",
    "\n",
    "movielens_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1a. Let's Explore the Data [20 points]\n",
    "\n",
    "Before we get to the actual task of building our recommender, let's familiarize ourselves with the MovieLens data.\n",
    "\n",
    "Pandas is really nice, since it lets us do simple aggregates. For example, we can find the top-10 movies with the most ratings like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Title\n",
      "Star Wars (1977)                  495\n",
      "Fargo (1996)                      443\n",
      "Return of the Jedi (1983)         439\n",
      "Contact (1997)                    412\n",
      "English Patient, The (1996)       400\n",
      "Liar Liar (1997)                  398\n",
      "Toy Story (1995)                  392\n",
      "Scream (1996)                     386\n",
      "Independence Day (ID4) (1996)     384\n",
      "Raiders of the Lost Ark (1981)    379\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "print movielens_df.groupby('Title').size().sort_values(ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Top-10 movies\n",
    "OK, can you find the top-10 highest-rated movies (by average rating)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                                           Rating  Ratings  Final\n",
      "Title                                                            \n",
      "Little City (1998)                              5        1    5.0\n",
      "Aiqing wansui (1994)                            5        1    5.0\n",
      "Someone Else's America (1995)                   5        1    5.0\n",
      "They Made Me a Criminal (1939)                  5        1    5.0\n",
      "Prefontaine (1997)                             15        3    5.0\n",
      "Great Day in Harlem, A (1994)                   5        1    5.0\n",
      "Star Kid (1997)                                10        2    5.0\n",
      "Marlene Dietrich: Shadow and Light (1996)       5        1    5.0\n",
      "Saint of Fort Washington, The (1993)           10        2    5.0\n",
      "Santa with Muscles (1996)                      10        2    5.0\n"
     ]
    }
   ],
   "source": [
    "# your code here\n",
    "# Group by (Title, Rating) so each group holds all rows with one rating value for one movie.\n",
    "grouped = movie_ratings_df.groupby(['Title', 'Rating'])\n",
    "\n",
    "arrays = []\n",
    "\n",
    "# For each (title, rating) pair, record the count of ratings and their weighted sum.\n",
    "for (title, rating), group in grouped:\n",
    "    arrays.append({'Title': title, 'Ratings': group.shape[0], 'Rating': rating * group.shape[0]})\n",
    "\n",
    "# Summing per title gives the total rating mass and total rating count;\n",
    "# their ratio is the average rating.\n",
    "output_df = pd.DataFrame(arrays).groupby('Title').sum()\n",
    "output_df['Final'] = output_df['Rating'] / output_df['Ratings']\n",
    "\n",
    "print output_df.sort_values('Final', ascending=False)[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Most polarizing movies\n",
    "Some movies draw a mixed reaction from fans -- some people love them and some people hate them. Let's look for such *polarizing* movies, which have lots of high ratings and lots of low ratings.\n",
    "\n",
    "For this part, let's define a **polarizing movie** as one meeting both of the following conditions:\n",
    "\n",
    "- The count of ratings that are 2, 3, or 4 < the count of ratings that are 1 or 5\n",
    "- |The count of 1 ratings - the count of 5 ratings| < 0.3 * Max(count of 1 ratings, count of 5 ratings)\n",
    "\n",
    "For example, a movie with ratings like:\n",
    "- 1 star = 100 ratings\n",
    "- 2 stars = 10 ratings\n",
    "- 3 stars = 10 ratings\n",
    "- 4 stars = 10 ratings\n",
    "- 5 stars = 80 ratings\n",
    "\n",
    "meets both of our conditions, since 10 + 10 + 10 < 100 + 80 (condition 1) and |100 - 80| < 0.3 * Max(100, 80) (condition 2)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "                  Ratings                                           Title\n",
      "0         [0, 0, 0, 0, 1]                                  Air Bud (1997)\n",
      "1         [2, 0, 0, 0, 4]                           Apartment, The (1960)\n",
      "2         [3, 0, 0, 4, 6]                    Bram Stoker's Dracula (1992)\n",
      "3     [1, 0, 22, 73, 128]                                   Casino (1995)\n",
      "4      [0, 7, 24, 56, 97]                             Citizen Ruth (1996)\n",
      "5       [1, 1, 9, 29, 64]                                 Clueless (1995)\n",
      "6      [0, 5, 33, 50, 89]              Dracula: Dead and Loving It (1995)\n",
      "7         [2, 0, 1, 0, 2]               E.T. the Extra-Terrestrial (1982)\n",
      "8   [6, 19, 29, 119, 179]                  Godfather: Part II, The (1974)\n",
      "9         [0, 0, 0, 0, 1]                      Great Dictator, The (1940)\n",
      "10        [1, 0, 0, 0, 2]                       Leave It to Beaver (1997)\n",
      "11        [0, 0, 1, 0, 2]                                Liar Liar (1997)\n",
      "12        [0, 0, 0, 0, 1]                   Little Lord Fauntleroy (1936)\n",
      "13        [0, 0, 0, 0, 1]                            Mars Attacks! (1996)\n",
      "14        [0, 0, 1, 0, 2]     Maybe, Maybe Not (Bewegte Mann, Der) (1994)\n",
      "15        [1, 1, 0, 1, 3]                   Miracle on 34th Street (1994)\n",
      "16       [2, 0, 4, 3, 11]                            Paradise Road (1997)\n",
      "17        [0, 0, 0, 2, 5]                           Paths of Glory (1957)\n",
      "18      [2, 0, 0, 12, 13]                                   Patton (1970)\n",
      "19        [0, 0, 0, 0, 3]                             Pretty Woman (1990)\n",
      "20  [19, 19, 48, 92, 170]                       Pump Up the Volume (1990)\n",
      "21        [0, 0, 0, 0, 2]                               Saint, The (1997)\n",
      "22        [0, 0, 0, 0, 2]        Savage Nights (Nuits fauves, Les) (1992)\n",
      "23    [3, 5, 21, 72, 168]                              Schizopolis (1996)\n",
      "24      [1, 2, 5, 11, 22]                            Shallow Grave (1994)\n",
      "25    [2, 5, 18, 81, 152]                          She's So Lovely (1997)\n",
      "26        [0, 0, 0, 0, 1]                  Something to Talk About (1995)\n",
      "27        [0, 0, 0, 0, 2]  Star Maker, The (Uomo delle stelle, L') (1995)\n",
      "28  [8, 13, 46, 151, 277]                                 Stargate (1994)\n",
      "29        [1, 0, 0, 0, 1]                             The Innocent (1994)\n",
      "30        [0, 0, 0, 0, 1]                   Thieves (Voleurs, Les) (1996)\n",
      "31   [4, 12, 35, 85, 142]                       To Be or Not to Be (1942)\n",
      "32        [1, 0, 0, 0, 1]       Touki Bouki (Journey of the Hyena) (1973)\n",
      "33        [1, 0, 0, 1, 2]                                   U Turn (1997)\n",
      "34    [2, 7, 24, 77, 141]                      Vampire in Brooklyn (1995)\n",
      "35      [0, 0, 5, 20, 34]                            War Room, The (1993)\n",
      "36        [1, 0, 1, 0, 3]                      Wrong Trousers, The (1993)\n",
      "37      [2, 1, 8, 29, 67]                               Wyatt Earp (1994)\n",
      "38        [2, 1, 0, 0, 2]                             You So Crazy (1994)\n"
     ]
    }
   ],
   "source": [
    "## your code here\n",
    "import numpy as np\n",
    "\n",
    "arrays = []\n",
    "for (title, rating), group in grouped:\n",
    "    arrays.append({'Title': title, 'Ratings': group.shape[0], 'Rating': rating})\n",
    "\n",
    "output = pd.DataFrame(arrays).groupby(['Title', 'Rating', 'Ratings'])\n",
    "\n",
    "def is_polarizing(ratings):\n",
    "    # Condition 1: the middle ratings (2, 3, 4) are outnumbered by the extremes (1, 5).\n",
    "    # Condition 2: the counts of 1s and 5s are within 30% of the larger of the two.\n",
    "    non_polar = ratings[1] + ratings[2] + ratings[3]\n",
    "    return (non_polar < ratings[0] + ratings[4] and\n",
    "            abs(ratings[0] - ratings[4]) < 0.3 * np.maximum(ratings[0], ratings[4]))\n",
    "\n",
    "index_title = ''\n",
    "first = True\n",
    "ratings = [0, 0, 0, 0, 0]\n",
    "polar_array = []\n",
    "\n",
    "for key, item in output:\n",
    "    if first:\n",
    "        index_title = key[0]\n",
    "        first = False\n",
    "    elif index_title != key[0]:\n",
    "        # All rating buckets for the previous title have been seen; test it.\n",
    "        if is_polarizing(ratings):\n",
    "            polar_array.append({'Title': index_title, 'Ratings': ratings})\n",
    "        index_title = key[0]\n",
    "        ratings = [0, 0, 0, 0, 0]\n",
    "    ratings[key[1] - 1] = item['Ratings'].values[0]\n",
    "\n",
    "# The loop only tests a title once the next title appears, so test the last one here.\n",
    "if not first and is_polarizing(ratings):\n",
    "    polar_array.append({'Title': index_title, 'Ratings': ratings})\n",
    "\n",
    "polar_df = pd.DataFrame(polar_array)\n",
    "print polar_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1b: Find the Baseline ratings [30 points]\n",
    "\n",
    "Now let's find some estimated baseline ratings. Recall that the baseline rating for a user x on item i = the overall average rating + item bias for i + user bias for x.\n",
    "\n",
    "For this part, you should find the baseline ratings for several of our user/movie pairs."
   ]
  },
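  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In symbols, restating the definition above with $\\mu$ as the overall average rating, $\\bar{r}_x$ as user $x$'s average rating, and $\\bar{r}_i$ as item $i$'s average rating:\n",
    "\n",
    "$$b_{xi} = \\mu + b_x + b_i = \\mu + (\\bar{r}_x - \\mu) + (\\bar{r}_i - \\mu)$$\n",
    "\n",
    "The cells below compute exactly this quantity for each requested user/movie pair."
   ]
  },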
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "ratings = [0.0, 0.0, 0.0, 0.0, 0.0]\n",
    "num_ratings = 0\n",
    "\n",
    "# Reuse the (Title, Rating, Ratings) groups from Part 1a to compute the overall\n",
    "# average rating: total rating mass divided by total rating count.\n",
    "for key, item in output:\n",
    "    rating_index = item['Rating'].values[0]\n",
    "    current_num_ratings = item['Ratings'].values[0]\n",
    "    ratings[rating_index - 1] += current_num_ratings * rating_index\n",
    "    num_ratings += current_num_ratings\n",
    "\n",
    "average_overall_rating = sum(ratings) / num_ratings\n",
    "\n",
    "# Ratings grouped by UserId (for user averages)\n",
    "movielens_by_user = ratings_df.groupby(['UserId'])\n",
    "\n",
    "# Ratings grouped by MovieId (for movie averages)\n",
    "movielens_by_movie = ratings_df.groupby(['MovieId'])"
   ]
  },
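  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(As a compact reference, the same baseline can be wrapped in a small helper. This is just a sketch of what the explicit cells below compute, using the pandas `.mean()` shortcut instead of manual sums; e.g. `baseline_rating(1, 155)` reproduces the first result.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# A minimal sketch: b_xi = mu + (user mean - mu) + (movie mean - mu).\n",
    "def baseline_rating(user_id, movie_id):\n",
    "    user_mean = movielens_by_user.get_group(user_id)['Rating'].mean()\n",
    "    movie_mean = movielens_by_movie.get_group(movie_id)['Rating'].mean()\n",
    "    return average_overall_rating + (user_mean - average_overall_rating) + (movie_mean - average_overall_rating)"
   ]
  },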
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Baseline rating for user 1 for movie 155:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.17297656087\n"
     ]
    }
   ],
   "source": [
    "## your code here\n",
    "\n",
    "# Average rating given by user 1\n",
    "all_user_ratings = movielens_by_user.get_group(1)['Rating'].apply(lambda x: float(x))\n",
    "sum_user_rating = sum(all_user_ratings)\n",
    "num_user_rating = all_user_ratings.size\n",
    "average_user_rating = sum_user_rating / num_user_rating\n",
    "\n",
    "# Average rating received by movie 155\n",
    "all_movie_ratings = movielens_by_movie.get_group(155)['Rating'].apply(lambda x: float(x))\n",
    "sum_movie_rating = sum(all_movie_ratings)\n",
    "num_movie_rating = all_movie_ratings.size\n",
    "average_movie_rating = sum_movie_rating / num_movie_rating\n",
    "\n",
    "# b_xi = mu + user bias + movie bias\n",
    "print average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Baseline rating for user 6 for movie 492:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "3.90046675197\n"
     ]
    }
   ],
   "source": [
    "## your code here\n",
    "\n",
    "# Average rating given by user 6\n",
    "all_user_ratings = movielens_by_user.get_group(6)['Rating'].apply(lambda x: float(x))\n",
    "sum_user_rating = sum(all_user_ratings)\n",
    "num_user_rating = all_user_ratings.size\n",
    "average_user_rating = sum_user_rating / num_user_rating\n",
    "\n",
    "# Average rating received by movie 492\n",
    "all_movie_ratings = movielens_by_movie.get_group(492)['Rating'].apply(lambda x: float(x))\n",
    "sum_movie_rating = sum(all_movie_ratings)\n",
    "num_movie_rating = all_movie_ratings.size\n",
    "average_movie_rating = sum_movie_rating / num_movie_rating\n",
    "\n",
    "# b_xi = mu + user bias + movie bias\n",
    "print average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Baseline rating for user 21 for movie 164:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2.7054416276\n"
     ]
    }
   ],
   "source": [
    "## your code here\n",
    "\n",
    "# Average rating given by user 21\n",
    "all_user_ratings = movielens_by_user.get_group(21)['Rating'].apply(lambda x: float(x))\n",
    "sum_user_rating = sum(all_user_ratings)\n",
    "num_user_rating = all_user_ratings.size\n",
    "average_user_rating = sum_user_rating / num_user_rating\n",
    "\n",
    "# Average rating received by movie 164\n",
    "all_movie_ratings = movielens_by_movie.get_group(164)['Rating'].apply(lambda x: float(x))\n",
    "sum_movie_rating = sum(all_movie_ratings)\n",
    "num_movie_rating = all_movie_ratings.size\n",
    "average_movie_rating = sum_movie_rating / num_movie_rating\n",
    "\n",
    "# b_xi = mu + user bias + movie bias\n",
    "print average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1c. Please help me make a recommendation decision! [50 points]\n",
    "Suppose you're trying to recommend a movie to my friend Ellen (User 24). You are trying to decide between two movies:\n",
    "\n",
    "- Clueless (367); or\n",
    "- To Kill a Mockingbird (427)\n",
    "\n",
    "To build your recommender, you have many possibilities, including:\n",
    "\n",
    "1. Baseline estimate rating b_xi\n",
    "2. User-user collaborative filtering\n",
    "3. Item-item collaborative filtering\n",
    "4. Latent factor model\n",
    "5. Some other awesome methods ...\n",
    "\n",
    "First off, please make your best guess using the baseline rating estimate approach. Your output should look like:\n",
    "\n",
    "movie 367, rating: 2\n",
    "\n",
    "movie 427, rating: 3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "movie 367, rating: 4.29535968632\n",
      "movie 427, rating: 5.1442330352\n"
     ]
    }
   ],
   "source": [
    "# your code for your baseline recommendation\n",
    "\n",
    "# Compute average rating for user 24\n",
    "all_user_ratings = movielens_by_user.get_group(24)['Rating'].apply(lambda x: float(x))\n",
    "sum_user_rating = sum(all_user_ratings)\n",
    "num_user_rating = all_user_ratings.size\n",
    "average_user_rating = sum_user_rating / num_user_rating\n",
    "\n",
    "# Compute average rating for movie 367\n",
    "all_movie_ratings_one = movielens_by_movie.get_group(367)['Rating'].apply(lambda x: float(x))\n",
    "sum_movie_rating = sum(all_movie_ratings_one)\n",
    "num_movie_rating = all_movie_ratings_one.size\n",
    "average_movie_rating = sum_movie_rating / num_movie_rating\n",
    "\n",
    "be_movie_one = average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)\n",
    "\n",
    "# Compute average rating for movie 427\n",
    "all_movie_ratings_two = movielens_by_movie.get_group(427)['Rating'].apply(lambda x: float(x))\n",
    "sum_movie_rating = sum(all_movie_ratings_two)\n",
    "num_movie_rating = all_movie_ratings_two.size\n",
    "average_movie_rating = sum_movie_rating / num_movie_rating\n",
    "\n",
    "be_movie_two = average_overall_rating + (average_user_rating - average_overall_rating) + (average_movie_rating - average_overall_rating)\n",
    "\n",
    "print \"movie {}, rating: {}\".format(367, be_movie_one)\n",
    "print \"movie {}, rating: {}\".format(427, be_movie_two)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, update your baseline approach by incorporating item-item collaborative filtering. You have many design choices here (e.g., the number of neighbors k). Do your best to make a good recommendation; the weighted-average formula used here is spelled out below."
   ]
  },
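  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, the estimate computed below is the similarity-weighted average over the movies user $x$ has rated:\n",
    "\n",
    "$$\\hat{r}_{xi} = \\frac{\\sum_{j \\in N(x)} s_{ij}\\, r_{xj}}{\\sum_{j \\in N(x)} s_{ij}}$$\n",
    "\n",
    "where $N(x)$ is the set of movies rated by user $x$, $s_{ij}$ is the cosine similarity between the rating vectors of movies $i$ and $j$ (restricted to users who rated both candidate movies and $j$), and $r_{xj}$ is user $x$'s rating of movie $j$. A common refinement is to subtract the baseline $b_{xj}$ inside the sums and add $b_{xi}$ back; the code below uses the plain weighted average."
   ]
  },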
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "movie 367, baseline rating: 4.29535968632 cf rating: 4.34697765759\n",
      "movie 427, baseline rating: 5.1442330352 cf rating: 4.35014425153\n"
     ]
    }
   ],
   "source": [
    "# your code here for augmenting baseline with item-item CF\n",
    "from scipy.spatial.distance import cosine\n",
    "\n",
    "# all_user_ratings_df holds every (movie, rating) pair for user 24.\n",
    "all_user_ratings_df = movielens_by_user.get_group(24)\n",
    "\n",
    "# Ratings by UserId for movie 367 and movie 427.\n",
    "all_movie_ratings_one_df = movielens_by_movie.get_group(367)\n",
    "all_movie_ratings_two_df = movielens_by_movie.get_group(427)\n",
    "\n",
    "index = 0\n",
    "sum_cosine_similarity_and_rating_one = 0.0\n",
    "sum_cosine_similarity_one = 0.0\n",
    "sum_cosine_similarity_and_rating_two = 0.0\n",
    "sum_cosine_similarity_two = 0.0\n",
    "\n",
    "for movie in all_user_ratings_df['MovieId']:\n",
    "\n",
    "    movie_group = movielens_by_movie.get_group(movie)\n",
    "\n",
    "    # Restrict all three rating vectors to the users who rated this movie,\n",
    "    # movie 367, and movie 427, so the vectors line up component-wise.\n",
    "    current_movie_df = movie_group.loc[movie_group['UserId'].isin(all_movie_ratings_one_df['UserId']) & movie_group['UserId'].isin(all_movie_ratings_two_df['UserId'])]\n",
    "    current_movie_one_df = all_movie_ratings_one_df.loc[all_movie_ratings_one_df['UserId'].isin(movie_group['UserId']) & all_movie_ratings_one_df['UserId'].isin(all_movie_ratings_two_df['UserId'])]\n",
    "    current_movie_two_df = all_movie_ratings_two_df.loc[all_movie_ratings_two_df['UserId'].isin(movie_group['UserId']) & all_movie_ratings_two_df['UserId'].isin(all_movie_ratings_one_df['UserId'])]\n",
    "\n",
    "    # Cosine similarity between this movie's ratings and each candidate's ratings.\n",
    "    cosine_one = 1 - cosine(current_movie_df['Rating'].values, current_movie_one_df['Rating'].values)\n",
    "    cosine_two = 1 - cosine(current_movie_df['Rating'].values, current_movie_two_df['Rating'].values)\n",
    "\n",
    "    # User 24's rating of this movie.\n",
    "    current_rating = all_user_ratings_df['Rating'].reset_index(drop=True)[index]\n",
    "\n",
    "    # Accumulate the similarity-weighted ratings and the similarity totals.\n",
    "    sum_cosine_similarity_and_rating_one += cosine_one * float(current_rating)\n",
    "    sum_cosine_similarity_one += cosine_one\n",
    "    sum_cosine_similarity_and_rating_two += cosine_two * float(current_rating)\n",
    "    sum_cosine_similarity_two += cosine_two\n",
    "\n",
    "    index += 1\n",
    "\n",
    "final_rating_one = sum_cosine_similarity_and_rating_one / sum_cosine_similarity_one\n",
    "final_rating_two = sum_cosine_similarity_and_rating_two / sum_cosine_similarity_two\n",
    "\n",
    "print \"movie {}, baseline rating: {} cf rating: {}\".format(367, be_movie_one, final_rating_one)\n",
    "print \"movie {}, baseline rating: {} cf rating: {}\".format(427, be_movie_two, final_rating_two)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BONUS:\n",
    "Can you use a latent factor model to create a new recommendation method? You can try using something like numpy.linalg.svd(...). [Here's an example](http://www.frankcleary.com/svd/) and [here's another one](https://alyssaq.github.io/2015/20150426-simple-movie-recommender-using-svd/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
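  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(For the bonus, here is a minimal sketch of the SVD idea on a tiny made-up matrix: factor the user-item matrix, keep the top $k$ singular values, and read predictions off the low-rank reconstruction. The matrix here is toy data, not the MovieLens ratings.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# A latent-factor sketch on a toy 4-users x 5-movies matrix (0 = unrated).\n",
    "import numpy as np\n",
    "\n",
    "toy = np.array([[5., 3., 0., 1., 4.],\n",
    "                [4., 0., 0., 1., 3.],\n",
    "                [1., 1., 0., 5., 4.],\n",
    "                [1., 0., 5., 4., 0.]])\n",
    "\n",
    "# Full SVD, then truncate to k latent factors.\n",
    "U, s, Vt = np.linalg.svd(toy, full_matrices=False)\n",
    "k = 2\n",
    "reconstructed = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])\n",
    "\n",
    "# The low-rank reconstruction fills in the zero (unrated) entries;\n",
    "# e.g. the predicted score for user 0 on movie 2 is:\n",
    "print reconstructed[0, 2]"
   ]
  },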
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 2: Classification with Yelp review data\n",
    "\n",
    "For this part, given a Yelp review, your task is to implement a classifier to predict if the business category of this review is \"food-relevant\" or not, **only based on the review text**. The data is from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge).\n",
    "\n",
    "## Build the training data\n",
    "\n",
    "First, you will need to download this data file as your training data: [training_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMdzBVTndwenoxQlk)\n",
    "\n",
    "The training data file includes 40,000 Yelp reviews. Each line is a json-encoded review, and **you should only focus on the \"text\" field**. You should tokenize the review text by using the regular expression \"\\W+\". So something like wordlist = re.split('\\W+', text). Do NOT remove stop words. **Do casefolding but no stemming**.\n",
    "\n",
    "The label (class) information of each review is in the \"label\" field. It is **either \"Food-relevant\" or \"Food-irrelevant\"**.\n",
    "\n",
    "## Testing data\n",
    "\n",
    "We provide 100 Yelp reviews here: [testing_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMbXdyTkhrZDN4Wms). The testing data file has the same format as the training data file. Again, you can get the label information in the \"label\" field. Only use it when you evaluate your classifiers."
   ]
  },
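  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(A minimal sketch of the tokenization rule described above: casefold, split on `\\W+`, and drop the empty strings that `re.split` produces at the edges. The sample sentence is just made-up input.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def tokenize(text):\n",
    "    # Casefold, split on non-word characters, and discard empty tokens.\n",
    "    return [t for t in re.split('\\W+', text.lower()) if t]\n",
    "\n",
    "print tokenize(\"Great tacos; I'd go again!\")"
   ]
  },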
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build your Rocchio classifier [60 points]\n",
    "\n",
    "In this part, your job is to implement a Rocchio classifier for \"food-relevant vs. food-irrelevant\". You need to aggregate all the reviews of each class and find the center (the centroid formula is restated below). **Use the normalized raw term frequency**.\n",
    "\n",
    "### What to report\n",
    "\n",
    "* For the entire testing dataset, report the overall accuracy.\n",
    "* For the class \"Food-relevant\", report the precision and recall.\n",
    "* For the class \"Food-irrelevant\", report the precision and recall.\n",
    "\n",
    "We will also grade on the quality of your code. So make sure that your code is clear and readable."
   ]
  },
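  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, Rocchio represents each review $d$ as a term-frequency vector $\\vec{v}(d)$ and uses one centroid per class $c$:\n",
    "\n",
    "$$\\vec{\\mu}_c = \\frac{1}{|D_c|} \\sum_{d \\in D_c} \\vec{v}(d)$$\n",
    "\n",
    "A test review is then assigned to the class whose centroid is closest (e.g., by cosine similarity or Euclidean distance). Note that the implementation below departs from this: it reduces each review to a single scalar (the mean of its normalized term frequencies) rather than keeping one component per term, which likely explains the modest accuracy reported below."
   ]
  },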
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Food relevant centroid: 0.0232983744754\n",
      "Food irrelevant centroid: 0.021168878992\n"
     ]
    }
   ],
   "source": [
    "# Build the Rocchio classifier\n",
    "# Insert as many cells as you want\n",
    "import pandas as pd\n",
    "import re\n",
    "\n",
    "# Load the data\n",
    "training_df = pd.read_json('training_data.json', lines=True)\n",
    "\n",
    "# Isolate the DataFrame for each class.\n",
    "food_relevant_df = training_df.loc[training_df['label'] == \"Food-relevant\"]\n",
    "food_irrelevant_df = training_df.loc[training_df['label'] == \"Food-irrelevant\"]\n",
    "\n",
    "# We only need the text series.\n",
    "food_relevant_text_series = food_relevant_df['text']\n",
    "food_irrelevant_text_series = food_irrelevant_df['text']\n",
    "\n",
    "# Initialise some arrays to hold the per-review means.\n",
    "food_relevant_centroids = []\n",
    "food_irrelevant_centroids = []\n",
    "\n",
    "# Calculate a \"centroid\" per class. Note that DataFrame.mean() skips the\n",
    "# non-numeric 'term' column, so each review is reduced to one scalar:\n",
    "# the mean of its normalized term frequencies.\n",
    "for text in food_relevant_text_series:\n",
    "    food_relevant_tf_series = pd.Series(re.split('\\W+', text.lower().strip())).value_counts(normalize=True)\n",
    "    food_relevant_tf_df = pd.DataFrame({'term': food_relevant_tf_series.index, 'frequency': food_relevant_tf_series.values})\n",
    "    food_relevant_centroids.append(food_relevant_tf_df.mean())\n",
    "food_relevant_centroid = pd.DataFrame({'centroids': food_relevant_centroids}).mean()[0]\n",
    "\n",
    "for text in food_irrelevant_text_series:\n",
    "    food_irrelevant_tf_series = pd.Series(re.split('\\W+', text.lower().strip())).value_counts(normalize=True)\n",
    "    food_irrelevant_tf_df = pd.DataFrame({'term': food_irrelevant_tf_series.index, 'frequency': food_irrelevant_tf_series.values})\n",
    "    food_irrelevant_centroids.append(food_irrelevant_tf_df.mean())\n",
    "food_irrelevant_centroid = pd.DataFrame({'centroids': food_irrelevant_centroids}).mean()[0]\n",
    "\n",
    "print \"Food relevant centroid: {}\\nFood irrelevant centroid: {}\".format(food_relevant_centroid, food_irrelevant_centroid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overall Accuracy: 0.45\n",
      "Food-relevant Precision: 0.709677419355\n",
      "Food-relevant Recall: 0.323529411765\n",
      "Food-irrelevant Precision: 0.333333333333\n",
      "Food-irrelevant Recall: 0.71875\n"
     ]
    }
   ],
   "source": [
    "## Apply your classifier on the test data. Report the results.\n",
    "# Insert as many cells as you want\n",
    "import numpy as np\n",
    "\n",
    "# Load the data\n",
    "testing_df = pd.read_json('testing_data.json', lines=True)\n",
    "\n",
    "# We can skip class isolation and just take the text fields.\n",
    "text_series = testing_df['text']\n",
    "\n",
    "index = 0\n",
    "\n",
    "# Counts for accuracy, precision, and recall. With only two classes,\n",
    "# a false positive for one class is a false negative for the other.\n",
    "food_relevant_true_positives = 0.0\n",
    "food_relevant_false_positives = 0.0\n",
    "food_irrelevant_true_positives = 0.0\n",
    "food_irrelevant_false_positives = 0.0\n",
    "total_items = text_series.size\n",
    "\n",
    "# Reduce each test review to the same scalar summary and predict the class\n",
    "# whose centroid is nearer in absolute difference.\n",
    "for text in text_series:\n",
    "    tf_series = pd.Series(re.split('\\W+', text.lower().strip())).value_counts(normalize=True)\n",
    "    centroid = pd.DataFrame({'term': tf_series.index, 'frequency': tf_series.values}).mean()[0]\n",
    "\n",
    "    actual_label = testing_df['label'][index]\n",
    "    if np.absolute(food_relevant_centroid - centroid) < np.absolute(food_irrelevant_centroid - centroid):\n",
    "        if \"Food-relevant\" == actual_label:\n",
    "            food_relevant_true_positives += 1\n",
    "        else:\n",
    "            food_relevant_false_positives += 1\n",
    "    else:\n",
    "        if \"Food-irrelevant\" == actual_label:\n",
    "            food_irrelevant_true_positives += 1\n",
    "        else:\n",
    "            food_irrelevant_false_positives += 1\n",
    "    index += 1\n",
    "\n",
    "# Print overall accuracy, then precision and recall per class.\n",
    "overall_accuracy = (food_relevant_true_positives + food_irrelevant_true_positives) / total_items\n",
    "food_relevant_precision = food_relevant_true_positives / (food_relevant_true_positives + food_relevant_false_positives)\n",
    "food_relevant_recall = food_relevant_true_positives / (food_relevant_true_positives + food_irrelevant_false_positives)\n",
    "food_irrelevant_precision = food_irrelevant_true_positives / (food_irrelevant_true_positives + food_irrelevant_false_positives)\n",
    "food_irrelevant_recall = food_irrelevant_true_positives / (food_irrelevant_true_positives + food_relevant_false_positives)\n",
    "\n",
    "print \"Overall Accuracy: {}\".format(overall_accuracy)\n",
    "print \"Food-relevant Precision: {}\".format(food_relevant_precision)\n",
    "print \"Food-relevant Recall: {}\".format(food_relevant_recall)\n",
    "print \"Food-irrelevant Precision: {}\".format(food_irrelevant_precision)\n",
    "print \"Food-irrelevant Recall: {}\".format(food_irrelevant_recall)"
   ]
  },
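  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(As the formula cell above notes, the classifier here compares scalar summaries rather than term vectors, which likely explains the near-chance accuracy. For contrast, here is a minimal sketch of Rocchio in the full term space: one normalized-TF vector per review, per-class mean vectors as centroids, and cosine similarity for prediction. It assumes the same training_data.json / testing_data.json files and is left unexecuted here.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Sketch: vector-space Rocchio with normalized raw term frequencies.\n",
    "import json\n",
    "import math\n",
    "import re\n",
    "from collections import Counter, defaultdict\n",
    "\n",
    "def tf_vector(text):\n",
    "    # Normalized raw TF: term counts divided by the review's token count.\n",
    "    tokens = [t for t in re.split('\\W+', text.lower()) if t]\n",
    "    total = float(len(tokens))\n",
    "    return {term: count / total for term, count in Counter(tokens).items()}\n",
    "\n",
    "def train_centroids(path):\n",
    "    # Sum the per-review vectors per class, then divide by the class size.\n",
    "    sums = defaultdict(Counter)\n",
    "    sizes = Counter()\n",
    "    with open(path) as f:\n",
    "        for line in f:\n",
    "            review = json.loads(line)\n",
    "            for term, weight in tf_vector(review['text']).items():\n",
    "                sums[review['label']][term] += weight\n",
    "            sizes[review['label']] += 1\n",
    "    return {label: {t: w / sizes[label] for t, w in vec.items()}\n",
    "            for label, vec in sums.items()}\n",
    "\n",
    "def cosine_sim(u, v):\n",
    "    dot = sum(u[t] * v[t] for t in set(u) & set(v))\n",
    "    norm_u = math.sqrt(sum(w * w for w in u.values()))\n",
    "    norm_v = math.sqrt(sum(w * w for w in v.values()))\n",
    "    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0\n",
    "\n",
    "def classify(text, centroids):\n",
    "    # Predict the class whose centroid is most similar to the review vector.\n",
    "    vec = tf_vector(text)\n",
    "    return max(centroids, key=lambda label: cosine_sim(vec, centroids[label]))"
   ]
  },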
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Improve your Rocchio classifier [40 points]\n",
    "\n",
    "OK, can you improve the quality of your classifier? Your goal here is to experiment with alternative weighting schemes, stopwords, etc. Whatever you like. See if you can improve the quality of your classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Food relevant centroid: 0.0251792676427\n",
      "Food irrelevant centroid: 0.0229806644653\n"
     ]
    }
   ],
   "source": [
    "# Do whatever magic you need to improve your Rocchio classifier\n",
    "import pandas as pd\n",
    "import re\n",
    "\n",
    "# Load the data\n",
    "training_df = pd.read_json('training_data.json', lines=True)\n",
    "\n",
    "# Isolate the DataFrame for each class; reset the index so we can look up votes by position.\n",
    "food_relevant_df = training_df.loc[training_df['label'] == \"Food-relevant\"].reset_index()\n",
    "food_irrelevant_df = training_df.loc[training_df['label'] == \"Food-irrelevant\"].reset_index()\n",
    "\n",
    "# We only need the text series.\n",
    "food_relevant_text_series = food_relevant_df['text']\n",
    "food_irrelevant_text_series = food_irrelevant_df['text']\n",
    "\n",
    "# Arrays to hold the per-review means.\n",
    "food_relevant_centroids = []\n",
    "food_irrelevant_centroids = []\n",
    "\n",
    "# Same scalar summaries as before, but each review is up-weighted by\n",
    "# (1 + useful_votes / 10), so reviews the community found useful count more.\n",
    "index = 0\n",
    "for text in food_relevant_text_series:\n",
    "    food_relevant_tf_series = pd.Series(re.split('\\W+', text.lower().strip())).value_counts(normalize=True)\n",
    "    food_relevant_tf_df = pd.DataFrame({'term': food_relevant_tf_series.index, 'frequency': food_relevant_tf_series.values})\n",
    "    food_relevant_centroids.append(food_relevant_tf_df.mean() * (1 + food_relevant_df['votes'][index]['useful'] / 10.0))\n",
    "    index += 1\n",
    "food_relevant_centroid = pd.DataFrame({'centroids': food_relevant_centroids}).mean()[0]\n",
    "\n",
    "index = 0\n",
    "for text in food_irrelevant_text_series:\n",
    "    food_irrelevant_tf_series = pd.Series(re.split('\\W+', text.lower().strip())).value_counts(normalize=True)\n",
    "    food_irrelevant_tf_df = pd.DataFrame({'term': food_irrelevant_tf_series.index, 'frequency': food_irrelevant_tf_series.values})\n",
    "    food_irrelevant_centroids.append(food_irrelevant_tf_df.mean() * (1 + food_irrelevant_df['votes'][index]['useful'] / 10.0))\n",
    "    index += 1\n",
    "food_irrelevant_centroid = pd.DataFrame({'centroids': food_irrelevant_centroids}).mean()[0]\n",
    "\n",
    "print \"Food relevant centroid: {}\\nFood irrelevant centroid: {}\".format(food_relevant_centroid, food_irrelevant_centroid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overall Accuracy: 0.45\n",
      "Food-relevant Precision: 0.724137931034\n",
      "Food-relevant Recall: 0.308823529412\n",
      "Food-irrelevant Precision: 0.338028169014\n",
      "Food-irrelevant Recall: 0.75\n"
     ]
    }
   ],
   "source": [
    "# Apply your classifier on the test data. Report the results.\n",
    "import numpy as np\n",
    "\n",
    "# Load the data\n",
    "testing_df = pd.read_json('testing_data.json', lines=True)\n",
    "\n",
    "# We can skip class isolation and just take the text fields.\n",
    "text_series = testing_df['text']\n",
    "\n",
    "index = 0\n",
    "\n",
    "# Counts for accuracy, precision, and recall, as in the unweighted version.\n",
    "food_relevant_true_positives = 0.0\n",
    "food_relevant_false_positives = 0.0\n",
    "food_irrelevant_true_positives = 0.0\n",
    "food_irrelevant_false_positives = 0.0\n",
    "total_items = text_series.size\n",
    "\n",
    "# Summarize each test review with the same (1 + useful_votes / 10) weight\n",
    "# and predict the class whose centroid is nearer in absolute difference.\n",
    "for text in text_series:\n",
    "    tf_series = pd.Series(re.split('\\W+', text.lower().strip())).value_counts(normalize=True)\n",
    "    centroid = pd.DataFrame({'term': tf_series.index, 'frequency': tf_series.values}).mean()[0] * (1 + testing_df['votes'][index]['useful'] / 10.0)\n",
    "\n",
    "    actual_label = testing_df['label'][index]\n",
    "    if np.absolute(food_relevant_centroid - centroid) < np.absolute(food_irrelevant_centroid - centroid):\n",
    "        if \"Food-relevant\" == actual_label:\n",
    "            food_relevant_true_positives += 1\n",
    "        else:\n",
    "            food_relevant_false_positives += 1\n",
    "    else:\n",
    "        if \"Food-irrelevant\" == actual_label:\n",
    "            food_irrelevant_true_positives += 1\n",
    "        else:\n",
    "            food_irrelevant_false_positives += 1\n",
    "    index += 1\n",
    "\n",
    "# Print overall accuracy, then precision and recall per class.\n",
    "overall_accuracy = (food_relevant_true_positives + food_irrelevant_true_positives) / total_items\n",
    "food_relevant_precision = food_relevant_true_positives / (food_relevant_true_positives + food_relevant_false_positives)\n",
    "food_relevant_recall = food_relevant_true_positives / (food_relevant_true_positives + food_irrelevant_false_positives)\n",
    "food_irrelevant_precision = food_irrelevant_true_positives / (food_irrelevant_true_positives + food_irrelevant_false_positives)\n",
    "food_irrelevant_recall = food_irrelevant_true_positives / (food_irrelevant_true_positives + food_relevant_false_positives)\n",
    "\n",
    "print \"Overall Accuracy: {}\".format(overall_accuracy)\n",
    "print \"Food-relevant Precision: {}\".format(food_relevant_precision)\n",
    "print \"Food-relevant Recall: {}\".format(food_relevant_recall)\n",
    "print \"Food-irrelevant Precision: {}\".format(food_irrelevant_precision)\n",
    "print \"Food-irrelevant Recall: {}\".format(food_irrelevant_recall)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Explain your strategies.** What did you do? Did it work? Why? Give us your best analysis of the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I was able to obtain a very minimal, likely statistically insignificant improvement by simply adding a small scalar multiplier based on how many \"useful\" votes a review received.\n",
    "\n",
    "It may be possible to increase accuracy further by checking whether reviews contain certain indicative words and scaling the centroid value according to whether they contain them.\n",
    "\n",
    "I expected only nominal accuracy changes, since the weighting scheme I used does not differentiate the data much more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### BONUS:\n",
    "\n",
    "Instead of Rocchio, implement any other classifier you like. How did it work out for you?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}