Hybrid Recommendation Engine
Intro
In this project we build a hybrid recommender system that combines:
1. Content-Based Filtering (tag based)
2. Collaborative Filtering
The goal of this exercise is to recommend items to a particular user by combining the two approaches above. In particular, we implement the paper in the link.
Workflow of the Recommender System
As mentioned, the proposed system consists of the following stages:
1. Data Set Preparation
2. Collaborative Filtering Stage
a. Create user profiles based on user-related tags
b. Cluster users based on the similarity between their profiles
3. Content-Based Filtering Stage
a. Users within a cluster have similar preferences, so we run a content-based recommender within each cluster
b. Create item profiles
c. Compute cosine similarity between user profiles and item profiles
d. Calculate tag recency
4. Model Validation
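The core scoring step in the content-based stage is the cosine similarity between a user profile and an item profile over a shared tag vocabulary. A minimal sketch, assuming hypothetical tag names and TF-IDF weights (the actual values come from the pipeline below):

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two tag-weight vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Hypothetical TF-IDF weights over a shared tag vocabulary ["sale", "shoes", "summer"]
user_profile = np.array([0.8, 0.1, 0.6])   # built from tags the user applied
item_profile = np.array([0.7, 0.0, 0.7])   # built from tags attached to the item

score = cosine_sim(user_profile, item_profile)  # close to 1 = strong match
```

A score near 1 means the user's tag preferences and the item's tag profile point in nearly the same direction, regardless of vector magnitude.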
Experimental Design
1. Train and Test Split
a. 80% training data
b. 20% test data
2. Evaluation Metric: Take-up Rate (%)
a. A "hit" is a match between an actual item and a recommended item
3. User Clustering
4. Remove clusters with fewer than 5 members
5. Evaluation Measures
a. Three Information Retrieval (IR) performance measures are used to evaluate the effectiveness of the recommender system:
i. Recall
ii. Precision
iii. F1 Score
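These three measures follow directly from the hit counts. A small sketch, using hypothetical item ids, of how precision, recall, and F1 would be computed for one user's recommendation list:

```python
def evaluate(recommended, actual):
    """Precision, recall, and F1 for one user.
    `recommended` and `actual` are sets of item ids (hypothetical)."""
    hits = len(recommended & actual)                     # items we got right
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# 4 items recommended, user actually took up 3, of which 2 were recommended
p, r, f1 = evaluate({"M1", "M2", "M3", "M4"}, {"M2", "M4", "M7"})
```

Averaging these per-user values over the 20% test set gives the overall scores.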
Step 1: User-User Similarity Clustering
################################################################################
# User-User Similarity Clustering
################################################################################
import math
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
import joblib

# Columns: customer_id, merch_id, tags
Tags = pd.read_csv("tags_elist.csv", encoding='utf-8')

# UTF: how often each user applied each tag (user-tag frequency)
UTF = (Tags.groupby(['customer_id', 'tags'], as_index=False, sort=False)
           .count()
           .rename(columns={'merch_id': 'tag_count_UTF'})
           [['customer_id', 'tags', 'tag_count_UTF']])

# UDF: inverse user frequency of each tag (the user-side analogue of IDF)
Users_Distinct = Tags[['customer_id', 'tags']].drop_duplicates()
UDF = (Users_Distinct.groupby('tags', as_index=False, sort=False)
                     .count()
                     .rename(columns={'customer_id': 'tag_count_UDF'})
                     [['tags', 'tag_count_UDF']])
log_n_users = math.log10(Tags['customer_id'].nunique())
UDF['UDF'] = log_n_users - np.log10(UDF['tag_count_UDF'])

# TF-IDF style weight per (user, tag), normalised by each user's vector length
UTF = pd.merge(UTF, UDF, on='tags', how='left', sort=False)
UTF['UTF-UDF'] = UTF['tag_count_UTF'] * UTF['UDF']
Vect_len = UTF[['customer_id', 'UTF-UDF']].copy()
Vect_len['UTF-UDF-Sq'] = Vect_len['UTF-UDF'] ** 2
Vect_len = (Vect_len.groupby('customer_id', as_index=False, sort=False)['UTF-UDF-Sq']
                    .sum()
                    .rename(columns={'UTF-UDF-Sq': 'UTF-UDF-Sq-sum'}))
Vect_len['vect_len'] = np.sqrt(Vect_len['UTF-UDF-Sq-sum'])
UTF = pd.merge(UTF, Vect_len, on='customer_id', how='left', sort=False)
UTF['UserTAG_WT'] = UTF['UTF-UDF'] / UTF['vect_len']

# Pairwise user-user cosine similarity on a user x tag weight matrix
Sample = UTF['customer_id'][:10000].drop_duplicates()
user_tag_matrix = (UTF[UTF['customer_id'].isin(Sample)]
                   .pivot_table(index='customer_id', columns='tags',
                                values='UserTAG_WT', fill_value=0))
data_ibs = pd.DataFrame(cosine_similarity(user_tag_matrix),
                        index=user_tag_matrix.index,
                        columns=user_tag_matrix.index)
# Drop a known problem user from both axes of the similarity matrix
data_ibs = data_ibs.drop(index='UID6203', columns='UID6203', errors='ignore')

# K-Means over the similarity rows, persisted to disk for reuse
num_clusters = 25
km = KMeans(n_clusters=num_clusters)
km.fit(data_ibs)
joblib.dump(km, 'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

# Map each sampled customer to their cluster label
Ucluster_Data = pd.DataFrame({'customer_id': data_ibs.index,
                              'clusters': clusters})
Ucluster_Data = Ucluster_Data.sort_values('clusters')
Step 2: Computing User Tag Preferences Within Clusters
################################################################################
# Calculating TF-IDF scores for finding user tag preferences
################################################################################
import math
import numpy as np
import pandas as pd

# Columns: customer_id, merch_id, tags
Tags = pd.read_csv("tags_elist.csv", encoding='utf-8')

# TF: how often each tag is applied to each item
TF = (Tags.groupby(['merch_id', 'tags'], as_index=False, sort=False)
          .count()
          .rename(columns={'customer_id': 'tag_count_TF'})
          [['merch_id', 'tags', 'tag_count_TF']])
# Re-attach customer_id so per-user rows are available later
TF = pd.merge(TF, Tags, on=['tags', 'merch_id'], how='left', sort=False)

# DF/IDF: how many distinct items carry each tag
TAGS_Distinct = Tags[['merch_id', 'tags']].drop_duplicates()
DF = (TAGS_Distinct.groupby('tags', as_index=False, sort=False)
                   .count()
                   .rename(columns={'merch_id': 'tag_count_DF'})
                   [['tags', 'tag_count_DF']])
log_n_items = math.log10(Tags['merch_id'].nunique())
DF['IDF'] = log_n_items - np.log10(DF['tag_count_DF'])

# TF-IDF weight per (item, tag), normalised by each item's vector length
TF = pd.merge(TF, DF, on='tags', how='left', sort=False)
TF['TF-IDF'] = TF['tag_count_TF'] * TF['IDF']
Vect_len = TF[['merch_id', 'TF-IDF']].copy()
Vect_len['TF-IDF-Sq'] = Vect_len['TF-IDF'] ** 2
Vect_len = (Vect_len.groupby('merch_id', as_index=False, sort=False)['TF-IDF-Sq']
                    .sum()
                    .rename(columns={'TF-IDF-Sq': 'TF-IDF-Sq-sum'}))
Vect_len['vect_len'] = np.sqrt(Vect_len['TF-IDF-Sq-sum'])
TF = pd.merge(TF, Vect_len, on='merch_id', how='left', sort=False)
TF['TAG_WT'] = TF['TF-IDF'] / TF['vect_len']

# Aggregate item tag weights into a per-user tag preference, cluster by cluster
user_tag_pref_parts = []
tot_clusters = len(np.unique(clusters))
for cluster in range(tot_clusters):
    print('\ncluster:', cluster, 'out of:', tot_clusters)
    Cluster_data = Ucluster_Data[Ucluster_Data['clusters'] == cluster]
    distinct_cust_cluster = np.unique(Cluster_data['customer_id'])
    for i, customer in enumerate(distinct_cust_cluster, start=1):
        print('user:', i, 'out of:', len(distinct_cust_cluster))
        user_data = TF[TF['customer_id'] == customer]
        # Sum the TAG_WT of every item the user tagged, per tag
        user_data1 = (user_data.groupby('tags', as_index=False, sort=False)['TAG_WT']
                               .sum()
                               .rename(columns={'TAG_WT': 'tag_pref'}))
        user_data1['customer'] = customer
        user_data1['clusters'] = cluster
        user_tag_pref_parts.append(user_data1)

# DataFrame.append is deprecated; concatenate the per-user frames instead
user_tag_pref = pd.concat(user_tag_pref_parts, ignore_index=True)
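With `user_tag_pref` in hand, the remaining step is to rank items for each user by the cosine similarity between the user's tag-preference vector and each item's TF-IDF profile. The post does not show this step, so the following is a sketch under assumed inputs: the tag names, weights, and item ids (`MERCH1`, `MERCH2`) are hypothetical stand-ins for one user's `tag_pref` values and the per-item `TAG_WT` vectors:

```python
import numpy as np
import pandas as pd

# Hypothetical tag-preference vector for one user (from user_tag_pref)
user_pref = pd.Series({'sale': 0.9, 'shoes': 0.2, 'summer': 0.4})
# Hypothetical TAG_WT profiles for two items, over the same tag vocabulary
item_profiles = pd.DataFrame(
    {'sale': [0.8, 0.0], 'shoes': [0.1, 0.9], 'summer': [0.5, 0.1]},
    index=['MERCH1', 'MERCH2'])

def rank_items(user_pref, item_profiles, top_n=1):
    """Rank items for a user by cosine similarity of tag vectors."""
    # Align the user's vector to the item tag columns; missing tags count as 0
    u = user_pref.reindex(item_profiles.columns).fillna(0).to_numpy()
    m = item_profiles.to_numpy()
    sims = m @ u / (np.linalg.norm(m, axis=1) * np.linalg.norm(u))
    return pd.Series(sims, index=item_profiles.index).nlargest(top_n)

recommendations = rank_items(user_pref, item_profiles)
```

In this toy data the user's strong preference for 'sale' and 'summer' makes MERCH1 the top recommendation; within a cluster this would be repeated for every member.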