Hybrid Recommendation Engine


Hybrid Recommender System

Intro

In this project we build a hybrid recommender system that combines two approaches:
1. Content-Based Filtering (tag based)
2. Collaborative Filtering
The goal of this exercise is to recommend items to a particular user by combining the two recommenders above.
Specifically, we are going to implement the approach from the paper in the link


Work Flow of the Recommender System

The proposed system combines the two parts mentioned above in a four-stage workflow:
1. Data Set Preparation
2. Collaborative Filtering Stage
a. Create user profiles based on user-related tags
b. Cluster users based on the similarity between them
3. Content-Based Filtering Stage
a. Users within a cluster have similar preferences, so we run the content-based recommender within each cluster.
b. Create item profiles
c. Compute cosine similarity between user profiles and item profiles
d. Calculate tag recency
4. Model Validation

Experimental Design

1. Train and Test Set
a. 80% training data
b. 20% test data
2. Evaluation Metric: Take-up Rate (%)
a. A hit is counted when a recommended item matches an actual item
3. User Clustering
4. Remove clusters with fewer than 5 members
5. Evaluation Measures
a. Three Information Retrieval (IR) performance measures are used to evaluate the effectiveness of the recommender system:
i. Recall
ii. Precision
iii. F1 Score
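
The hit-based measures listed above can be sketched as follows. This is a minimal illustration rather than the paper's exact protocol: the function name `precision_recall_f1` and the toy merchant ids are hypothetical, and a "hit" is assumed to mean a recommended item that also appears among the items the user actually took up in the test set.

```python
def precision_recall_f1(recommended, actual):
    """Return (precision, recall, f1) for a single user.

    recommended: item ids suggested by the recommender
    actual:      item ids the user really took up (test set)
    """
    recommended_set = set(recommended)
    actual_set = set(actual)
    hits = len(recommended_set & actual_set)  # items that were both recommended and taken up

    precision = hits / len(recommended_set) if recommended_set else 0.0
    recall = hits / len(actual_set) if actual_set else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 when there are no hits
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy example with made-up merchant ids
p, r, f1 = precision_recall_f1(["M1", "M2", "M3", "M4"], ["M2", "M4", "M9"])
print(p, r, f1)
```

Averaging these per-user values over all test users gives one possible system-level score; whether to average per user or pool all hits is a design choice the outline above leaves open.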

Step 1: User-User Similarity Clustering
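
The code in this step weights each user's tags with a TF-IDF-style scheme and then scales every user vector to unit length. Written out, using shorthand symbols that are introduced here and do not appear in the code: $c_{u,t}$ is the number of times user $u$ applied tag $t$ (`tag_count_UTF`), $n_t$ is the number of distinct users who applied tag $t$ (`tag_count_UDF`), and $N$ is the number of distinct users.

```latex
\mathrm{UDF}_t = \log_{10} N - \log_{10} n_t = \log_{10}\frac{N}{n_t},
\qquad
w_{u,t} = c_{u,t}\,\mathrm{UDF}_t,
\qquad
\hat{w}_{u,t} = \frac{w_{u,t}}{\sqrt{\sum_{t'} w_{u,t'}^{2}}}
```

Here $\hat{w}_{u,t}$ corresponds to the `UserTAG_WT` column, and the denominator is the per-user `vect_len`.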


################################################################################################################################
# User-User Similarity Clustering
################################################################################################################################

import math

import joblib
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from sklearn.cluster import KMeans

# One row per tagging event; the columns used below are customer_id, merch_id and tags
Tags = pd.read_csv("tags_elist.csv", encoding='utf-8')

# UTF: how often each user applied each tag (user tag frequency)
UTF = Tags.groupby(['customer_id', 'tags'], as_index=False, sort=False).count().rename(columns={'merch_id': 'tag_count_UTF'})[['customer_id', 'tags', 'tag_count_UTF']]

# UDF: log-scaled inverse of the number of distinct users who applied each tag
Users_Distinct = Tags[['customer_id', 'tags']].drop_duplicates()
UDF = Users_Distinct.groupby(['tags'], as_index=False, sort=False).count().rename(columns={'customer_id': 'tag_count_UDF'})[['tags', 'tag_count_UDF']]
a = math.log10(len(np.unique(Tags['customer_id'])))
UDF['UDF'] = a - np.log10(UDF['tag_count_UDF'])

# Weight each (user, tag) pair by UTF * UDF
UTF = pd.merge(UTF, UDF, on='tags', how='left', sort=False)
UTF['UTF-UDF'] = UTF['tag_count_UTF'] * UTF['UDF']

# Length of each user's weight vector, used to normalise the weights
Vect_len = UTF[['customer_id', 'UTF-UDF']].copy()  # copy so the assignment below does not touch a slice of UTF
Vect_len['UTF-UDF-Sq'] = Vect_len['UTF-UDF'] ** 2
Vect_len = Vect_len.groupby('customer_id', as_index=False, sort=False).sum().rename(columns={'UTF-UDF-Sq': 'UTF-UDF-Sq-sum'})[['customer_id', 'UTF-UDF-Sq-sum']]
Vect_len['vect_len'] = np.sqrt(Vect_len['UTF-UDF-Sq-sum'])
UTF = pd.merge(UTF, Vect_len, on='customer_id', how='left', sort=False)
UTF['UserTAG_WT'] = UTF['UTF-UDF'] / UTF['vect_len']

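# Restrict the clustering to the users that appear in the first 10,000 rows
Sample = UTF['customer_id'][:10000].drop_duplicates()

# User x tag matrix: one row per user, one column per tag, 0 where the user never used the tag
user_tag = UTF[UTF['customer_id'].isin(Sample)].pivot_table(index='customer_id', columns='tags', values='UserTAG_WT', fill_value=0).reindex(Sample)

# Pairwise cosine similarity between user profiles (scipy's cosine() is a distance, hence 1 - ...)
data_ibs = pd.DataFrame(index=Sample, columns=Sample, dtype=float)
for u in Sample:
    for v in Sample:
        data_ibs.loc[u, v] = 1 - cosine(user_tag.loc[u], user_tag.loc[v])

# Drop users whose similarity to everyone is undefined (for example an all-zero profile)
undefined = data_ibs.index[data_ibs.isna().all(axis=1)]
data_ibs = data_ibs.drop(index=undefined, columns=undefined)

# Cluster users on their similarity profiles
num_clusters = 25
km = KMeans(n_clusters=num_clusters)
km.fit(data_ibs)

# Persist the fitted model and reload it
joblib.dump(km, 'doc_cluster.pkl')
km = joblib.load('doc_cluster.pkl')
clusters = km.labels_.tolist()

# One row per clustered user with its cluster label, aligned on the rows that were actually fitted
Ucluster_Data = pd.DataFrame({'customer_id': data_ibs.index, 'clusters': clusters}, columns=['customer_id', 'clusters'])
Ucluster_Data = Ucluster_Data.sort_values(['clusters'], ascending=True)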
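
Item 4 of the experimental design removes clusters with fewer than 5 members, which the code above does not yet do. A minimal sketch of that filter, assuming a frame shaped like `Ucluster_Data` (columns `customer_id` and `clusters`); the helper name `drop_small_clusters` and the toy data are hypothetical:

```python
import pandas as pd


def drop_small_clusters(cluster_df, min_members=5):
    """Keep only rows whose cluster has at least `min_members` distinct customers."""
    # Size of each row's cluster, aligned with the original rows
    sizes = cluster_df.groupby('clusters')['customer_id'].transform('nunique')
    return cluster_df[sizes >= min_members].reset_index(drop=True)


# Toy data: cluster 0 has five users, clusters 1 and 2 are too small
toy = pd.DataFrame({
    'customer_id': ['U%d' % i for i in range(8)],
    'clusters': [0, 0, 0, 0, 0, 1, 1, 2],
})
kept = drop_small_clusters(toy)
print(kept)
```

Applied to the real data this would be `Ucluster_Data = drop_small_clusters(Ucluster_Data)` before moving on to the next step.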



Step 2: Computing User Tag Preferences within Each Cluster


################################################################################################################################
# Calculating TF-IDF Scores for Finding User Tag Preference
################################################################################################################################

Tags = pd.read_csv("tags_elist.csv", encoding='utf-8')  # columns used: merch_id, customer_id, tags

# TF: how often each tag was applied to each item
TF = Tags.groupby(['merch_id', 'tags'], as_index=False, sort=False).count().rename(columns={'customer_id': 'tag_count_TF'})[['merch_id', 'tags', 'tag_count_TF']]

# IDF: log-scaled inverse of the number of distinct items carrying each tag
TAGS_Distinct = Tags[['merch_id', 'tags']].drop_duplicates()
DF = TAGS_Distinct.groupby(['tags'], as_index=False, sort=False).count().rename(columns={'merch_id': 'tag_count_DF'})[['tags', 'tag_count_DF']]
a = math.log10(len(np.unique(Tags['merch_id'])))
DF['IDF'] = a - np.log10(DF['tag_count_DF'])
TF = pd.merge(TF, DF, on='tags', how='left', sort=False)
TF['TF-IDF'] = TF['tag_count_TF'] * TF['IDF']

# Length of each item's TF-IDF vector, used to normalise the weights
Vect_len = TF[['merch_id', 'TF-IDF']].copy()  # copy so the assignment below does not touch a slice of TF
Vect_len['TF-IDF-Sq'] = Vect_len['TF-IDF'] ** 2
Vect_len = Vect_len.groupby('merch_id', as_index=False, sort=False).sum().rename(columns={'TF-IDF-Sq': 'TF-IDF-Sq-sum'})[['merch_id', 'TF-IDF-Sq-sum']]
Vect_len['vect_len'] = np.sqrt(Vect_len['TF-IDF-Sq-sum'])
TF = pd.merge(TF, Vect_len, on='merch_id', how='left', sort=False)
TF['TAG_WT'] = TF['TF-IDF'] / TF['vect_len']

# Attach the customers behind each (item, tag) pair, one row per tagging event.
# This merge comes after normalisation so repeated tagging events do not inflate the vector length.
TF = pd.merge(TF, Tags, on=['tags', 'merch_id'], how='left', sort=False)



# Sum the item tag weights per user to get each user's tag preference, cluster by cluster
user_pref_frames = []
tot_clusters = len(np.unique(clusters))
for cluster in range(tot_clusters):
    print('\n', 'cluster: ', cluster, 'out of: ', tot_clusters)
    Cluster_data = Ucluster_Data[Ucluster_Data['clusters'] == cluster]
    distinct_cust_cluster = np.unique(Cluster_data['customer_id'])
    for i, customer in enumerate(distinct_cust_cluster, start=1):
        print("user: ", i, "out of: ", len(distinct_cust_cluster))
        user_data = TF[TF['customer_id'] == customer]
        # Sum only the weight column; other columns hold ids and strings
        user_data1 = user_data.groupby('tags', as_index=False, sort=False)['TAG_WT'].sum().rename(columns={'TAG_WT': 'tag_pref'})
        user_data1['customer'] = customer
        user_data1['clusters'] = cluster
        user_pref_frames.append(user_data1)
        print(user_data1, '\n')

# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
user_tag_pref = pd.concat(user_pref_frames, ignore_index=True)
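
Stage 3c of the workflow compares user profiles with item profiles using cosine similarity. The code above stops at building `user_tag_pref`, so the following is only a sketch of how that comparison might look, not the post's or the paper's implementation. It assumes a per-user preference frame with columns `tags` and `tag_pref` (one row per tag) and an item frame with columns `merch_id`, `tags` and `TAG_WT`; the function name `rank_items_for_user` and all toy values are hypothetical, and tag recency is not modelled.

```python
import numpy as np
import pandas as pd


def rank_items_for_user(user_pref, item_tags, top_n=3):
    """Rank items by cosine similarity between a user's tag preferences and item tag weights."""
    # Item x tag matrix, 0 where an item does not carry a tag
    item_matrix = item_tags.pivot_table(index='merch_id', columns='tags',
                                        values='TAG_WT', aggfunc='sum', fill_value=0.0)
    # User vector in the same tag order; tags unknown to the items are ignored
    user_vec = (user_pref.set_index('tags')['tag_pref']
                .reindex(item_matrix.columns, fill_value=0.0)
                .to_numpy(dtype=float))
    items = item_matrix.to_numpy(dtype=float)

    dots = items @ user_vec
    norms = np.linalg.norm(items, axis=1) * np.linalg.norm(user_vec)
    # Similarity is defined as 0 when either vector has zero length
    scores = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    return pd.Series(scores, index=item_matrix.index).sort_values(ascending=False).head(top_n)


# Toy data with made-up merchants and tags
item_tags = pd.DataFrame({
    'merch_id': ['M1', 'M2', 'M2', 'M3'],
    'tags': ['pizza', 'pizza', 'coffee', 'coffee'],
    'TAG_WT': [1.0, 0.6, 0.8, 1.0],
})
user_pref = pd.DataFrame({'tags': ['pizza', 'coffee'], 'tag_pref': [0.9, 0.1]})
ranked = rank_items_for_user(user_pref, item_tags)
print(ranked)
```

In the hybrid setting, `item_tags` would be restricted to items tagged by users in the same cluster as the target user, which is what makes the content-based step cluster-specific.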
