The cosine similarity and its use in recommendation systems
Cosine similarity measures how similar two objects are by taking the cosine of the angle between the vectors that represent them. It can be used in recommendation systems such as movie and book recommenders. In this article, we will learn what it is and how it can be used to make recommendations by identifying similar items.
Introduction
Recommendation systems are part of our everyday life. Just to cite some common examples, we get recommendations when we shop online on Amazon and when we open YouTube or Netflix. The purpose of these systems is to recommend what we may want to buy based on what we have in our bag, or what video or movie to watch next based on the ones we have already watched.
A recommendation system can be as complicated as we want, such as the ones that use deep learning [1,2], but it can also be simple and based on the similarity between items. One way to calculate the similarity between items is the cosine similarity.
The cosine between two vectors
To understand this metric of similarity, we first need to understand some concepts. Suppose we have a table listing the genres of book 1 and book 2, as in Fig. 3. For each word that appears in the genre column, we add a column to a second table and fill it with 1 if the word is in the book's genre and 0 if it is not. Since the genres are Science Fiction and Fiction, the second table has two columns, Science and Fiction. If we draw a graph whose x-axis is the Science axis and whose y-axis is the Fiction axis, we can associate a point with each book. For example, book 1 will be the blue point with a Science-axis value of 1 and a Fiction-axis value of 1 (Science Fiction). Book 2 will be the yellow point with a Science-axis value of 0 and a Fiction-axis value of 1 (Fiction). We draw a vector from the origin to each point, which we call the book-vector.
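The encoding described above can be sketched in a few lines of Python. This is only an illustration: the `books` dictionary and the names `vocabulary` and `book_vectors` are our own, chosen to mirror the two-book example.

```python
# Hypothetical data mirroring the example: two books and their genre labels.
books = {"Book 1": "Science Fiction", "Book 2": "Fiction"}

# Vocabulary: every distinct word appearing in any genre, in first-seen order.
vocabulary = []
for genre in books.values():
    for word in genre.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One component per vocabulary word: 1 if the word is in the genre, else 0.
book_vectors = {
    title: [1 if word in genre.split() else 0 for word in vocabulary]
    for title, genre in books.items()
}

print(vocabulary)    # ['Science', 'Fiction']
print(book_vectors)  # {'Book 1': [1, 1], 'Book 2': [0, 1]}
```

Each list is a book-vector whose components follow the order of `vocabulary`, so the first component is the Science axis and the second is the Fiction axis.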
Now, we can see that the book-vectors form an angle θ with each other. The cosine of this angle is our measure of similarity, and it is given by:
$$\cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$
where A and B are the vectors we are considering, ||A|| and ||B|| are their norms (lengths), and A_i and B_i are the components of each vector. Book-vector 1 is (1,1), and book-vector 2 is (0,1). Let’s calculate the cosine similarity:
$$\cos\theta = \frac{1 \cdot 0 + 1 \cdot 1}{\sqrt{1^2 + 1^2}\,\sqrt{0^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.71$$
which says two things: first, these vectors have some similarity, and second, θ is 45°, something that we already expected and could verify from the right triangle formed by the two vectors: by the Pythagorean theorem, book-vector 1 has length √2, so the cosine is the adjacent side over the hypotenuse, 1/√2.
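As a quick check of the numbers for book-vectors (1,1) and (0,1), the same calculation can be done with Python's standard library. The variable names are ours and only serve this example.

```python
import math

book_1 = (1, 1)  # Science Fiction
book_2 = (0, 1)  # Fiction

# Numerator: sum of the products of matching components (the dot product).
dot = sum(a * b for a, b in zip(book_1, book_2))

# Denominator: product of the two vector lengths.
norm_1 = math.sqrt(sum(a * a for a in book_1))
norm_2 = math.sqrt(sum(b * b for b in book_2))

cos_theta = dot / (norm_1 * norm_2)
theta_degrees = math.degrees(math.acos(cos_theta))

print(round(cos_theta, 4))      # 0.7071
print(round(theta_degrees, 1))  # 45.0
```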
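If both books were Science Fiction, they would have the same book-vector (1,1) and the cosine would be 1, meaning the vectors point in the same direction. Now suppose book 1 is Science Fiction and book 2 is Terror. Terror shares no word with the other genres, so it needs its own axis: book 1 becomes (1,1,0) and book 2 becomes (0,0,1). Their dot product is 0, so the cosine is 0 and the books have nothing in common. (Without the extra axis, book 2 would be the zero vector (0,0), whose norm is 0, and the cosine would be undefined.) Therefore, high similarity means a cosine close to 1, and low similarity a cosine close to 0.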
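The limiting cases can be checked with a small helper. This is a sketch under our own assumptions: the function name `cosine` is ours, the three components stand for the words (Science, Fiction, Terror), and returning `None` for a zero-length vector is one possible convention, since the cosine is not defined when a norm is 0.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length sequences; None if undefined."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None  # a zero vector has no direction, so no angle exists
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

same = cosine((1, 1, 0), (1, 1, 0))      # identical genres: 1 up to rounding
disjoint = cosine((1, 1, 0), (0, 0, 1))  # no words in common: 0.0
undefined = cosine((1, 1), (0, 0))       # zero vector: None

print(same, disjoint, undefined)
```

Because of floating-point rounding, the first value may print as a number extremely close to 1 rather than exactly 1.0.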
Calculating using Python
We can cite at least two ways to calculate the cosine similarity between two given vectors. One is using numpy:
import numpy as np
from numpy.linalg import norm

# Two example vectors
A = np.array([1, 8])
B = np.array([9, 2])

# Dot product divided by the product of the vector lengths
cos_sim = np.dot(A, B) / (norm(A) * norm(B))
print(f"The cosine similarity is: {cos_sim:.2f}")
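Another way is using sklearn. Its cosine_similarity function expects 2D arrays with one row per vector, so we reshape A and B, and it returns a 2D array of pairwise similarities:
from sklearn.metrics.pairwise import cosine_similarity

# Reshape each 1D vector into a single-row 2D array
cos_sim = cosine_similarity(A.reshape(1, -1), B.reshape(1, -1))
# The result has shape (1, 1); take its only entry
print(f"The cosine similarity is: {cos_sim[0, 0]:.2f}")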
Either way, calculating the cosine similarity is straightforward and, if done by hand, only requires a simple mathematical formula.
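When there are more than two items, it can be convenient to compute all pairwise similarities at once. The following sketch uses only numpy and assumes that no row is the zero vector; the matrix `X` is made-up data, with its first two rows taken from the vectors used in the numpy example.

```python
import numpy as np

# Hypothetical item vectors, one row per item.
X = np.array([
    [1.0, 8.0],
    [9.0, 2.0],
    [2.0, 16.0],  # same direction as row 0, twice the length
])

# Divide each row by its length, so a dot product of rows is their cosine.
unit_rows = X / np.linalg.norm(X, axis=1, keepdims=True)

# Entry (i, j) is the cosine similarity between item i and item j.
similarity = unit_rows @ unit_rows.T

print(np.round(similarity, 2))
```

Rows 0 and 2 get a similarity of 1 even though their lengths differ, which illustrates that the cosine similarity depends only on direction.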
Application as a book recommender
In this repository, I utilize the cosine similarity to build a book recommender.
For this project, the main idea is that a user enters the books they have read and rates each one from 1 to 5, with 5 being the highest rating. This creates a table with the columns book title, author, description, genre, pages, and classification. Once the table is created, we convert it to a pandas DataFrame, normalize the numerical columns using the function normalize(data), and one-hot encode the categorical columns using the function ohe(df, enc_col) (which is basically what we did when we created the other table in Fig. 2). It is important to normalize numerical columns because otherwise a column with large values, such as the number of pages, would dominate the similarity.
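The two preprocessing steps might look like the sketch below. The repository's normalize(data) and ohe(df, enc_col) are not reproduced here, so `min_max_normalize` and `one_hot` are hypothetical stand-ins: we assume min-max scaling for the normalization and pandas' `get_dummies` for the encoding, and the three-book table is made up. Note that this version creates one column per whole genre label rather than one per word.

```python
import pandas as pd

def min_max_normalize(df, num_cols):
    """Rescale each numerical column to the [0, 1] range (min-max scaling)."""
    out = df.copy()
    for col in num_cols:
        lo, hi = out[col].min(), out[col].max()
        # A constant column carries no information; map it to 0 to avoid 0/0.
        out[col] = 0.0 if hi == lo else (out[col] - lo) / (hi - lo)
    return out

def one_hot(df, enc_cols):
    """Replace each categorical column with one 0/1 column per category."""
    return pd.get_dummies(df, columns=enc_cols, dtype=int)

# Hypothetical table of books entered by a user.
books = pd.DataFrame({
    "title": ["Book 1", "Book 2", "Book 3"],
    "genre": ["Science Fiction", "Fiction", "Fiction"],
    "pages": [200, 400, 600],
    "rating": [5, 3, 4],
})

features = one_hot(min_max_normalize(books, ["pages", "rating"]), ["genre"])
print(features)
```

After this step, the page counts 200, 400, and 600 become 0.0, 0.5, and 1.0, so they sit on the same scale as the 0/1 genre columns.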
The function recommend(book_id, owner_id, df) calculates the cosine similarity between the book chosen by the user and all the other books in the table, stores these values in a new column, and then selects the 10 most similar books to recommend to the user. The cosine similarity calculation itself is defined by the function cosine_sim(v1, v2) and is similar to what we did with numpy. The only difference is that the vectors here have more than 2 dimensions.
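The ranking step can be illustrated with a simplified sketch. It is not the repository's code: `top_similar` is a hypothetical helper that works on a plain array of feature vectors instead of a DataFrame, the `cosine_sim` below is our own stand-in that returns 0.0 when a vector has zero length (an assumption), and the four feature vectors are made up.

```python
import numpy as np

def cosine_sim(v1, v2):
    """Cosine of the angle between v1 and v2; 0.0 if either has zero length."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

def top_similar(book_idx, vectors, n=10):
    """Return (index, similarity) pairs for the n books most similar to book_idx."""
    scores = [
        (i, cosine_sim(vectors[book_idx], v))
        for i, v in enumerate(vectors)
        if i != book_idx  # do not recommend the book itself
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical feature vectors: [Science, Fiction, Terror, normalized pages]
vectors = np.array([
    [1, 1, 0, 0.2],  # 0: the book the user picked
    [1, 1, 0, 0.9],  # 1: same genre, more pages
    [0, 1, 0, 0.3],  # 2: shares only the Fiction word
    [0, 0, 1, 0.2],  # 3: different genre
])

recommendations = top_similar(0, vectors, n=2)
print(recommendations)
```

With this data, book 1 ranks first and book 2 second, while book 3 is left out because it shares almost nothing with the chosen book.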
You can run the API and test the recommendations.
Conclusions
The cosine similarity gives a useful measure of how similar two objects are. It is a rather simple mathematical concept and easy to implement computationally. It can be used for many purposes: in machine learning as a similarity measure (or, taken as 1 minus the similarity, as a distance), with textual data to compare two documents, and in recommendation systems.
References
[1] Da’u, A., Salim, N. Recommendation system based on deep learning methods: a systematic review and new directions. Artif Intell Rev 53, 2709–2748 (2020). https://doi.org/10.1007/s10462-019-09744-1
[2] Schedl, M. Deep Learning in Music Recommendation Systems. Frontiers in Applied Mathematics and Statistics 5 (2019). https://doi.org/10.3389/fams.2019.00044