Text Categorization

Cosine Coefficient

The cosine coefficient measures the similarity between a query vector and the documents in a collection. The same measure can be used to determine the similarity of two documents. Thus, given the term vectors for two documents, DOCi and DOCj, the similarity between them may be defined as:

COSINE(DOCi, DOCj) = A / B
where
A = Sum(Term ik . Term jk)
B = Sqrt(Sum((Term ik)^2)) x Sqrt(Sum((Term jk)^2))
(each sum is taken over k = 1,...,t)
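
The formula above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name cosine_coefficient and the plain-list representation of term vectors are assumptions made for the example, and returning 0.0 for an all-zero vector is a convention chosen here, since the formula itself is undefined when B = 0.

```python
import math

def cosine_coefficient(doc_i, doc_j):
    """Cosine coefficient between two term vectors of equal length t."""
    if len(doc_i) != len(doc_j):
        raise ValueError("term vectors must have the same length t")
    # A = Sum(Term ik . Term jk)
    a = sum(x * y for x, y in zip(doc_i, doc_j))
    # B = Sqrt(Sum((Term ik)^2)) x Sqrt(Sum((Term jk)^2))
    b = math.sqrt(sum(x * x for x in doc_i)) * math.sqrt(sum(y * y for y in doc_j))
    if b == 0:
        # Assumption: a document with no terms is treated as dissimilar to everything.
        return 0.0
    return a / b

# Two documents with no terms in common have a coefficient of 0.
no_overlap = cosine_coefficient([1, 0], [0, 1])
# Two documents whose term weights are proportional have a coefficient close to 1.
proportional = cosine_coefficient([1, 2, 0], [2, 4, 0])
```

Because the coefficient depends only on the angle between the vectors, a long document and a short document with the same relative term weights are rated as highly similar.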

The concept comes from vector inner products (dot products). Let two vectors R and Q, each with t components, be defined as follows:
R = (R1, R2, ..., Rt)
Q = (Q1, Q2, ..., Qt)

The dot product is:

1) The Component method:
R.Q = R1 Q1 + R2Q2 + ... + RtQt

2) The Magnitude/Angle method:
R.Q = |R| . |Q| . cos(θ)

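Thus:
|R| . |Q| . cos(θ) = R1Q1 + R2Q2 + ... + RtQt
cos(θ) = (R1Q1 + R2Q2 + ... + RtQt) / (|R| . |Q|)

Since |R| = Sqrt(R1^2 + R2^2 + ... + Rt^2), and likewise for |Q|, this is the COSINE formula given earlier, with R and Q as the term vectors of the two documents.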
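
As a numeric check, the two ways of computing the dot product can be compared on a small example. The sketch below is restricted to two-dimensional vectors so that the angle θ can be obtained independently with atan2. The helper names and the sample vectors (3, 4) and (4, 3) are assumptions chosen for illustration.

```python
import math

def dot_by_components(r, q):
    # Component method: R1Q1 + R2Q2 + ... + RtQt
    return sum(rk * qk for rk, qk in zip(r, q))

def dot_by_magnitude_angle(r, q):
    # Magnitude/angle method, for 2-D vectors only:
    # theta is the difference between the directions of the two vectors.
    theta = math.atan2(r[1], r[0]) - math.atan2(q[1], q[0])
    return math.hypot(*r) * math.hypot(*q) * math.cos(theta)

r = (3.0, 4.0)
q = (4.0, 3.0)

component = dot_by_components(r, q)      # 3*4 + 4*3 = 24
angle = dot_by_magnitude_angle(r, q)     # agrees with 24 up to rounding error

# cos(theta) = (R1Q1 + R2Q2) / (|R| . |Q|) = 24 / (5 * 5)
cos_theta = component / (math.hypot(*r) * math.hypot(*q))
```

Here both vectors have magnitude 5, so cos(θ) = 24 / 25 = 0.96, which is the cosine coefficient of the two vectors.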