Text Categorization

Cosine Coefficient

The cosine coefficient is used to obtain a measure of similarity between a query vector and the various documents in a collection. The same measure can be used to determine the similarity of two documents. Thus, given the term vectors for two documents, DOCi and DOCj, the similarity between them may be defined as:

COSINE(DOCi , DOCj) = A/B
where
A = Sum(Term ik . Term jk)
B = Sqrt(Sum((Term ik)^2)) x Sqrt(Sum((Term jk)^2))
(where k = 1,...,t)
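As an illustration, the definition above can be sketched in Python. The function name cosine, the sample vectors, and the choice to return 0.0 when a document has no terms are assumptions made for this example, not part of the original definition.

```python
import math

def cosine(doc_i, doc_j):
    """Cosine coefficient between two term vectors of the same length t."""
    if len(doc_i) != len(doc_j):
        raise ValueError("term vectors must have the same length t")
    # A: sum over k of Term_ik * Term_jk
    a = sum(x * y for x, y in zip(doc_i, doc_j))
    # B: product of the square roots of the summed squared term weights
    b = math.sqrt(sum(x * x for x in doc_i)) * math.sqrt(sum(y * y for y in doc_j))
    if b == 0:
        # Assumption for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return a / b

# Hypothetical term-weight vectors over the same four terms.
doc_i = [1, 2, 0, 1]
doc_j = [2, 4, 0, 2]   # same direction as doc_i, twice the weights
doc_k = [0, 0, 3, 0]   # shares no terms with doc_i

print(cosine(doc_i, doc_j))  # close to 1: same term proportions
print(cosine(doc_i, doc_k))  # 0.0: no terms in common
```

Because the measure depends only on the direction of the vectors, scaling every weight in a document by the same factor leaves its similarity to other documents unchanged.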

The concept comes from the vector inner product (dot product). Let two vectors R and Q be defined as follows:
R = (R1, R2, ..., Rt)
Q = (Q1, Q2, ..., Qt)

The dot product is:

1) The Component method:
R.Q = R1Q1 + R2Q2 + ... + RtQt

2) The Magnitude/Angle method:
R.Q = |R| . |Q| . cos(θ)

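Thus:
|R| . |Q| . cos(θ) = R1Q1 + R2Q2 + ... + RtQt
cos(θ) = (R1Q1 + R2Q2 + ... + RtQt) / (|R| . |Q|)
where |R| = Sqrt(R1^2 + R2^2 + ... + Rt^2), and likewise for |Q|. This is the same A/B form as the cosine coefficient defined above.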
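The equivalence of the two methods can be checked numerically. The following is a minimal sketch, assuming two small example vectors whose angle is known in advance (45 degrees); the helper names dot, magnitude, and angle_between are hypothetical and chosen only for this illustration.

```python
import math

def dot(r, q):
    # Component method: R1Q1 + R2Q2 + ... + RtQt
    return sum(rk * qk for rk, qk in zip(r, q))

def magnitude(r):
    # |R| is the square root of the dot product of R with itself
    return math.sqrt(dot(r, r))

def angle_between(r, q):
    cos_theta = dot(r, q) / (magnitude(r) * magnitude(q))
    # Clamp to [-1, 1] to guard against floating-point rounding
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)

# Example vectors: R lies on the first axis, Q on the diagonal.
r = [1.0, 0.0]
q = [1.0, 1.0]

theta = angle_between(r, q)
print(math.degrees(theta))                           # approximately 45
print(magnitude(r) * magnitude(q) * math.cos(theta)) # magnitude/angle method
print(dot(r, q))                                     # component method
```

The last two printed values agree up to rounding, which is the identity the derivation relies on.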