KEY Discovery Blog

Understanding Near-Duplicate Identification [Part Two]


We finished Part One of this blog post with a simplified example showing how one can measure document similarity across a corpus using a document-term matrix as a starting point. Let’s now turn back to R and our larger sample data set. As an exploratory step, we can visualize the document-term matrix in a two-dimensional space. We can use principal component analysis to summarize the variation in the data and then plot just the first two principal components. Each principal component is a weighted combination of the original term variables, chosen to capture as much of the remaining variance in the data as possible; the first two components together give the best two-dimensional summary of how the documents are distributed across the full n-dimensional term space. Using the first two principal components as our axes, we get the two-dimensional plot displayed at the beginning of this post (above), which represents a rough distribution of our documents.
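As a sketch of this step (the post’s actual code isn’t shown, so the matrix contents and variable names below are purely illustrative), one might run base R’s `prcomp` on a document-term matrix and plot the documents on the first two principal components:

```r
# Illustrative sketch: a tiny document-term matrix (rows = documents,
# columns = term counts). A real one would be built from the corpus,
# e.g. with the tm package's DocumentTermMatrix().
dtm <- matrix(
  c(2, 0, 1, 0,
    1, 1, 0, 0,
    0, 3, 1, 2,
    0, 2, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("contract", "claim", "notice", "filing"))
)

# Principal component analysis; centering is standard. Scaling each
# term column is optional and would fail on zero-variance columns.
pca <- prcomp(dtm, center = TRUE, scale. = FALSE)

# Plot each document's scores on the first two principal components
plot(pca$x[, 1], pca$x[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Documents projected onto first two principal components")
text(pca$x[, 1], pca$x[, 2], labels = rownames(dtm), pos = 3)
```

With a real corpus, documents with similar term profiles land near one another in this plot, which is what makes it a useful first look before any formal near-duplicate scoring.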
