We finished Part One of this blog post with a simplified example demonstrating how one might measure document similarity across a given corpus, using a document-term matrix as a starting point. Let’s now turn back to R and our larger sample …
BLOG on Litigation Support and eDiscovery Industry
Understanding Near-Duplicate Identification [Part One]
Near-duplicate identification is one of the more common textual analytics tools used in eDiscovery. Not to be confused with document deduplication, which relies on hash values, near-duplicate identification calculates document similarity based on textual content. For example, if you had …
An Easy Formula To Follow When You Have Some Data
It is 4:30 on a Friday afternoon and you have some data you need to review for an eDiscovery project. Time is ticking, deadlines are approaching, and you need your data picked up and processed as soon as possible. When …
