Sunday, March 6, 2011

Some Near duplicate detection

I've been reading a lot about Data Mining recently and today I came across Shingling, a method for near duplicate detection.

The algorithm is simple, it creates a set of consecutive substrings of fixed length on 2 two documents and calculates the Jaccard distance between these two sets. The Jaccard distance is a number between 1 and 0, inclusive, calculated as the size of the intersection of two sets divided by the size of their union. i.e |A ∩ B| / |A U B|. If this distance is 1 then the sets are equal.

Here's a simple implementation of the technique in python:

shingles = lambda s: set(s[i:i+3] for i in range(len(s)-2))
jaccard_distance = lambda seta, setb: len(seta & setb)/float(len(seta | setb))

a = raw_input("Original text: ")
b = raw_input("Other text: ")
sa = shingles(a)
sb = shingles(b)
print "These texts are %.2f%% similar" % (jaccard_distance(sa, sb)*100)

Another article I came across is MinHash which is used for the same purpose.