https://leetcode.com/problems/longest-duplicate-substring

This is quite difficult problem in fact and you will be very unlucky if you have this on real interview. However if you familiar with Rabin-Karp algorighm, it will be not so difficult. You can see it in full details on Wikipedia: https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm

Here I briefly explain the idea of it. Imagine, that we have string abcabcddcb, and we want to check if there are two substrings of size 3, which are equal. Idea is to hash this substrings, using rolling hash, where d is our base and q is used to avoid overflow.

  1. for abc we evaluate [ord(a)*d^2 + ord(b)*d^1 + ord(c)*d^0] % q
  2. for bca we evaluate [ord(b)*d^2 + ord(c)*d^1 + ord(a)*d^0] % q Note, that we can evaluate rolling hash in O(1), for more details please see wikipedia. …

However, it can happen that we have collisions, we can falsely get two substrings with the same hash, which are not equal. So, we need to check our candidates.

Binary search for help

Note, that we are asked for the longest duplicate substring. If we found duplicate substring of length 10, it means that there are duplicate substrings of lenths 9,8, .... So, we can use binary search to find the longest one.

Complexity of algorighm is O(n log n) if we assume that probability of collision is low.

How to read the code

I have RabinKarp(text, M,q) function, where text is the string where we search patterns, M is the length we are looking for and q is prime modulo for Rabin-Karp algorighm.

  1. First, we need to choose d, I chose d = 256, because it is more than ord(z).
  2. Then we need to evaluate auxiliary value h, we need it for fast update of rolling hash.
  3. Evalute hash for first window
  4. Evaluate hashes for all other windows in O(n) (that is O(1) for each window), using evaluated h.
  5. We keep all hashes in dictionary: for each hash we keep start indexes of windows.
  6. Finally, we iterate over our dictionary and for each unique hash we check all possible combinations and compare not hashes, but original windows to make sure that it was not a collision.

Now, to the longestDupSubstring(S) function we run binary search, where we run our RabinKarp funcion at most log n times.

class Solution:
    def RabinKarp(self,text, M, q):
        if M == 0: return True
        h, t, d = (1<<(8*M-8))%q, 0, 256

        dic = defaultdict(list)

        for i in range(M): 
            t = (d * t + ord(text[i]))% q

        dic[t].append(i-M+1)

        for i in range(len(text) - M):
            t = (d*(t-ord(text[i])*h) + ord(text[i + M]))% q
            for j in dic[t]:
                if text[i+1:i+M+1] == text[j:j+M]:
                    return (True, text[j:j+M])
            dic[t].append(i+1)
        return (False, "")

    def longestDupSubstring(self, S):
        beg, end = 0, len(S)
        q = (1<<31) - 1 
        Found = ""
        while beg + 1 < end:
            mid = (beg + end)//2
            isFound, candidate = self.RabinKarp(S, mid, q)
            if isFound:
                beg, Found = mid, candidate
            else:
                end = mid

        return Found

Further discussion I did some research in this domain, and there are different ways to handle this problem:

  1. You can solve it, using suffix array in O(n log n) (or even O(n) with suffix trees) complexity, see https://cp-algorithms.com/string/suffix-array.html for suffix array which is a bit easier than suffix trees https://en.wikipedia.org/wiki/Suffix_tree
  2. You can solve it with Tries, complexity To be checked.
  3. I think there is solution where we use KMP, but I have not seen one yet.
  4. If you have more ideas, plese let me know!

If you like the solution, you can upvote it on leetcode discussion section: Problem 1044