Leetcode 1923. Longest Common Subpath

Problem statement

Solution 1

For me when I see this problem I thought about problem 1044. Longest Duplicate Substring, so I tried to implement this idea and I succeded in the end.

The idea is given length of substring of length M create rolling hashes and then check if we have the same hash for all elements in paths. We will use binary search to do search of M. The main difficulty here is that we have not letters from a to z but numbers from 1 to potentially 100000, so probability of collision can be quite high. That is why I chose several different primes around 2^31 and check calculate hashes for each of them. In this way we can be sure, that if we get that we have the same hashes for each of this modules, it means that paths are equal. Also we can do early stoppings to break as soon as it happens that intersections of hashes sets is empty.

Update Actually, evaluating hash with two given numbers is equavalent to evaluation hash to multiplication of this numbers. In python we can use as big number as possible, so if we choose big enough module we will pass. However how exaclty to choose this magical module is a good question: I choose number around 2^128 and it works fine for given tests. However we still can have collisions, and I think there is nothing we can do with it because of test case 71: we have paths = [[1, 0]*k, [0,1]*k] and when we evaluate hashes we have a lot of equal hashes for each of two persons. If we want to really compare strings we need to check O(k) options for each person, each of length O(M) and if M > k/2 (and it is, because anser is 2k-1), then we have complexity like O(k^2*log k) for full algorithm, which is impossible to get without TLE. So what is the solution? Use different method, like suffix tries.

Complexity

We need O(K) time to calculate all hashes for given M, where K = sum(paths[i].length) is total length of paths. We do binary search, so we have O(K*log T) complexity, where T is the smallest length among paths.

Code

class Solution:
    def longestCommonSubpath(self, n, paths):
        def RabinKarp(arr, M, q):
            h, t, d = (1<<(17*M-17))%q, 0, 1<<17
            all_hashes = set()

            for i in range(M): 
                t = (d * t + arr[i])% q

            all_hashes.add(t)

            for i in range(len(arr) - M):
                t = (d*(t-arr[i]*h) + arr[i + M])% q
                all_hashes.add(t)

            return all_hashes
    
        m, mod = len(paths), (1<<128) - 159
        beg, end = 0, min(len(p) for p in paths) + 1
        
        while beg + 1 < end:
            mid = (beg + end)//2
            tt = set.intersection(*[RabinKarp(p, mid, mod) for p in paths])

            if len(tt) != 0:
                beg = mid
            else:
                end = mid

        return beg

Solution 2, using suffix array

I was not very familiar with prefix array previously, so I thought this will be the good opportunity to investigate this data structure.

The idea is to build so-called suffix array: let us consider for simplicity example banana. Then what we want to construct is list of all suffixes: banana, anana, nana, ana, na, a and then sort them in increasing order: we have a, ana, anana, banana, na, nana. Actually what we keep is order if indexes: (5, 3, 1, 0, 4, 2). There are different ways how you can build it:

Very difficult, and quite heavy to implement with complexity O(n). Moreover hidden constant is quite big and in python it will work quite slow.
There is algorithm with complexity O(n*log n). You can see the idea here https://cp-algorithms.com/string/suffix-array.html, It is easier to code (about 20-30 lines of code) and work OK.
We can make previous approach in O(n*log^2 n) time, you can find details here http://web.stanford.edu/class/cs97si/suffix-array.pdf and actually in python it will work even faster than previous approach! And what I like the best that it can be written in very compact way in python, about 10-15 lines.
There is bruteforce approach with complexity O(n*n*log n), which can be coded in couple of lines, but is just too slow.

So, as you probably guessed, I use approach 3 here. Let us consider example [[0,1,2,3,4], [2,3,4], [4,0,1,2,3]] and do some preprocessing first:

We create one big list nums = [0, 1, 2, 3, 4, -1, 2, 3, 4, -2, 4, 0, 1, 2, 3, -3, -inf]: where we add numbers -1, ... -m between our arrays and minus infinity in the end.
Also we create array of group indexes prs = [1, 1, 1, 1, 1, ?, 2, 2, 2, ?, 3, 3, 3, 3, 3, ?]. Actually any symbol can be at question mark. Using this array we create dictionary parts, which given index will give us what part our index inside.
Also we use compare function, which will compare strings of equal length l, starting with indexes i and j. Suffix array allows us to do it just in O(1) time, please follow cp-algorithms link I provided earlier.
Create suffix array c from nums: actually what it have is log K layers, and create also sa: inversion of transposition of the last layer.
Now we do binary search with parameter mid. For this parameter we look at all substrings with length mid: there will be O(K) of them. Moreover, we have order of sorted prefixes, if we try to sort these substrings, we will have almost the same order, but now instead of strictly increasing we can have equalities: consider our banana case, if we choose mid = 2, we will have an, an, ba, na, na, so we compare all adjacent pairs and separate it into groups (an, an), (ba), (na, na).
Check every group: if it has less than m elements, we can skip this group, because we need at least m repetitions. If we have at least m elemens, we need to check that indexes cover all parts, that is all persons. Exactly for this we created parts dictionary: we look at every index and see what group it is inside. If we covered all groups, we are happy and we mark Found equal to True and break. Then we do binary search step: choose one of two halves.

Complexity

Time complexity to build suffix array is O(K * log^2K). Then we do binary search, each time spending O(K) time, so total time complexity is O(K * log ^2 K + K * log T). It barely passes tests. Space complexity is O(K). (see notations in previous approach)

Code

class Solution:
    def longestCommonSubpath(self, n, paths):
        def merge(l):
            ls = sorted(list(set(l)))
            index = {v: i for i, v in enumerate(ls)}
            return [index[v] for v in l]

        def suffixArray(s):
            line = merge(s)
            n, k, ans = len(s), 1, [line]
            while max(line) < n - 1:
                line = merge([a * (n + 1) + b + 1 for (a, b) in
                     zip_longest(line, islice(line, k, None), fillvalue = -1)])
                ans, k = ans + [line], k << 1
            return ans
        
        def compare(i, j, l, k):
            k1 = min(k, len(c) - 1)
            a = (c[k1][i], c[k1][(i+l-(1<<k))%n])
            b = (c[k1][j], c[k1][(j+l-(1<<k))%n])
            return 0 if a == b else 1 if a < b else -1
        
        nums, prs = [], []
        for i, p in enumerate(paths):
            nums.extend(p)
            nums.append(-1-i)
            prs.extend([i+1]*len(p))
            prs.append(-1-i)
        nums += [-float("inf")]
        
        n, m = len(nums), len(paths)
        
        c = suffixArray(nums)
        sa = [0]*(n+1)
        for i, k in enumerate(c[-1]):
            sa[k] = i
            
        parts = {j:i for i,j in zip(prs, range(n-1))}
                
        beg, end = 0, min(len(p) for p in paths) + 1
        
        while beg + 1 < end:
            mid = (beg + end)//2
            mid_log, a = floor(log2(mid)), 0

            groups = defaultdict(list)
            for i in range(1, n):
                q = compare(sa[i-1], sa[i], mid, mid_log)
                a += q
                groups[a].append(sa[i])

            Found = False

            for gr in groups.values():
                if len(gr) < m: continue
                if len(set(parts[ind] for ind in gr)) == m:
                    Found = True
                    break
                    
            if Found :
                beg = mid
            else:
                end = mid

        return beg