array
binary search
rolling hash
suffix array
]
Leetcode 1923. Longest Common Subpath
Problem statement
https://leetcode.com/problems/???/
Solution 1
For me when I see this problem I thought about problem 1044. Longest Duplicate Substring, so I tried to implement this idea and I succeded in the end.
The idea is given length of substring of length M
create rolling hashes and then check if we have the same hash for all elements in paths
. We will use binary search to do search of M
. The main difficulty here is that we have not letters from a
to z
but numbers from 1
to potentially 100000
, so probability of collision can be quite high. That is why I chose several different primes around 2^31
and check calculate hashes for each of them. In this way we can be sure, that if we get that we have the same hashes for each of this modules, it means that paths are equal. Also we can do early stoppings to break as soon as it happens that intersections of hashes sets is empty.
Update Actually, evaluating hash with two given numbers is equavalent to evaluation hash to multiplication of this numbers. In python we can use as big number as possible, so if we choose big enough module we will pass. However how exaclty to choose this magical module is a good question: I choose number around 2^128 and it works fine for given tests. However we still can have collisions, and I think there is nothing we can do with it because of test case 71: we have paths = [[1, 0]*k, [0,1]*k]
and when we evaluate hashes we have a lot of equal hashes for each of two persons. If we want to really compare strings we need to check O(k)
options for each person, each of length O(M)
and if M > k/2
(and it is, because anser is 2k-1
), then we have complexity like O(k^2*log k)
for full algorithm, which is impossible to get without TLE. So what is the solution? Use different method, like suffix tries.
Complexity
We need O(K)
time to calculate all hashes for given M
, where K = sum(paths[i].length)
is total length of paths. We do binary search, so we have O(K*log T)
complexity, where T
is the smallest length among paths.
Code
class Solution:
def longestCommonSubpath(self, n, paths):
def RabinKarp(arr, M, q):
h, t, d = (1<<(17*M-17))%q, 0, 1<<17
all_hashes = set()
for i in range(M):
t = (d * t + arr[i])% q
all_hashes.add(t)
for i in range(len(arr) - M):
t = (d*(t-arr[i]*h) + arr[i + M])% q
all_hashes.add(t)
return all_hashes
m, mod = len(paths), (1<<128) - 159
beg, end = 0, min(len(p) for p in paths) + 1
while beg + 1 < end:
mid = (beg + end)//2
tt = set.intersection(*[RabinKarp(p, mid, mod) for p in paths])
if len(tt) != 0:
beg = mid
else:
end = mid
return beg
Solution 2, using suffix array
I was not very familiar with prefix array previously, so I thought this will be the good opportunity to investigate this data structure.
The idea is to build so-called suffix array: let us consider for simplicity example banana
. Then what we want to construct is list of all suffixes: banana, anana, nana, ana, na, a
and then sort them in increasing order: we have a, ana, anana, banana, na, nana
. Actually what we keep is order if indexes: (5, 3, 1, 0, 4, 2)
. There are different ways how you can build it:
- Very difficult, and quite heavy to implement with complexity
O(n)
. Moreover hidden constant is quite big and in python it will work quite slow. - There is algorithm with complexity
O(n*log n)
. You can see the idea here https://cp-algorithms.com/string/suffix-array.html, It is easier to code (about 20-30 lines of code) and work OK. - We can make previous approach in
O(n*log^2 n)
time, you can find details here http://web.stanford.edu/class/cs97si/suffix-array.pdf and actually in python it will work even faster than previous approach! And what I like the best that it can be written in very compact way in python, about 10-15 lines. - There is bruteforce approach with complexity
O(n*n*log n)
, which can be coded in couple of lines, but is just too slow.
So, as you probably guessed, I use approach 3 here. Let us consider example [[0,1,2,3,4], [2,3,4], [4,0,1,2,3]]
and do some preprocessing first:
- We create one big list
nums = [0, 1, 2, 3, 4, -1, 2, 3, 4, -2, 4, 0, 1, 2, 3, -3, -inf]
: where we add numbers-1, ... -m
between our arrays and minus infinity in the end. - Also we create array of group indexes
prs = [1, 1, 1, 1, 1, ?, 2, 2, 2, ?, 3, 3, 3, 3, 3, ?]
. Actually any symbol can be at question mark. Using this array we create dictionaryparts
, which given index will give us what part our index inside. - Also we use
compare
function, which will compare strings of equal lengthl
, starting with indexesi
andj
. Suffix array allows us to do it just inO(1)
time, please follow cp-algorithms link I provided earlier. - Create suffix array
c
from nums: actually what it have islog K
layers, and create alsosa
: inversion of transposition of the last layer. - Now we do binary search with parameter
mid
. For this parameter we look at all substrings with lengthmid
: there will beO(K)
of them. Moreover, we have order of sorted prefixes, if we try to sort these substrings, we will have almost the same order, but now instead of strictly increasing we can have equalities: consider ourbanana
case, if we choosemid = 2
, we will havean, an, ba, na, na
, so we compare all adjacent pairs and separate it into groups(an, an), (ba), (na, na)
. - Check every group: if it has less than
m
elements, we can skip this group, because we need at leastm
repetitions. If we have at leastm
elemens, we need to check that indexes cover all parts, that is all persons. Exactly for this we createdparts
dictionary: we look at every index and see what group it is inside. If we covered all groups, we are happy and we markFound
equal toTrue
and break. Then we do binary search step: choose one of two halves.
Complexity
Time complexity to build suffix array is O(K * log^2K)
. Then we do binary search, each time spending O(K)
time, so total time complexity is O(K * log ^2 K + K * log T)
. It barely passes tests. Space complexity is O(K)
. (see notations in previous approach)
Code
class Solution:
def longestCommonSubpath(self, n, paths):
def merge(l):
ls = sorted(list(set(l)))
index = {v: i for i, v in enumerate(ls)}
return [index[v] for v in l]
def suffixArray(s):
line = merge(s)
n, k, ans = len(s), 1, [line]
while max(line) < n - 1:
line = merge([a * (n + 1) + b + 1 for (a, b) in
zip_longest(line, islice(line, k, None), fillvalue = -1)])
ans, k = ans + [line], k << 1
return ans
def compare(i, j, l, k):
k1 = min(k, len(c) - 1)
a = (c[k1][i], c[k1][(i+l-(1<<k))%n])
b = (c[k1][j], c[k1][(j+l-(1<<k))%n])
return 0 if a == b else 1 if a < b else -1
nums, prs = [], []
for i, p in enumerate(paths):
nums.extend(p)
nums.append(-1-i)
prs.extend([i+1]*len(p))
prs.append(-1-i)
nums += [-float("inf")]
n, m = len(nums), len(paths)
c = suffixArray(nums)
sa = [0]*(n+1)
for i, k in enumerate(c[-1]):
sa[k] = i
parts = {j:i for i,j in zip(prs, range(n-1))}
beg, end = 0, min(len(p) for p in paths) + 1
while beg + 1 < end:
mid = (beg + end)//2
mid_log, a = floor(log2(mid)), 0
groups = defaultdict(list)
for i in range(1, n):
q = compare(sa[i-1], sa[i], mid, mid_log)
a += q
groups[a].append(sa[i])
Found = False
for gr in groups.values():
if len(gr) < m: continue
if len(set(parts[ind] for ind in gr)) == m:
Found = True
break
if Found :
beg = mid
else:
end = mid
return beg