Overview
This is an O(n log n) algorithm for suffix array construction (or rather, it would be, if instead of ::sort
a 2-pass bucket sort had been used).
It works by first sorting the 2-grams(*), then the 4-grams, then the 8-grams, and so forth, of the original string S
, so in the i-th iteration, we sort the 2i-grams. There can obviously be no more than log2(n) such iterations, and the trick is that sorting the 2i-grams in the i-th step is facilitated by making sure that each comparison of two 2i-grams is done in O(1) time (rather than O(2i) time).
How does it do this? Well, in the first iteration it sorts the 2-grams (aka bigrams), and then performs what is called lexicographic renaming. This means it creates a new array (of length n
) that stores, for each bigram, its rank in the bigram sorting.
Example for lexicographic renaming: Say we have a sorted list of some bigrams {'ab','ab','ca','cd','cd','ea'}
. We then assign ranks (i.e. lexicographic names) by going from left to right, starting with rank 0 and incrementing the rank whenever we encounter a new bigram changes. So the ranks we assign are as follows:
ab : 0
ab : 0 [no change to previous]
ca : 1 [increment because different from previous]
cd : 2 [increment because different from previous]
cd : 2 [no change to previous]
ea : 3 [increment because different from previous]
These ranks are known as lexicographic names.
Now, in the next iteration, we sort 4-grams. This involves a lot of comparisons between different 4-grams. How do we compare two 4-grams? Well, we could compare them character by character. That would be up to 4 operations per comparison. But instead, we compare them by looking up the ranks of the two bigrams contained in them, using the rank table generated in the previous steps. That rank represents the lexicographic rank from the previous 2-gram sort, so if for any given 4-gram, its first 2-gram has a higher rank than the first 2-gram of another 4-gram, then it must be lexicographically greater somewhere in the first two characters. Hence, if for two 4-grams the rank of the first 2-gram is identical, they must be identical in the first two characters. In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-grams.
After sorting, we create new lexicographic names again, this time for the 4-grams.
In the third iteration, we need to sort by 8-grams. Again, two look-ups in the lexicographic rank table from the previous step are sufficient to compare all 8 characters of two given 8-grams.
And so forth. Each iteration i
has two steps:
Sorting by 2i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
We repeat this until all 2i-grams are different. If that happens, we are done. How do we know if all are different? Well, the lexicographic names are an increasing sequence of integers, starting with 0. So if the highest lexicographic name generated in an iteration is the same as n-1
, then each 2i-gram must have been given its own, distinct lexicographic name.
Implementation
Now let's look at the code to confirm all of this. The variables used are as follows: sa[]
is the suffix array we are building. pos[]
is the rank lookup-table (i.e. it contains the lexicographic names), specifically, pos[k]
contains the lexicographic name of the k
-th m-gram of the previous step. tmp[]
is an auxiliary array used to help create pos[]
.
I'll give further explanations between the code lines:
void buildSA()
{
N = strlen(S);
/* This is a loop that initializes sa[] and pos[].
For sa[] we assume the order the suffixes have
in the given string. For pos[] we set the lexicographic
rank of each 1-gram using the characters themselves.
That makes sense, right? */
REP(i, N) sa[i] = i, pos[i] = S[i];
/* Gap is the length of the m-gram in each step, divided by 2.
We start with 2-grams, so gap is 1 initially. It then increases
to 2, 4, 8 and so on. */
for (gap = 1;; gap *= 2)
{
/* We sort by (gap*2)-grams: */
sort(sa, sa + N, sufCmp);
/* We compute the lexicographic rank of each m-gram
that we have sorted above. Notice how the rank is computed
by comparing each n-gram at position i with its
neighbor at i+1. If they are identical, the comparison
yields 0, so the rank does not increase. Otherwise the
comparison yields 1, so the rank increases by 1. */
REP(i, N - 1) tmp[i + 1] = tmp[i] + sufCmp(sa[i], sa[i + 1]);
/* tmp contains the rank by position. Now we map this
into pos, so that in the next step we can look it
up per m-gram, rather than by position. */
REP(i, N) pos[sa[i]] = tmp[i];
/* If the largest lexicographic name generated is
n-1, we are finished, because this means all
m-grams must have been different. */
if (tmp[N - 1] == N - 1) break;
}
}
About the comparison function
The function sufCmp
is used to compare two (2*gap)-grams lexicographically. So in the first iteration it compares bigrams, in the second iteration 4-grams, then 8-grams and so on. This is controlled by gap
, which is a global variable.
A naive implementation of sufCmp
would be this:
bool sufCmp(int i, int j)
{
int pos_i = sa[i];
int pos_j = sa[j];
int end_i = pos_i + 2*gap;
int end_j = pos_j + 2*gap;
if (end_i > N)
end_i = N;
if (end_j > N)
end_j = N;
while (i < end_i && j < end_j)
{
if (S[pos_i] != S[pos_j])
return S[pos_i] < S[pos_j];
pos_i += 1;
pos_j += 1;
}
return (pos_i < N && pos_j < N) ? S[pos_i] < S[pos_j] : pos_i > pos_j;
}
This would compare the (2*gap)-gram at the beginning of the i-th suffix pos_i:=sa[i]
with the one found at the beginning of the j-th suffix pos_j:=sa[j]
. And it would compare them character by character, i.e. comparing S[pos_i]
with S[pos_j]
, then S[pos_i+1]
with S[pos_j+1]
and so on. It continues as long as the characters are identical. Once they differ, it returns 1 if the character in the i-th suffix is smaller than the one in the j-th suffix, 0 otherwise. (Note that return a<b
in a function returning int
means you return 1 if the condition is true, and 0 if it is false.)
The complicated looking condition in the return-statement deals with the case that one of the (2*gap)-grams is located at the end of the string. In this case either pos_i
or pos_j
will reach N
before all (2*gap) characters have been compared, even if all characters up to that point are identical. It will then return 1 if the i-th suffix is at the end, and 0 if the j-th suffix is at the end. This is correct because if all characters are identical, the shorter one is lexicographically smaller. If pos_i
has reached the end, the i-th suffix must be shorter than the j-th suffix.
Clearly, this naive implementation is O(gap), i.e. its complexity is linear in the length of the (2*gap)-grams. The function used in your code, however, uses the lexicographic names to bring this down to O(1) (specifically, down to a maximum of two comparisons):
bool sufCmp(int i, int j)
{
if (pos[i] != pos[j])
return pos[i] < pos[j];
i += gap;
j += gap;
return (i < N && j < N) ? pos[i] < pos[j] : i > j;
}
As you can see, instead of looking up individual characters S[i]
and S[j]
, we check the lexicographic rank of the i-th and j-th suffix. Lexicographic ranks were computed in the previous iteration for gap-grams. So, if pos[i] < pos[j]
, then the i-th suffix sa[i]
must start with a gap-gram that is lexicographically smaller than the gap-gram at the beginning of sa[j]
. In other words, simply by looking up pos[i]
and pos[j]
and comparing them, we have compared the first gap characters of the two suffixes.
If the ranks are identical, we continue by comparing pos[i+gap]
with pos[j+gap]
. This is the same as comparing the next gap characters of the (2*gap)-grams, i.e. the second half. If the ranks are indentical again, the two (2*gap)-grams are indentical, so we return 0. Otherwise we return 1 if the i-th suffix is smaller than the j-th suffix, 0 otherwise.
Example
The following example illustrates how the algorithm operates, and demonstrates in particular the role of the lexicographic names in the sorting algorithm.
The string we want to sort is abcxabcd
. It takes three iterations to generate the suffix array for this. In each iteration, I'll show S
(the string), sa
(the current state of the suffix array) and tmp
and pos
, which represent the lexicographic names.
First, we initialize:
S abcxabcd
sa 01234567
pos abcxabcd
Note how the lexicographic names, which initially represent the lexicographic rank of unigrams, are simply identical to the characters (i.e. the unigrams) themselves.
First iteration:
Sorting sa
, using bigrams as sorting criterion:
sa 04156273
The first two suffixes are 0 and 4 because those are the positions of bigram 'ab'. Then 1 and 5 (positions of bigram 'bc'), then 6 (bigram 'cd'), then 2 (bigram 'cx'). then 7 (incomplete bigram 'd'), then 3 (bigram 'xa'). Clearly, the positions correspond to the order, based solely on character bigrams.
Generating the lexicographic names:
tmp 00112345
As described, lexicographic names are assigned as increasing integers. The first two suffixes (both starting with bigram 'ab') get 0, the next two (both starting with bigram 'bc') get 1, then 2, 3, 4, 5 (each a different bigram).
Finally, we map this according to the positions in sa
, to get pos
:
sa 04156273
tmp 00112345
pos 01350124
(The way pos
is generated is this: Go through sa
from left to right, and use the entry to define the index in pos
. Use the corresponding entry in tmp
to define the value for that index. So pos[0]:=0
, pos[4]:=0
, pos[1]:=1
, pos[5]:=1
, pos[6]:=2
, and so on. The index comes from sa
, the value from tmp
.)
Second iteration:
We sort sa
again, and again we look at bigrams from pos
(which each represents a sequence of two bigrams of the original string).
sa 0451627