字符串匹配的并行算法

(并行与分布式计算六)

字符串的预备知识

字符串: 有限长度的字符序列

字符表(或字母表 alphabet)

字符串Y的长度: |Y|

两条字符串X和Y的连接为 XY

字符串Y中第i个字符: Y(i)(1 <= i <= |Y|)

字符串Y的子字符串: Y(i:j)(1 <= i <= j <= |Y|)

字符串Y的前缀(prefix): Y(1:i)(1 <= i <= m)

字符串Y的后缀(suffix): Y(j:m)(1 <= j <= m)

字符串匹配问题及其串行算法

字符串X在字符串Y中出现，是指存在位置i，使得X(j)=Y(i+j-1)对所有1 <= j <= |X|成立

字符串X在位置i与Y匹配(match)

KMP算法：O(|X|+|Y|)复杂度

字符串的周期性(periodicities in string)

如果$Y = X^kX’$，其中$X’$是$X$的一个前缀，则$|X|$是$Y$的一个周期
如果p是Y的最短周期，则简称p是Y的周期(没有加“一个”在前面): The period(周期)，a period(一个周期)
如果Y的周期不大于|Y|/2，就称Y是周期的(periodic)
e.g: abcaabcab不是周期的(最短周期是abcaabc)，abcabcab是周期的(周期p = 3)

字符串的周期引理:

如果p和q都是Y的一个周期且p+q <= |Y|, 则p和q的最大公约数也是Y的一个周期

证明:

由p和q都是Y的一个周期，可得|p-q|也是Y的一个周期: Y(i) = Y(i-p) = Y(i-q) ==> Y(j) = Y(j+p-q)

由最大公约数的欧几里德算法gcd(p, q) = gcd(p-q, q)可以得到结论

推论1：如果X的周期为p且X在Y的i和j位置出现，则|i-j| >= p
推论2：如果X的周期为p且在Y的i和i+d(d <= m-p)位置出现, 则d是p的倍数; 如果0 <= d <= |X|/2, 则X也在i+kp出现(其中k是满足kp <= d的整数)

(区别)见证数组(Witness Array)

长度为m，周期为p的字符串Y，定义Pi(Y) = min{p, $\lceil \frac{m}{2} \rceil$}
见证函数Wit(i)定义：
Wit(1) = 0
Wit(i) = k($2 \leq i \leq Pi(Y)$), k是满足$Y(k) \not= Y(i + k - 1)$的一个下标
见证函数数组WITNESS(1:r)(r = Pi(Y))
e.g: abcaabcab，见证数组WITNESS = (0,3,2,2,5)
e.g: abcabcab，见证数组WITNESS = (0,3,2)

(区别)见证数组的基本思想:

如果匹配两个位置的子字符串时，可以由某个特定的位置排除这两个位置的其中一个的匹配性，就可以减少字符串匹配的复杂度
这个特定的位置，称为（区别）见证位置(如果在见证位置不同，则不可能匹配)

见证位置与对决

Z字符串为文本字符串，Y为模式；利用见证位置排除掉Z中两处潜在的匹配中的一处的过程称为对决：

对决算法 Algorithm 7.1

(Duel(i, j))
Input:
(1)A string Z of length n;
(2)a WITNESS array of another string Y of length m <= n;
(3)two indices i and j, where 1 <= i < j <= n - m, such that j - i < Pi(Y) = min{p, $\lceil \frac{m}{2} \rceil$} and p is the period of Y.
Output: One of i or j(that may match); string Y cannot occur in Z at the eliminated position.
begin
1. Set k := WITNESS(j - i + 1)
2. if Z(j + k - 1) != Y(k) then return(i)
                           else return(j)
end

T = O(1), W = O(1)

e.g. (example 7.5)

字符串Y = abcabcab, 见证数组WITNESS = (0, 3, 2)
字符串Z = abcaabcabaa
考虑Z的位置i = 5和j = 7
WITNESS(7 - 5 + 1) = 2, Y(2) = b != Z(8) = a, 对决算法输出下标i = 5(可能匹配的位置)

字符串匹配(string matching)的两种分析

string-matching problem: 在Y中找X的所有出现的位置

称Y为待匹配的文本(text)，X为待匹配的模式(pattern)

解决字符串匹配问题的经典范式包含两个步骤:

模式分析 Pattern Analysis
- 只对模式进行处理
- 将模式的部分信息压缩并保存在一个表中
文本分析 Text Analysis
- 用模式以及模式分析中产生的信息对文本进行处理

文本分析

文本T(1:n)，模式P(1:m)，模式的见证数组WITNESS(1:r)(r = Pi(P) = min{p, $\lceil \frac{m}{2} \rceil$})，P周期为p

文本分析算法：非周期模式
Algorithm 7.2

(Text Analysis, Nonperiodic Pattern)
Input: Three arrays T(1:n), P(1:m), and WITNESS(1:m/2) corressponding to a text, a pattern and a witness function of the pattern, respectively. It is assumed that P is nonperiodic, and m is even.
Output: The array MATCH, indicating all the positions where the pattern occurs in the text.
begin
1. Partition T into $\lceil 2n/m \rceil$(up integer) blocks T_i, each block containing no more than m/2 consecutive characters.
2. For each block T_i, eliminate all but one position as a possible candidate for matching by using a balanced binary tree, where each internal node u contains the index returned by the function duel(i, j), and i and j are the indices stored in the children of u.
3. For each candidate position, verify where P occurs at that position in T by using the brute-force algorithm.
end

T = O(log m) W = O(n)

e.g:
T = babaababaaba, P = abaab
见证数组WITNESS = (0,1,2)
分块: $T_1 = bab,T_2 = aab,T_3 = aba,T_4 = aba$, 对每个块并发处理
第一轮对决(WINESS(2) = 1): Duel(1,2)=2, Duel(4,5)=5, Duel(7,8)=7, Duel(10,11)=10
第二轮对决: Duel(2,3)=2, Duel(5,6)=5,
Duel(7,9)=7, Duel(10,12)=10
对T的位置2, 5, 7, 10用暴力算法尝试匹配: 2和7能匹配

文本分析算法：周期模式

基本思想：

用周期模式的最大非周期前缀来作为初始模式
查找文本中初始模式的位置
要找出文本中周期模式的位置，可以用初始模式的位置来筛选
只有周期重复次数足够的位置才能在筛选中存活下来
周期重复次数可以用分段后缀和算法来并行计算
P(1:m) = $u^kv$, v是u的一个前缀(可空), |u| = p是P的周期, 见证数组WITNESS长度为p
将P(1:2p-1)作为一个新的模式(P的最大非周期前缀)在T中在O(logp)时间内用O(n)操作找出所有匹配
检查$u^2v$是否出现在(T的)位置i, i+p, …, i+(k-2)p(k是P(1:m)=$u^kv$中的k)

Algorithm 7.3

(Text Analysis, Periodic Case)
Input: Three arrays T(1:n), P(1:m), and WITNESS(1:p) representing respectively a text, a pattern, and a WITNESS array corresponding to P(1:2p-1). We also know that the pattern is periodic and has period p.
Output: The array MATCH identifying all the positions in the text where the pattern occurs.
begin
1. Use Algorithm 7.2 to determine the occurrences of the prefix P(1:2p-1) in T.
2. For each occurence of P(1:2p-1) in T(at position i), find whether u^2v occurs at i, where u = P(1:p), v = P(kp+1:m, and k = $\lfloor m/p \rfloor$(lower integer)). If it does, set M(i) := 1. For all the other positions i such that 1 <= i <= n, set M(i) := 0. 
3. For each i such that 1 <= i <= p, let S_i be the subarray of M consisting of the bits (M(i), M(i+p), M(i+2p), ...). For each position l of S_i that contains a 1, set C(i, l) := 1 if there are at least k-1 consecutive 1's starting at this position. For all the remaining positions l of S_i, set C(i, l) := 0.
4. For each 1 <= j <= n - m + 1, let j = i + lp, where 1 <= i <= p and l >= 0. Then, set MATCH(j) := C(i, l+1).
end

T = O(logm), W = O(n)
e.g. (example 7.8)
T = babababababaabab = (ba)^6(ab)^2, P = abababa = (ab)^3a
p = 2, P(1:2p-1) = P(1:3) = aba
T中出现P(1:3)的位置2, 4, 6, 8, 10, 13
(ab)^2a在T中出现位置2, 4, 6, 8; M = (0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
S_1 = (0, 0, 0, 0, 0, 0, 0, 0), S_2 = (1, 1, 1, 1, 0, 0, 0, 0)
C(1, l) = 0(1 <= l <= 8)
C(2, 1) = C(2, 2) = C(2, 3) = 1, C(2, l) = 0(4 <= l <= 8)
MATCH(2) := C(2, 1) = 1, MATCH(4) := C(2, 2) = 1, MATCH(6) := C(2, 3) = 1, MATCH(i) := 0 otherwise.
P 出现在T的2, 4, 6位置

模式分析算法

目标：计算见证数组WITNESS
WITNESS(i)(1 <= i <= r, r = min{p, $\lceilm/2\rceil$}) = k != 0, k满足P(i + k - 1)
P(1:i-1)不是模式的周期
基本思想:
- 迭代算法，每一迭代令模式的短子模式长度翻倍
- 从模式的短子模式开始，通过短子模式来计算长子模式见证数组的部分元素
- 剩余的元素在后面的迭代中计算

简单算法

lemma 7.7:
Suppose that we are given WITNESS(2:t) of the string P(1:m), for some t < min(p, m/2), where p is the period of P. Then, given any pair of distinct indices i and j, such that t < i, j <= $\lceil\frac{m}{2}\rceil$ and |j - i| < t, we can compute at least one of WITNESS(i) or WITNESS(j) in O(1) sequential time.

Algorithm 7.5

(Simple Pattern Analysis)
Input: A pattern P(1:m), where m = 2^s.
Output: The array WITNESS(1:r), where r = min(p, 2^(s-1)), and p is the period of P.
begin
1. for 1 <= i <= m/2 pardo
    Set WITNESS(i) := 0
2. for i = 1 to s - 1 do
    2.1 Let j != 1 be the candidate in the first i-block. Compute WITNESS(j) using the brute force algorithm.
    2.2 if WITNESS(j) = 0 then exit
    2.3 for all i_blocks B(except the first) pardo
        Apply duel(k, s), where k and s are the candidates in B.
end

T = O(log m) W = O(m log m)

…