字符串匹配的并行算法

(并行与分布式计算六)

字符串的预备知识

字符串: 有限长度的字符序列

字符表(或字母表 alphabet)

字符串Y的长度: |Y|

两条字符串X和Y的连接为 XY

字符串Y中第i个字符: Y(i)(1 <= i <= |Y|)

字符串Y的子字符串: Y(i:j)(1 <= i <= j <= |Y|)

字符串Y的前缀(prefix): Y(1:i)(1 <= i <= m)

字符串Y的后缀(suffix): Y(j:m)(1 <= j <= m)

字符串匹配问题及其串行算法

字符串X在字符串Y中出现,是指存在位置i,使得X(j)=Y(i+j-1)对所有1 <= j <= |X|成立

  • 字符串X在位置i与Y匹配(match)

KMP算法:O(|X|+|Y|)复杂度

字符串的周期性(periodicities in string)

  • 如果$Y = X^kX’$,其中$X’$是$X$的一个前缀,则$|X|$是$Y$的一个周期

  • 如果p是Y的最短周期,则简称p是Y的周期(没有加“一个”在前面): The period(周期),a period(一个周期)

  • 如果Y的周期不大于|Y|/2,就称Y是周期的(periodic)

  • e.g: abcaabcab不是周期的(最短周期是abcaabc),abcabcab是周期的(周期p = 3)

字符串的周期引理:

如果p和q都是Y的一个周期且p+q <= |Y|, 则p和q的最大公约数也是Y的一个周期

证明:

  • 由p和q都是Y的一个周期,可得|p-q|也是Y的一个周期: Y(i) = Y(i-p) = Y(i-q) ==> Y(j) = Y(j+p-q)
  • 由最大公约数的欧几里德算法gcd(p, q) = gcd(p-q, q)可以得到结论

推论1:如果X的周期为p且X在Y的i和j位置出现,则|i-j| >= p
推论2:如果X的周期为p且在Y的i和i+d(d <= m-p)位置出现, 则d是p的倍数; 如果0 <= d <= |X|/2, 则X也在i+kp出现(其中k是满足kp <= d的整数)

(区别)见证数组(Witness Array)

  • 长度为m,周期为p的字符串Y,定义Pi(Y) = min{p, $\lceil \frac{m}{2} \rceil$}
  • 见证函数Wit(i)定义:
  • Wit(1) = 0
  • Wit(i) = k($2 \leq i \leq Pi(Y)$), k是满足$Y(k) \not= Y(i + k - 1)$的一个下标

  • 见证函数数组WITNESS(1:r)(r = Pi(Y))

  • e.g: abcaabcab,见证数组WITNESS = (0,3,2,2,5)

  • e.g: abcabcab,见证数组WITNESS = (0,3,2)

(区别)见证数组的基本思想:

  • 如果匹配两个位置的子字符串时,可以由某个特定的位置排除这两个位置的其中一个的匹配性,就可以减少字符串匹配的复杂度
  • 这个特定的位置,称为(区别)见证位置(如果在见证位置不同,则不可能匹配)
  • 见证位置与对决

Z字符串为文本字符串,Y为模式;利用见证位置排除掉Z中两处潜在的匹配中的一处的过程称为对决:

witness.jpg

对决算法 Algorithm 7.1

1
2
3
4
5
6
7
8
9
10
11
12
(Duel(i, j))
Input:
(1)A string Z of length n;
(2)a WITNESS array of another string Y of length m <= n;
(3)two indices i and j, where 1 <= i < j <= n - m, such that j - i < Pi(Y) = min{p, $\lceil \frac{m}{2} \rceil$} and p is the period of Y.
Output: One of i or j(that may match); string Y cannot occur in Z at the eliminated position.
begin
1. Set k := WITNESS(j - i + 1)
2. if Z(j + k - 1) != Y(k) then return(i)
else return(j)
end

  • T = O(1), W = O(1)

e.g. (example 7.5)

  • 字符串Y = abcabcab, 见证数组WITNESS = (0, 3, 2)
  • 字符串Z = abcaabcabaa
  • 考虑Z的位置i = 5和j = 7
  • WITNESS(7 - 5 + 1) = 2, Y(2) = b != Z(8) = a, 对决算法输出下标i = 5(可能匹配的位置)

字符串匹配(string matching)的两种分析

string-matching problem: 在Y中找X的所有出现的位置

  • 称Y为待匹配的文本(text),X为待匹配的模式(pattern)

解决字符串匹配问题的经典范式包含两个步骤:

  • 模式分析 Pattern Analysis
    • 只对模式进行处理
    • 将模式的部分信息压缩并保存在一个表中
  • 文本分析 Text Analysis
    • 用模式以及模式分析中产生的信息对文本进行处理

文本分析

  • 文本T(1:n),模式P(1:m),模式的见证数组WITNESS(1:r)(r = Pi(P) = min{p, $\lceil \frac{m}{2} \rceil$}),P周期为p

文本分析算法:非周期模式
Algorithm 7.2

1
2
3
4
5
6
7
8
9
(Text Analysis, Nonperiodic Pattern)
Input: Three arrays T(1:n), P(1:m), and WITNESS(1:m/2) corressponding to a text, a pattern and a witness function of the pattern, respectively. It is assumed that P is nonperiodic, and m is even.
Output: The array MATCH, indicating all the positions where the pattern occurs in the text.
begin
1. Partition T into $\lceil 2n/m \rceil$(up integer) blocks T_i, each block containing no more than m/2 consecutive characters.
2. For each block T_i, eliminate all but one position as a possible candidate for matching by using a balanced binary tree, where each internal node u contains the index returned by the function duel(i, j), and i and j are the indices stored in the children of u.
3. For each candidate position, verify where P occurs at that position in T by using the brute-force algorithm.
end

T = O(log m) W = O(n)

  • e.g:
  • T = babaababaaba, P = abaab
  • 见证数组WITNESS = (0,1,2)
  • 分块: $T_1 = bab,T_2 = aab,T_3 = aba,T_4 = aba$, 对每个块并发处理
  • 第一轮对决(WINESS(2) = 1): Duel(1,2)=2, Duel(4,5)=5, Duel(7,8)=7, Duel(10,11)=10
  • 第二轮对决: Duel(2,3)=2, Duel(5,6)=5,
    Duel(7,9)=7, Duel(10,12)=10
  • 对T的位置2, 5, 7, 10用暴力算法尝试匹配: 2和7能匹配

文本分析算法:周期模式

基本思想:

  • 用周期模式的最大非周期前缀来作为初始模式
  • 查找文本中初始模式的位置
  • 要找出文本中周期模式的位置,可以用初始模式的位置来筛选
  • 只有周期重复次数足够的位置才能在筛选中存活下来
  • 周期重复次数可以用分段后缀和算法来并行计算

  • P(1:m) = $u^kv$, v是u的一个前缀(可空), |u| = p是P的周期, 见证数组WITNESS长度为p

  • 将P(1:2p-1)作为一个新的模式(P的最大非周期前缀)在T中在O(logp)时间内用O(n)操作找出所有匹配

  • 检查$u^2v$是否出现在(T的)位置i, i+p, …, i+(k-2)p(k是P(1:m)=$u^kv$中的k)

Algorithm 7.3

1
2
3
4
5
6
7
8
9
10
(Text Analysis, Periodic Case)
Input: Three arrays T(1:n), P(1:m), and WITNESS(1:p) representing respectively a text, a pattern, and a WITNESS array corresponding to P(1:2p-1). We also know that the pattern is periodic and has period p.
Output: The array MATCH identifying all the positions in the text where the pattern occurs.
begin
1. Use Algorithm 7.2 to determine the occurrences of the prefix P(1:2p-1) in T.
2. For each occurence of P(1:2p-1) in T(at position i), find whether u^2v occurs at i, where u = P(1:p), v = P(kp+1:m, and k = $\lfloor m/p \rfloor$(lower integer)). If it does, set M(i) := 1. For all the other positions i such that 1 <= i <= n, set M(i) := 0.
3. For each i such that 1 <= i <= p, let S_i be the subarray of M consisting of the bits (M(i), M(i+p), M(i+2p), ...). For each position l of S_i that contains a 1, set C(i, l) := 1 if there are at least k-1 consecutive 1's starting at this position. For all the remaining positions l of S_i, set C(i, l) := 0.
4. For each 1 <= j <= n - m + 1, let j = i + lp, where 1 <= i <= p and l >= 0. Then, set MATCH(j) := C(i, l+1).
end

  • T = O(logm), W = O(n)

  • e.g. (example 7.8)

  • T = babababababaabab = (ba)^6(ab)^2, P = abababa = (ab)^3a
  • p = 2, P(1:2p-1) = P(1:3) = aba
  • T中出现P(1:3)的位置2, 4, 6, 8, 10, 13
  • (ab)^2a在T中出现位置2, 4, 6, 8; M = (0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
  • S_1 = (0, 0, 0, 0, 0, 0, 0, 0), S_2 = (1, 1, 1, 1, 0, 0, 0, 0)
  • C(1, l) = 0(1 <= l <= 8)
  • C(2, 1) = C(2, 2) = C(2, 3) = 1, C(2, l) = 0(4 <= l <= 8)
  • MATCH(2) := C(2, 1) = 1, MATCH(4) := C(2, 2) = 1, MATCH(6) := C(2, 3) = 1, MATCH(i) := 0 otherwise.
  • P 出现在T的2, 4, 6位置

模式分析算法

  • 目标:计算见证数组WITNESS
  • WITNESS(i)(1 <= i <= r, r = min{p, $\lceilm/2\rceil$}) = k != 0, k满足P(i + k - 1)
  • P(1:i-1)不是模式的周期

  • 基本思想:

    • 迭代算法,每一迭代令模式的短子模式长度翻倍
    • 从模式的短子模式开始,通过短子模式来计算长子模式见证数组的部分元素
    • 剩余的元素在后面的迭代中计算

简单算法

lemma 7.7:
Suppose that we are given WITNESS(2:t) of the string P(1:m), for some t < min(p, m/2), where p is the period of P. Then, given any pair of distinct indices i and j, such that t < i, j <= $\lceil\frac{m}{2}\rceil$ and |j - i| < t, we can compute at least one of WITNESS(i) or WITNESS(j) in O(1) sequential time.

pattern_analysis.jpg

Algorithm 7.5

1
2
3
4
5
6
7
8
9
10
11
12
13
(Simple Pattern Analysis)
Input: A pattern P(1:m), where m = 2^s.
Output: The array WITNESS(1:r), where r = min(p, 2^(s-1)), and p is the period of P.
begin
1. for 1 <= i <= m/2 pardo
Set WITNESS(i) := 0
2. for i = 1 to s - 1 do
2.1 Let j != 1 be the candidate in the first i-block. Compute WITNESS(j) using the brute force algorithm.
2.2 if WITNESS(j) = 0 then exit
2.3 for all i_blocks B(except the first) pardo
Apply duel(k, s), where k and s are the candidates in B.
end

  • T = O(log m) W = O(m log m)