0x20字符串-(1)-字符串Hash - Be Grinder,for Better

字符串Hash

字符串Hash有很多种方式，最常用的一种就是字符串前缀哈希。

经验之谈：基本上，除了循环节只能用KMP来做，其它都可以用字符串hash来做

字符串前缀哈希

算法实现

字符串前缀哈希是一种常用的字符串匹配算法，它可以在 $O(1)$ 内计算出一个字符串的前缀子串的哈希值，从而实现快速匹配，其原理如下：

假设我们有一个字符串 $s$ ，其长度为 $n$ ，哈希函数为 $h(x)$ ，
取一固定值 $P$ ，把字符串看作 $P$ 进制数，取一固定值 $M$ ，求出该 $P$ 进制数对 $M$ 的余数，作为字符串Hash值：

$h(s[0 \ldots k])=(\sum_{i=0}^k s[i] \times P^i )\ mod \ M$

其中 $P$ 通常取一个大于 $n$ 的质数，常用质数有 $P=131$ 或 $P=13331$ ，此时字符串哈希冲突概率极低。
其中 $M$ 通常取 $M = 2^{64}$ ，即直接使用 unsigned long long 类型存储这个Hash值，溢出时自动对 $2^{64}$ 取模。

个人认为 $P$ 取131是因为 $ASCII$ 码表一共有128个字符，所以找了一个比128大的质数。

对于该字符串 $s$ 的任意子串 $s[l, r]$ ，其Hash值可以在 $O(1)$ 内求出：

$h(s[l, r]) = s[r] - s[l -1] * p^{r - l + 1}$

总的来说，比较像前缀和的思想，因此叫字符串前缀哈希。

#include <string>
#include <vector>
#include <iostream>

using namespace std;

typedef unsigned long long ULL;

const int N = 1e6 + 7;
const int P = 131; // const int P = 13331;

vector<ULL> p(N); // P次幂

vector<ULL> get_prefix_hash(const string& s) {
	int n = s.size();
	vector<ULL> h(n + 1);
	h[0] = 0;
    p[0] = 1;
	for (int i = 1; i <= n; i++) {
		h[i] = h[i - 1] * P + s[i - 1];
        p[i] = p[i - 1] * P;
	}
	return h;
}

ULL get_substr_hash(const vector<ULL>& h, int l, int r) {
	return h[r] - h[l - 1] * p[r - l + 1];
}

int main() {
	string s = "violet violence"; 
	vector<ULL> h = get_prefix_hash(s);

	cout << "Prefix Hash of " << s << ":" << endl;	
	for (int i = 0; i < s.size(); i ++) {
		cout << s.substr(0, i + 1) << " : " << h[i + 1] << endl;
	}

	cout << endl;
	// 计算violet的'viole' 与 violence的'viole'的Hash值是否相同
	cout << get_substr_hash(h, 1, 5) << " and " << get_substr_hash(h, 8, 12) << endl;

	return 0;
}

复杂度分析

时间复杂度： ${O(n)}$ ，预处理前缀和需要 $O(n)$ ，查询字串的时间复杂度是常数级 $O(1)$ 。
空间复杂度： ${O(N)}$ ，要花 $O(n + 1)$ 的空间存储Hash前缀和，花 $O(N)$ 的空间存储P的 $N$ 次幂。

求 $P$ 的次幂也可以用一个快速幂fast_pow()，用时间换空间，但这样每次查询就是 $O(\log(r - l + 1))$ 的了。

总之，不要使用<cmath>中自带的pow()，其返回类型是double，而浮点型转为整型会有精度损失。

#include <string>
#include <vector>
#include <iostream>

using namespace std;

typedef unsigned long long ULL;

const int P = 131; // const int P = 13331;

ULL fast_pow(ULL x, ULL n) {
    ULL res = 1;
    for (; n; n >>= 1) {
        if (n & 1) {
            res = res * x;
        }
        x = x * x;
    }  
    return res;
}

vector<ULL> get_prefix_hash(const string& s) {
	int n = s.size();
	vector<ULL> h(n + 1);
	h[0] = 0;
	for (int i = 1; i <= n; i++) {
		h[i] = h[i - 1] * P + s[i - 1];
	}
	return h;
}

ULL get_substr_hash(const vector<ULL>& h, int l, int r) {
	return h[r] - h[l - 1] * fast_pow(P, r - l + 1); // 不要使用(ULL)pow(P, r - l + 1); 
}

int main() {
	string s = "violet violence"; 
	vector<ULL> h = get_prefix_hash(s);

	cout << "Prefix Hash of " << s << ":" << endl;	
	for (int i = 0; i < s.size(); i ++) {
		cout << s.substr(0, i + 1) << " : " << h[i + 1] << endl;
	}

	cout << endl;
	// 计算violet的'viole' 与 violence的'viole'的Hash值是否相同
	cout << get_substr_hash(h, 1, 5) << " and " << get_substr_hash(h, 8, 12) << endl;

	return 0;
}

时间复杂度： ${O(n)}$ ，预处理前缀和需要 $O(n)$ ，查询字串的时间复杂度是常数级 $O(r - l + 1)$ 。
空间复杂度： ${O(n)}$ ，要花 $O(n + 1)$ 的空间存储Hash前缀和。

练习eg：

Acwing-138.兔子与兔子

应用

求两个字符串的最长公共前缀长度

结合二分查找，可以求得两个字符串的最长公共前缀长度

int get_max_common_prefix(string a, string b) {
   	vector<ULL> ha = get_prefix_hash(a);
    vector<ULL> hb = get_prefix_hash(b);
    
    int l = 0, r = min(a.size(), b.size());
    while (l != r + 1) {
        int mid = l + r >> 1;
        if (get_substr_hash(ha, 0, mid - 1) != get_substr_hash(hb, 0, mid - 1)) {
            // 前缀不等，缩小len
            r = mid - 1;
        }
        else {
            // 前缀相等，扩大len
            l = mid + 1;
        }
    }

    return r; // 返回最长公共前缀长度
}

更进一步，比较下一个字符，就能比较这两个字符串的字典序，详见练习题

练习eg：

Acwing-140.后缀数组

最长回文子串

二分答案，判断是否可行时枚举回文中心（对称轴），哈希判断两侧是否相等。需要分别预处理正着和倒着的哈希值。时间复杂度 $O(n \log n)$ 。

这题可以使用manacher算法在 $O(n)$ 的时间内解决

允许k次失配的字符串匹配

模式串匹配：rabin-karp算法

#include <iostream>
#include <vector>
#include <string>

using namespace std;

typedef unsigned long long ULL;

const int P = 131; // 哈希函数的参数，可以是任意质数
const int N = 1e5 + 10; // 字符串的最大长度

int n, m; // n 为文本串长度，m 为模式串长度
char str[N], pattern[N];
ULL h[N], p[N]; // h 存储文本串的哈希值，p 存储 P 的幂次方

ULL get_hash(int l, int r) // 计算字符串 str[l...r] 的哈希值
{
    return h[r] - h[l - 1] * p[r - l + 1];
}

int main()
{
    scanf("%s%s", str + 1, pattern + 1);
    n = strlen(str + 1);
    m = strlen(pattern + 1);
    
    p[0] = 1;
    for (int i = 1; i <= n; i++)
    {
        h[i] = h[i - 1] * P + str[i];
        p[i] = p[i - 1] * P;
    }
    
    ULL pattern_hash = 0;
    for (int i = 1; i <= m; i++)
    {
        pattern_hash = pattern_hash * P + pattern[i];
    }
    
    for (int i = 1; i + m - 1 <= n; i++)
    {
        ULL hash_i = get_hash(i, i + m - 1);
        if (hash_i == pattern_hash)
        {
            printf("Found pattern at position %d\n", i);
        }
    }
    
    return 0;
}

`std::hash<std::string>`

std::hash<std::string> 是 C++ 标准库中提供的一种哈希函数，用于将 std::string 类型的字符串映射为一个无符号整数，可以用于标准库的哈希容器（例如 unordered_map）中作为 key 的类型。

在实现上，std::hash<std::string> 采用了基于 Horner 算法的哈希算法，其基本思路是将字符串视为一个进制数，每次遍历字符串时，将前面已经计算出的结果乘以一个固定的质数（通常选取 31），然后加上当前字符的 $ASCII$ 码值，最后得到的值就是字符串的哈希值，可以说其思想与字符串前缀哈希的思想是差不多的。std::hash<std::string> 的实现方式类似于下面的代码：

std::size_t operator()(const std::string& s) const {
    std::size_t hash = 0;
    for (char c : s) {
        hash = hash * 31 + c;
    }
    return hash;
}

数据结构与算法

Algorithm Template C++

本博客所有文章均采用 CC BY-NC-SA 4.0 协议，禁止商用，转载请注明出处！

0x20字符串-(2)-KMP 上一篇

链表操作下一篇