先贴题目要求和翻译:
Breathalyzer
To safeguard against the dreaded phenomenon of wall posting while drunk, Facebook is implementing a feature that detects when post content is too garbled to have been done while sober and informs the user that they need to take an online breathalyzer test before being allowed to post.
废话,不翻译。
Unfortunately, there is far too much content for a given set of persons to evaluate by hand. Fortunately, you are a programmer of some repute and can write a program that processes and evaluates wall posts. You realize that such a program would be of great use to society and intend to resolve the problem once and for all. The program you write must compute a score for a body of text, returning this score upon completion.
废话,不翻译。
Your program will be given a list of accepted words and run on one wall post at a time. For each word W in the post, you must find word W’ from the list of accepted words such that the number of changes from W to W’ is minimized. It is possible that W is already W’ and thus the number of changes necessary is zero. A change is defined as replacing a single letter with another letter, adding a letter in any position, or removing a letter from any position. The total score for the wall post is the minimum number of changes necessary to make all words in the post acceptable.
题目提供了一个单词表(从这里下载) 。对于待分析字符串中的每个词,你的程序要从单词表中找到最接近的词并计算这两个词的“距离”。两个字符串的“距离”定义为从字符串1变成字符串2所需要的最小操作数,允许的操作包括:
- 将一个字母替换为另一个字母
- 删除一个字母
- 插入一个字母
以”kitten” -> “sitting“为例,这两个词的”距离“为3,因为需要至少三步:
- kitten -> sitten 替代
- sitten -> sittin 替代
- sittin -> sitting 插入
Input Specification
Your program must take a single string argument, representing the file name containing the wall post to analyze. In addition, your program must open up and read the accepted word list from the following static path location:
你的程序的参数为: 单词表文件 待分析文本文件
For testing purposes, you may download and examine the accepted word list here. When submitting your code, you do not need to include this file, as it is already present on the machine.
The input file consists entirely of lower case letters and space characters. You are guaranteed that the input file will start with a lower case letter, and that all words are separated by at least one space character. The file may or may not end with a new line character.
Example input file:
tihs sententcnes iss nout varrry goud
You are guaranteed that your program will run against well formed input files and that the accepted word list is identical to the one provided for testing.
Output Specification
Your program must print out the minimum number of changes necessary to turn all words in the input wall post into accepted words as defined by the word list file. Words may not be joined together, or separated into multiple words. A change in a word is defined as one of the following:
- Replacing any single letter with another letter.
- Adding a single letter in any position.
- Removing any single letter.
- 替换一个字母
- 添加一个字母
- 删除一个字母
This score must be printed out as an integer and followed by a single new line.
Example Output (newline after number):
那么我们的程序的基本流程为:取一个字符串 -> 和单词表中的每一个词计算距离 -> 找到最小值 -> 取下一个字符串
首先需要留意的就是那个”距离“(原文为“changes”),这个函数是整个程序的核心,因为其他的步骤无非就是字符串分析和查找最小值,没什么难度。
这个”距离“,即两个字符串由一个转成另一个所需的最少编辑操作次数(接受的操作定义见前述),称为”编辑距离“,或”Levenshtein Distance“(后面以LD代替),由前苏联科学家Vladimir Levenshtein于1965年提出。中文wiki对这个算法的描述甚少,只有一个伪代码描述。这里我把英文wiki的描述在这里翻译一下,把整个LD的计算过程理顺一些。
又是动态规划问题,依然使用自底向上的思路解决。
首先考虑空串到空串的情况,不需要任何操作,LD=0。
直接拿矩阵说话:
|
|
k |
i |
t |
t |
e |
n |
|
0 |
1 |
2 |
3 |
4 |
5 |
6 |
| s |
1 |
1 |
2 |
3 |
4 |
5 |
6 |
| i |
2 |
2 |
1 |
2 |
3 |
4 |
5 |
| t |
3 |
3 |
2 |
1 |
2 |
3 |
4 |
| t |
4 |
4 |
3 |
2 |
1 |
2 |
3 |
| i |
5 |
5 |
4 |
3 |
2 |
2 |
3 |
| n |
6 |
6 |
5 |
4 |
3 |
3 |
2 |
| g |
7 |
7 |
6 |
5 |
4 |
4 |
3 |
distance_matrix[x][y]的值可能有三种来源:
- distance_matrix[x-1][y] + 1, 例如: 已知LD(s,k)=1 那么 LD(si,k)= LD(s,k) +1 (删除)
- distance_matrix[x][y-1] + 1,例如: 已知LD(s,k)=1 那么LD(s, ki) = LD(s,k) +1 (添加)
- distance_matrix[x-1][y-1] +1 ,例如: 已知LD(sitt, kitte) = 2 那么 LD(sitti, kitten) = LD(sitt, kitte) + 1 (替换)
取三个值中的最小值即可。
依此类推,最后得到完整的矩阵。矩阵右下角的值就是需要的LD(str1, str2),算法的时间复杂度为O(n2)
def ld(s, t):
"""Find the Levenshtein distance between two strings."""
m = len(s)
n = len(t)
if m > n:
s, t = t, s
m, n = n, m
if n == 0:
return m
# d is a matrix with m+1 rows and n+1 columns
d = [[0] * (n+1) for x in range(m+1)]
for i in range(m+1): # from 0 to m
d[i][0] = i
for j in range(n+1): # from 0 to n
d[0][j]=j
for i in range(m):
for j in range(n):
if s[i] == t[j]:
d[i+1][j+1] = d[i][j]
else:
deletion = d[i][j+1] + 1
insertion = d[i+1][j] + 1
substitution = d[i][j] + 1
d[i+1][j+1] = min(insertion, deletion, substitution)
return d[m][n]
算法不复杂,但该算法的时间复杂度却不能再降了,因此在单词表中单词量很大且单词长度很长的情况下,该算法慢的要死。题目提供的单词表一共178691的单词,待分析的单词6个,那么需要计算178691*6次LD,对于一个O(n2)的算法来说无法承受。事实上在我的电脑上(Core i5 2.6GHz 逻辑四核)该程序耗时1分钟,只把一个核跑到100%CPU占用。而且这个程序非常适合改造成map-reduce的结构。一个单词和单词表中所有单词的距离的计算可以map到数个计算线程上,每个线程汇报一个线程最小值,最后这几个线程reduce找到最小值。
字符串处理使用python是一个好主意,但是对于并行计算,python的threading机制并不是一个好的选择。stackless python没有试过。我想尝试的是原生并发的erlang。《erlang核心编程》的第20章第4节有一个使用map-reduce的文件内容索引器的范例,可以改造成本题的解答。
然而erlang不适合文件和字符串的parsing,太tmd烂了,而且据某些大牛表示:同是functinal programing (FP) language, Haskell显得更elegant,仿佛若干年前Perl和Python的关系。然而谁知道未来谁会是FP的赢家。erlang不好学,而且字符串parsing太tmd恶心了,为了提高效率还得转成binary,好多算法还要自己重新写。。。。
所以本文为stage 1,什么时候写stage 2我也不知道,等我把erlang的字符串处理搞掉再说。
唉,喜欢Python就是因为我不用在造轮子的问题上浪费时间。