抽屉里的口琴

放在那里已经两个月了。

我是一个艺术和体育方面的白痴,和音乐扯上关系也只是在小学时。那时有一门课叫《音乐》,课本16开彩色印刷,是小学所有课本中最漂亮的,以至于每次发新书都会把它包上挂历纸的书皮,然后用紫色(那时我最喜欢的颜色)彩笔在上面歪歪扭扭的写下“音乐”两个字。然后在五年级时,音乐课本上出现了两种乐器:竖笛和口琴。

老师自然是教竖笛的(确切的说是卖竖笛)。米白色的笛子加上墨绿色的布套,斜躺在书包里露出个头,每天小小子和小姑娘就这样背着乐器往返于学校和家。但对于我来说笛子有一个致命的缺陷(所以哥是音乐白痴),笛子有六个孔,最下面的两个孔间距比其他孔都要大,导致俺的右手中指和无名指无法同时按住这两个孔,因此在吹奏so和la时右手的动作总是非常不自然(就是难受!)。经过一段时间的挣扎,还是放弃了笛子。

然后就转向了口琴,因为家中正好有一支上海牌28孔双音口琴。相对于笛子,这个貌似更简单,只要记住相对位置就好了。虽然一次只含住1个孔,但相对于笛子至少不用让手指受罪了。这就是和口琴的结缘。

然后是小学六年级,学计算机时先搞了GWBASIC,然后是Play命令,于是为了把音乐书上的歌用电脑放出来,学会了简谱。但也止步于此,再也没有碰过五线谱。现在看到do re mi fa so la si还是会下意识地想到 C D E F G A B。

然后就是初中,只在某一年的新年联欢会上吹过一首李叔同的《送别》。爸买了个敦煌双音口琴送我。

高中,从来没碰过。

本科,玩了Falcom的《空之轨迹》系列。为了重现《星之所在》再度拿起口琴,有一个音高始终吹不出来也就作罢了,唯一的回忆:午后暖暖的阳光,空无一人的交大北区河岸边,背完英语课文掏出口琴吹一曲。

研究生,amazon上$7入手一布鲁斯口琴,恰巧孙娜也买了一支布鲁斯。之后忙于课业科研我一直没碰过,不知道她有没有吹出完整的《星之所在》。

口琴放在那里已经两个多月了。

刚来美国时偶尔还会流着泪醒来,起床后有心痛的感觉。

现在似乎渐渐适应了这种见不到你们和你的生活。

但我还是不想0点就睡觉,想着睡前看到你们的QQ头像一个个变成彩色,在商场看到有趣的东西还是会想到“这个送给xx不错”,还是上人人网不上facebook。

anyway,i’m doing good

anyway, i miss u so much

Facebook puzzle 2 – Breathalyzer – Stage 1

先贴题目要求和翻译:

Breathalyzer

To safeguard against the dreaded phenomenon of wall posting while drunk, Facebook is implementing a feature that detects when post content is too garbled to have been done while sober and informs the user that they need to take an online breathalyzer test before being allowed to post.

废话,不翻译。

Unfortunately, there is far too much content for a given set of persons to evaluate by hand. Fortunately, you are a programmer of some repute and can write a program that processes and evaluates wall posts. You realize that such a program would be of great use to society and intend to resolve the problem once and for all. The program you write must compute a score for a body of text, returning this score upon completion.

废话,不翻译。

Your program will be given a list of accepted words and run on one wall post at a time. For each word W in the post, you must find word W’ from the list of accepted words such that the number of changes from W to W’ is minimized. It is possible that W is already W’ and thus the number of changes necessary is zero. A change is defined as replacing a single letter with another letter, adding a letter in any position, or removing a letter from any position. The total score for the wall post is the minimum number of changes necessary to make all words in the post acceptable.

题目提供了一个单词表(从这里下载) 。对于待分析字符串中的每个词,你的程序要从单词表中找到最接近的词并计算这两个词的“距离”。两个字符串的“距离”定义为从字符串1变成字符串2所需要的最小操作数,允许的操作包括:

  1. 将一个字母替换为另一个字母
  2. 删除一个字母
  3. 插入一个字母
以”kitten” -> “sitting“为例,这两个词的”距离“为3,因为需要至少三步:
  1. kitten -> sitten 替代
  2. sitten -> sittin 替代
  3. sittin -> sitting 插入

Input Specification

Your program must take a single string argument, representing the file name containing the wall post to analyze. In addition, your program must open up and read the accepted word list from the following static path location:

/var/tmp/twl06.txt

你的程序的参数为: 单词表文件 待分析文本文件

For testing purposes, you may download and examine the accepted word list here. When submitting your code, you do not need to include this file, as it is already present on the machine.

The input file consists entirely of lower case letters and space characters. You are guaranteed that the input file will start with a lower case letter, and that all words are separated by at least one space character. The file may or may not end with a new line character.

Example input file:

tihs sententcnes iss nout varrry goud

You are guaranteed that your program will run against well formed input files and that the accepted word list is identical to the one provided for testing.

Output Specification

Your program must print out the minimum number of changes necessary to turn all words in the input wall post into accepted words as defined by the word list file. Words may not be joined together, or separated into multiple words. A change in a word is defined as one of the following:

  1. Replacing any single letter with another letter.
  2. Adding a single letter in any position.
  3. Removing any single letter.
  1. 替换一个字母
  2. 添加一个字母
  3. 删除一个字母

This score must be printed out as an integer and followed by a single new line.

Example Output (newline after number):

8

那么我们的程序的基本流程为:取一个字符串 -> 和单词表中的每一个词计算距离 -> 找到最小值 -> 取下一个字符串

首先需要留意的就是那个”距离“(原文为“changes”),这个函数是整个程序的核心,因为其他的步骤无非就是字符串分析和查找最小值,没什么难度。

这个”距离“,即两个字符串由一个转成另一个所需的最少编辑操作次数(接受的操作定义见前述),称为”编辑距离“,或”Levenshtein Distance“(后面以LD代替),由前苏联科学家Vladimir Levenshtein于1965年提出。中文wiki对这个算法的描述甚少,只有一个伪代码描述。这里我把英文wiki的描述在这里翻译一下,把整个LD的计算过程理顺一些。

又是动态规划问题,依然使用自底向上的思路解决。

首先考虑空串到空串的情况,不需要任何操作,LD=0。

直接拿矩阵说话:

k i t t e n
0 1 2 3 4 5 6
s 1 1 2 3 4 5 6
i 2 2 1 2 3 4 5
t 3 3 2 1 2 3 4
t 4 4 3 2 1 2 3
i 5 5 4 3 2 2 3
n 6 6 5 4 3 3 2
g 7 7 6 5 4 4 3

distance_matrix[x][y]的值可能有三种来源:

  1. distance_matrix[x-1][y] + 1, 例如: 已知LD(s,k)=1 那么 LD(si,k)= LD(s,k) +1 (删除)
  2. distance_matrix[x][y-1] + 1,例如:  已知LD(s,k)=1 那么LD(s, ki) = LD(s,k) +1 (添加)
  3. distance_matrix[x-1][y-1] +1 ,例如: 已知LD(sitt, kitte) = 2 那么 LD(sitti, kitten) = LD(sitt, kitte) + 1 (替换)

取三个值中的最小值即可。

依此类推,最后得到完整的矩阵。矩阵右下角的值就是需要的LD(str1, str2),算法的时间复杂度为O(n2)

def ld(s, t):
    """Find the Levenshtein distance between two strings."""
    m = len(s)
    n = len(t)
    if m > n:
        s, t = t, s
        m, n = n, m
    if n == 0:
        return m

    # d is a matrix with m+1 rows and n+1 columns
    d = [[0] * (n+1) for x in range(m+1)]

    for i in range(m+1): # from 0 to m
       d[i][0] = i
    for j in range(n+1): # from 0 to n
       d[0][j]=j
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                d[i+1][j+1] = d[i][j]
            else:
                deletion = d[i][j+1] + 1
                insertion = d[i+1][j] + 1
                substitution = d[i][j] + 1
                d[i+1][j+1] = min(insertion, deletion, substitution)
    return d[m][n]

算法不复杂,但该算法的时间复杂度却不能再降了,因此在单词表中单词量很大且单词长度很长的情况下,该算法慢的要死。题目提供的单词表一共178691的单词,待分析的单词6个,那么需要计算178691*6次LD,对于一个O(n2)的算法来说无法承受。事实上在我的电脑上(Core i5 2.6GHz 逻辑四核)该程序耗时1分钟,只把一个核跑到100%CPU占用。而且这个程序非常适合改造成map-reduce的结构。一个单词和单词表中所有单词的距离的计算可以map到数个计算线程上,每个线程汇报一个线程最小值,最后这几个线程reduce找到最小值。

字符串处理使用python是一个好主意,但是对于并行计算,python的threading机制并不是一个好的选择。stackless python没有试过。我想尝试的是原生并发的erlang。《erlang核心编程》的第20章第4节有一个使用map-reduce的文件内容索引器的范例,可以改造成本题的解答。

然而erlang不适合文件和字符串的parsing,太tmd烂了,而且据某些大牛表示:同是functinal programing (FP) language, Haskell显得更elegant,仿佛若干年前Perl和Python的关系。然而谁知道未来谁会是FP的赢家。erlang不好学,而且字符串parsing太tmd恶心了,为了提高效率还得转成binary,好多算法还要自己重新写。。。。

所以本文为stage 1,什么时候写stage 2我也不知道,等我把erlang的字符串处理搞掉再说。

唉,喜欢Python就是因为我不用在造轮子的问题上浪费时间。

Facebook puzzle 1 – Gattaca

Facebook puzzle是facebook放出的一些算法应用题,传说中如果解出了所有题目就可以被录用。按照目前的就业形势来看,google的一群大佬都想着往facebook跑,facebook成了IT就业的新风向标。俺一介算法弱人做做题学学算法也是好的,遂有此系列。题录列表请访问 http://www.facebook.com/careers/puzzles.php

Gattaca

You have a DNA string that you wish to analyze. Of particular interest is which intervals of the string represent individual genes. You have a number of “gene predictions”, each of which assigns a score to an interval within the DNA string, and you want to find the subset of predictions such that the total score is maximized while avoiding overlaps. A gene prediction is a triple of the form (start, stop, score). start is the zero-based index of the first character in the DNA string contained in the gene. stop is the index of the last character contained in the gene. score is the score for the gene.

有一段字符串待分析,特定的子串被设定了用整数表示的分数。分析的目的是从提供的字符串中检出多个互不相交的子串且保证这些子串的分数最大。

Input Specification

Your program will be passed the name of an input file on the command line. The contents of that file are as follows.

你的程序接受的参数应该只有文件名。该文件的内容结构如下:

The first line of the input contains only n, the length of the DNA string you will be given.

第一行是一个数字,代表字符串总长度。

The next ceiling(n / 80) lines each contain string of length 80 (or n % 80 for the last line) containing only the characters ‘A’, ‘C’, ‘G’, and ‘T’. Concatenate these lines to get the entire DNA strand.

接下来的几行是字符串,每一行最多80个字符。你需要把这几行连起来构成完整的待分析字符串(虽然用不到。。。)

The next line contains only g, the number of gene predictions you will be given.

这一行只有一个数字,代表待提取子串的数量。

The next g lines each contain a whitespace-delimited triple of integers of the form

<start> <stop> <score>

representing a single gene prediction. No gene predictions will exceed the bounds of the DNA string or be malformed (start is non-negative and no more than stop, stop never exceeds n – 1).

一行由三个由空格分隔的数组成,start代表字串的起始位置,stop代表终止位置,score代表对应的分数

Example Input:

100
GAACTATCGCCCGTGCGCATCGCCCGTCCGACCGGCCGTAAGTCTATCTCCCGAGCGGGCGCCCGATCTCAAGTGCACCT
CACGGCCTCACGACCGTGAG
8
43  70  27
3   18  24
65  99  45
20  39  26
45  74  26
10  28  20
78  97  23
0   9   22

Output Specification

Print to standard out the score of the best possible subset of the gene predictions you are given such that no single index in the DNA string is contained in more than one gene prediction, followed by a newline. The total score is simply the sum of the scores of the gene predictions included in your final result.

When constructing your output, you may only consider genes exactly as they are described in the input. If you find the contents of a gene replicated elsewhere in the DNA string, you are not allowed to treat the second copy as a viable gene. Your solution must be fast and efficient to be considered correct by the robot.

Example Output:

100

总而言之,find the maximum sum in a sequence of non-overlapping intervals,找出总分值最高且位置互不相交的字串集合。题目只要输出最后的最大总分值即可。

首先这是一个典型的动态规划问题。解决动态规划问题的基本思想是将原问题分解为相似的子问题,在求解的过程中通过子问题的解求出原问题的解。对于本题来说,提供了8个子串,要找出具有最大分数且互不重叠的子串集合,可以从子串较少的情况入手,例如只有一个子串。显然这种情况下,最大分数就是该字符串的分数。如果再增加一个子串,为了便于分析我们将所有的子串按照stop值升序排序。那么存在两种情况:

  1. str1和str2不重叠,即str1.stop < str2.start,那么best_score = str1.score + str2.score
  2. str1和str2重叠,即str1.stop > str2.start(注意我们已经将子串按照stop值排序,所以一定有str1.stop<str2.stop),那么best_score = max(str1.score, str2.score)

best[i] 代表每次增加子串得到的最大分数,设按照stop排序的子串集合为 g[1], ..., g[i].

  • 首先 best[0] = 0. 还没有添加任何子串,自然分数为0
  • 对于任意 1 <= k <= N, 有
    • best[k] = max( best[k-1], best[j] + g[k].score ),
      • 其中 j 是满足 g[j].stop < g[k].start的最大值 (j 可以是0)

我们需要的结果就是best[N]

题外话:

facebook puzzle限定了解决问题的语言,可用的语言为:

  • GNU C/C++ 4.2.3
  • Ericsson Erlang 5.5.5
  • GHC Haskell 6.8.2
  • Sun Java 1.5.0_15
  • INRIA OCaml 3.10.0
  • Perl 5.8.8
  • PHP 5.2.4
  • Python 2.5.2
  • Ruby 1.8.6

有趣的是函数语言也占了一席之地。erlang和haskell,一个是目前工业界的新星,一个是学术界的玩具。未来毕竟是分布式计算的天下。学好一门functional programming language还是必要的。关于erlang在facebook puzzle中的小试牛刀在以后的文章中在写。目前我还要花点时间把用erlang解决Breathalyzer的思路理顺一下。那个题目的算法不是要点,要点在于使用分布式计算提高大规模数据处理的速度,而且俺的Python实现实在是太慢了。等整理好了再贴上来。

附俺写的Python源码:

import sys,bisect

class gene(object):
    def __init__(self, start=0, stop=0, score=0):
        self.start = start
        self.stop = stop
        self.score = score

def parse(filename):
    fd = open(filename,'r')
    l = fd.readlines()
    fd.close()
    lines_of_gene = int(l[0]) / 80 + 1

    lines_of_prediction = int(l[lines_of_gene+1])
    start_of_prediction = lines_of_gene + 2
    end_of_prediction   = lines_of_prediction + start_of_prediction 

    prediction_list = []
    for i in range(start_of_prediction, end_of_prediction):
        tmp = l[i].strip().split()
        tmp_gene = gene(int(tmp[0]),int(tmp[1]),int(tmp[2]))
        prediction_list.append(tmp_gene)

    return prediction_list

def calc(l):
    g = sorted(l, key = lambda x: x.stop) #sort by stop
    length = len(g)
    best = [0]*(length+1)

    stop_list = []
    for entry in g:
        stop_list.append(entry.stop)

    for k in range(length):
        j = bisect.bisect_left(stop_list,g[k].start,0,k)
        best[k+1] = max(best[k], best[j] + g[k].score)
    return best[-1]

def main():
    filename = 'data' if len(sys.argv)<2 else sys.argv[1]
    l = parse(filename)
    best = calc(l)
    print best

if __name__ == '__main__':
    main()