Recursion: edit distance

- 5 mins

1. Recursion:

Recursive: the algorithm recurs. i.e. the solution is built up of solutions to smaller parts of the same problem.
A key feature is the “base case” where you can solve the problem in a straightforward way, without further reductions.
For example, applying recursion to define a function of factorial of n ($n!$):

def fact(n):
	if n == 0:
		return 1
	elif:
		return n * fact(n-1)
	else:
		return float('inf')

2. Edit distance (Levenstein distance)

Definition

The edit distance or Levenstein distance, $d_{Edit}(s,t)$, between two strings $s$ and $t$ is the minimum number of edit operations required to transform $s$ into $t$.

An edit operation:

Example: “PRETTY” –> “PROTEIN”

So the edit distance for the strings “PRETTY” and “PROTEIN” is 4.

Theorem

It is possible express the edit distance recursively:

Python

Define function

In Python, we can define a script edit_distance.py to implement edit_dist function.

def edit_dist(s,t):
	if len(s) == 0:
		dist = len(t)
	elif len(t) == 0:
		dist = len(s)
	else:
		if s[0] == t[0]:
			dist = edit_dist(s[1:],t[1:])
		else:
			dist = min( edit_dist(s[1:],t) + 1, edit_dist(s,t[1:]) + 1, edit_dist(s[1:],t[1:]) + 1)
	return dist
if __name__ == "__main__":
	s = input()
	t = input()
	print( edit_dist(s,t) )

Automate the testing in bash:

echo -e 'PRETTY\nPROTEIN' | python3 edit_distance.py

|: pipe. Passes the output (stdout) of a previous command to the input (stdin) of the next one, or to the shell.

Randomization and visualization

import string, time, matplotlib.pyplot as plt
from random import choice
from edit_distance import edit_dist

def random_string(len=3, alphabet=string.ascii_uppercase):
# string.ascii_uppercase : 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
	"""generate a random string of a specified length
	from a specified alphabet"""
	return "".join([ choice(alphabet) for _ in range(len)])
# choice(alphabet): randomly choose a character from alphabet
# [ choice(alphabet) for _ in range(len)]: create a list of random characters with the same length as len
# https://stackoverflow.com/questions/5893163/what-is-the-purpose-of-the-single-underscore-variable-in-python
# "".join([ choice(alphabet) for _ in range(len)]) : make a string out of a list, seperated with ""

times, ed_dists = [], []
for n in range(1000):
	s = random_string(len=10,alphabet="ACGT")
	t = random_string(len=10,alphabet="ACGT")
	start = time.time_ns()
	dist = edit_dist(s,t)
	end = time.time_ns()
	times.append(end-start)
	ed_dists.append(dist)
	if n % 10 == 0:
		print(n) # some feedback so we know it is running

plt.scatter(ed_dists, times)
plt.show()

Speed up

Wrap edit_dist with the memoizer lru_cache from the functools library. We can just modify the edit_distance.py file by adding two lines at the beginning:

from functools import lru_cache

@lru_cache(maxsize=None) 
def edit_dist(s, t):
	# your code, as it was before

Reference

Advanced Bash-Scripting Guide: Chapter 3. Special Characters
the single underscore “_” variable in Python?

Zhijian Liu

Zhijian Liu

A foodaholic

comments powered by Disqus
rss facebook twitter github gitlab youtube mail spotify lastfm instagram linkedin google google-plus pinterest medium vimeo stackoverflow reddit quora quora