Knowledge Graphs

Exercise 4.1 - String Similarity
Two popular methods to calculate similarity between two strings are edit distance (also known as Levenshtein distance) and the Jaccard measure.

We can define the Levenshtein distance between two strings a, b (of length |a| and |b| respectively), given by lev(a,b) as follows:

  • lev(a,b) = a if |b| = 0
  • lev(a,b) = b if |a| = 0
  • lev(tail(a),tail(b)), if a[0] = b[0]
  • 1 + min{lev(tail(a),b), lev(a,tail(b)), lev(tail(a),tail(b))} otherwise.
where the tail of some string x is a string of all but the first character of x, and x[n] is the nth character of the string x, starting with character 0.

We can define the Jaccard measure between two strings a, b as the size of the intersection divided by the size of the union between the two.

J(a,b) = |a ∩ b| / |a ∪ b|

a. Given the strings "JP Morgan Chase" and "JPMC Corporation", what is the edit distance between the two?
b. Given the strings "JP Morgan Chase" and "JPMC Corporation", what is the Jaccard measure between the two?
c. Given three strings: x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp, which of these strings would be equated by the edit distance methods?
d. Given three strings: x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp, which of these strings will be equated by the Jaccard measure?
e. Given three strings: x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp, what intuition you would use to ensure that x is equated to z?