Pretty Diff - A Diff Algorithm

Understanding how to compare

It is more about the equality than the differences

A good diff algorithm will attempt to identify as much equality as possible. Everything else qualifies as differences. The metric for quality and precision is a smaller count of captured differences. The smaller this number the better, presuming differences aren't escaping undetected.

False negatives, which is allowing differences through without detection, is really bad. This is absolute failure in a diff algorithm. False positives, which is identifying more differences than there actually are is also bad, but a false positive is much better than a false negative. This means it is safer to report more differences than are actually present, which is why a higher number of reported differences is less precise. Maximum precision is reporting differences without any false negatives or false positives.

One way to achieve higher precision is to first capture as much equality as possible. For everything else that is left over prioritize unique items first and report these items as differences. Finally determine the next bit of equality or uniqueness and everything in between is either a change, an insertion, or a deletion.

Algorithm quality

The primary priorities when writing this kind of code are execution speed, precision (as previous described), and code simplicity. In most cases precision is the most important factor in code design, but in some cases speed is more important when processing large input or a batch of thousands of files. Simplicity is necessary so that other people can understand the code and modify it as necessary to improve upon the design and additional features. No algorithm is ever capable of meeting all needs for all use cases, so it is important that other people can understand the code with minimal effort.

Speed

Faster execution is the result of a couple of things. The most important consideration for making an algorithm faster is to reduce the number of passes through data where possible. After that eliminate all costly operations. In most programming languages simple arithmetic and static string comparisons are cheap to execute particularly if the examined strings aren't changing. The theoretical minimum number of data passes is two as you have to read the contents of each submitted sample. Pretty Diff achieves speed in its algorithm by only 3 complete passes through data and taking all possible effort to never repeat a step or loop iteration. The Pretty Diff approach is linear and predictable where the number of interations passing through data is computed as: number of iterations from the first sample + number of iterations from the second sample + the number of iterations from the smallest of those samples. Performance of the mentioned approach is heavily based upon key access in a hash map, which will likely vary by programming language.

The theoretical minimum number of data passes is two as you have to read the contents of each submitted sample. Until we discover a way to perform a comparison without reading from the samples we can safely assume 2 data passes is the fastest possible approach. Between that and the Pretty Diff approach there are two possibilities for increased performance. The first possibility is to make decisions and write output immediately upon reading from the second sample so as to have only two data passes. The challenge with this approach is that analysis occurs in the moment of the current loop interation without knowledge of what comes later in the second sample. A more practical second performance possibility is to write a smaller hash map. Writing a smaller hash map means moving some of the decision tree up front before a separate formal analysis phase. In order for this approach to be viable this early step in logic must be tiny, non-replicating, and a different means of iteration must be devised to account for a data store of unpredictable size.

This page blew up on Hacker News recently and many comments suggested this approach could not possibly be faster than the Myers' O(ND) approach. That may or may not be true and no evidence was provided either way (conjecture is not evidence).

In terms of experimental criteria algorithms are themselves largely irrelevant. More important is the implementation of those algorithms. If exercising the Myers' approach makes fewer decisions and has fewer total data passes, as described in the previous paragraph, then it likely is faster. I am calling out the word total because this makes all the difference. Many of the diff applications I looked at don't provide a complete third data pass. Instead they provide the minimum two complete data passes over the samples and various smaller passes as calculated by block moves and edit distances. If these various fractional passes are non-linear, which means starting and stopping in different places respectively, their performance is less predictable. If these fractional passes are non-linear and touch any given data index more than once they have repetitive logic, and likely are not as fast. To affirmatively guarantee superior performance over the Pretty Diff approach there needs to be fewer passes over data, which means no repetition and a smaller number of iterations. Preditability ensures the performance of an application scales roughtly proportionately to the size of the provided samples, where an unpredictable approach would scale disproportionately (slower over time). I say roughtly because things in physical reality always mess this up like: solar flares, memory block limitations, CPU heat, and so forth.

I believe the approach taken here is fast. I honestly cannot say, scientifically, it is the fastest ever (or slowest) approach for its level of accuracy without also writing alternate algorithms into applications with identical application constraints. Neither can anyone else. I can safely say this approach is the fastest ever comparative algorithm for its level of predictability and precision.

Output format

The Pretty Diff algorithm inherits its output format from the application jsdifflib. The output is a big array containing child arrays at each index. The child arrays each contain 5 indexes in the format: type, start in first sample, end in first sample, start in second sample, end in second sample. There are 4 types defined in the output: "equal", "replace", "insert", "delete". The "equal" type means both array index items are identical at the indexes compared. The "replace" type means that changes are present in the indexes compared. The "insert" type means the index of the second sample is unique to the second sample. The "delete" type means the index of the first sample is unique to the first sample.

How the algorithm works at a high level

Getting by with a hash map and some counts

The Pretty Diff algorithm takes after the Paul Heckel algorithm. First thing is to transform the string input into arrays. The most primative way to do this is to split the input by lines where each line of input is an index in an arry. Once the two input samples are converted to arrays they will each have to be examined.

The first loop will walk through an array representing one of the code samples and make a decision for a hash map which I call table in the code. I simply named the two arrays one and two. Each index of the array, by default a line of code, will serve as a key in the hash map named table. If this key does not exist then we will add it. The value for each key is an object with two properites named one and two. These properties are simply a count indicating the number of times the key is encountered in the two arrays. If the key is already present then we will simply increase the value of properties one or two respective of the current array.

Here is an example of this step in code:

do {
if (table[two[b]] === undefined) {
table[two[b]] = {
one: 0,
two: 1
};
} else {
table[two[b]].two += 1;
}
b += 1;
} while (b < lenb);

The third and final pass through the data

Once the table is fully populated with all indexes from both arrays one and two we need a third and final loop to walk across the data and make some simple decisions. Here is this complete step in code.

do {
c = a;
d = b;
if (one[a] === two[b]) {
equality();
} else if (table[one[a]][1] < 1 && table[two[b]][0] < 1) {
replaceUniques();
} else if (table[one[a]][1] < 1 && one[a + 1] !== two[b + 2]) {
deletion();
} else if (table[two[b]][0] < 1 && one[a + 2] !== two[b + 1]) {
insertion();
} else if (table[one[a]][0] - table[one[a]][1] === 1 && one[a + 1] !== two[b + 2]) {
deletionStatic();
} else if (table[two[b]][1] - table[two[b]][0] === 1 && one[a + 2] !== two[b + 1]) {
insertionStatic();
} else if (one[a + 1] === two[b]) {
deletion();
} else if (one[a] === two[b + 1]) {
insertion();
} else {
replacement();
}
a = a + 1;
b = b + 1;
} while (a < lena && b < lenb);

Before I go any further I want to be clear this logic is fast and simple, but it isn't precise at all. We will fix this later with a child function named fix. I chose to make corrections for precision as a secondary step so as to not disrupt performance and isolate complexity into a separate single location.

In this code references a and b are simply positive integer incrementors. The a reference is always associated with the one array while the b reference is always associated with the two array. This is necessary so that each array may increment independently without collision. The c and d references are secondary incrementors used as closures in the child functions. These secondary incrementors allow for child loops to occur without mutation to the primary incrementors.

The first thing that happens in this loop is an attempt to discover equality. Everything else is less important and so happens later in the decision tree.

The second rule is to identify items that occur only one more time in each array. If the next item to occur in the arrays is not equal but occurs just once more then we know the item is a unique change.

The third rule is to determine if the current item is a deletion. If the current item, and possibly additional following items, is present in the first sample but no longer exists in the second sample then it is a dynamic deletion.

The fourth rule is the same as the third rule but in reverse to determine dynamic insertions.

The fifth rule is to determine if the current occurs exactly one more time in the first sample than in the second sample and yet is not a match for the next item in the first sample with two items up in the second sample. This is a static deletion, or rather a deletion of a fixed number of items.

The sixth rule is the same as the fifth rule but in reverse to determine static insertions.

The seventh and final rule determines that the current items in the samples are non-unique changes.

The child functions are as follows:

- 1
2
3
4
5
6
7
8
9
10
11
- 12
13
14
15
16
17
18
19
20
- 21
22
23
24
25
26
27
28
29
30
- 31
32
33
34
35
36
37
38
39
- 40
41
42
43
44
45
46
47
- 48
49
50
51
52
53
54
55
56
57
58
- 59
60
61
62
63
64
65
66
67
68

equality = function diffview__opcodes_equality() {
do {
table[one[c]].one -= 1;
table[one[c]].two -= 1;
c += 1;
d += 1;
} while (c < lena && d < lenb && one[c] === two[d]);
fix(["equal", a, c, b, d]);
b = d - 1;
a = c - 1;
},
deletion = function diffview__opcodes_deletion() {
do {
table[one[c]].one -= 1;
c += 1;
} while (c < lena && table[one[c]].two < 1);
fix(["delete", a, c, -1, -1]);
a = c - 1;
b = d - 1;
},
deletionStatic = function diffview__opcodes_deletionStatic() {
table[one[a]].one -= 1;
fix([
"delete", a, a + 1,
-1,
-1
]);
a = c;
b = d - 1;
},
insertion = function diffview__opcodes_insertion() {
do {
table[two[d]].two -= 1;
d += 1;
} while (d < lenb && table[two[d]].one < 1);
fix(["insert", -1, -1, b, d]);
a = c - 1;
b = d - 1;
},
insertionStatic = function diffview__opcodes_insertionStatic() {
table[two[b]].two -= 1;
fix([
"insert", -1, -1, b, b + 1
]);
a = c - 1;
b = d;
},
replacement = function diffview__opcodes_replacement() {
do {
table[one[c]].one -= 1;
table[two[d]].two -= 1;
c += 1;
d += 1;
} while (c < lena && d < lenb && table[one[c]].two > 0 && table[two[d]].one > 0);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
},
replaceUniques = function diffview__opcodes_replaceUniques() {
do {
table[one[c]].one -= 1;
c += 1;
d += 1;
} while (c < lena && d < lenb && table[one[c]].two < 1 && table[two[d]].one < 1);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
}

It must be noted that most of the these child functions do contain their own loops, but these loops are not duplicates. The primary parent loop, the so called third and final loop, is adjusted forward by the distance these smaller loops increment so that there is no repetition or lose of execution speed.

The fix function

We can see from the child function code that arrays are generated which appear to be the format described as the child arrays of the output. Instead of pushing this data to the output array directly instead we are passing it through a function named fix which serves as a sort of quality control filter. This function checks for things like:

A diff type immediately following the same diff type, which can be combined into a single child array for the output
An insertion immediately following a deletion, or vise versa, which should likely be a replacement
Whether false positives generated from replacements immediately following an insertion or deletion can be eliminated

The general idea is to only accept a child array of the output as an argument and the make a decision after comparing it against the previous child array of the output. Aside from two narrow edge cases there is no analysis of the original data. Think of it like an appeals court in that no new evidence is allowed, because only the laws under scrutiny.

The code for the fix function is as follows:

fix = function diffview__opcodes_fix(code) {
var prior = codes[codes.length - 1];
if (prior !== undefined) {
if (prior[0] === code[0]) {
if (code[0] === "replace" || code[0] === "equal") {
prior[2] = code[2];
prior[4] = code[4];
} else if (code[0] === "delete") {
prior[2] = code[2];
} else if (code[0] === "insert") {
prior[4] = code[4];
}
return;
}
if (prior[0] === "insert" && prior[4] - prior[3] === 1) {
if (code[2] - code[1] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[2];
code[0] = "insert";
code[1] = -1;
code[2] = -1;
} else if (code[0] === "delete") {
code[0] = "replace";
code[3] = prior[3];
code[4] = prior[4];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "delete") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
code[1] = code[1] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (prior[0] === "insert" && code[0] === "delete" && code[2] - code[1] === 1) {
prior[4] = prior[4] - 1;
code[0] = "replace";
code[3] = prior[4];
code[4] = prior[4] + 1;
} else if (prior[0] === "delete" && prior[2] - prior[1] === 1) {
if (code[4] - code[3] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[4];
code[0] = "delete";
code[3] = -1;
code[4] = -1;
} else if (code[0] === "insert") {
code[0] = "replace";
code[1] = prior[1];
code[2] = prior[2];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "insert") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[3] + 1;
code[3] = code[3] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[4] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (prior[0] === "delete" && code[0] === "insert" && code[4] - code[3] === 1) {
prior[2] = prior[2] - 1;
code[0] = "replace";
code[1] = prior[2];
code[2] = prior[2] + 1;
} else if (prior[0] === "replace") {
if (code[0] === "delete") {
if (one[code[2] - 1] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[4] = prior[4] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2]] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 11;
table[one[c - 1]][0] = table[one[c - 1]][0] - 1;
}
}
} else if (code[0] === "insert") {
if (one[prior[2] - 1] === two[code[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2] - 1] === two[prior[4]]) {
if (prior[4] - prior[3] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 1;
table[two[d - 1]][1] = table[two[d - 1]][1] - 1;
}
}
}
}
}
codes.push(code);
};

Done!

That is the entire diff algorithm. Comments, feedback, and pull requests are welcome. Here is the complete function used in the Pretty Diff application:

- 1
2
- 3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
- 22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
- 152
153
154
155
156
157
158
159
160
161
162
- 163
164
165
166
167
168
169
170
171
- 172
173
174
175
176
177
178
179
180
181
- 182
183
184
185
186
187
188
189
190
- 191
192
193
194
195
196
197
198
- 199
200
201
202
203
204
205
206
207
208
209
- 210
211
212
213
214
215
216
217
218
219
220
- 221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
- 236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
- 254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299

function diffview__opcodes() {
var table = {},
lines = function diff_opcodes_lines(str) {
str = str
.replace(/\r\n/g, "\n")
.replace(/\r/g, "\n");
return str.split("\n");
},
one = (typeof options.source === "string")
? lines(options.source)
: options.source,
two = (typeof options.diff === "string")
? lines(options.diff)
: options.diff,
lena = one.length,
lenb = two.length,
a = 0,
b = 0,
c = 0,
d = 0,
codes = [],
fix = function diffview__opcodes_fix(code) {
var prior = codes[codes.length - 1];
if (prior !== undefined) {
if (prior[0] === code[0]) {
if (code[0] === "replace" || code[0] === "equal") {
prior[2] = code[2];
prior[4] = code[4];
} else if (code[0] === "delete") {
prior[2] = code[2];
} else if (code[0] === "insert") {
prior[4] = code[4];
}
return;
}
if (prior[0] === "insert" && prior[4] - prior[3] === 1) {
if (code[2] - code[1] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[2];
code[0] = "insert";
code[1] = -1;
code[2] = -1;
} else if (code[0] === "delete") {
code[0] = "replace";
code[3] = prior[3];
code[4] = prior[4];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "delete") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
code[1] = code[1] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[1] = code[1];
prior[2] = code[1] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (prior[0] === "insert" && code[0] === "delete" && code[2] - code[1] === 1) {
prior[4] = prior[4] - 1;
code[0] = "replace";
code[3] = prior[4];
code[4] = prior[4] + 1;
} else if (prior[0] === "delete" && prior[2] - prior[1] === 1) {
if (code[4] - code[3] === 1) {
if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[4];
code[0] = "delete";
code[3] = -1;
code[4] = -1;
} else if (code[0] === "insert") {
code[0] = "replace";
code[1] = prior[1];
code[2] = prior[2];
codes.pop();
prior = codes[codes.length - 1];
if (prior[0] === "replace") {
prior[2] = code[2];
prior[4] = code[4];
return;
}
}
} else if (code[0] === "insert") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[3] + 1;
code[3] = code[3] + 1;
} else if (code[0] === "replace") {
prior[0] = "replace";
prior[3] = code[3];
prior[4] = code[4] + 1;
c = prior[2];
d = prior[4];
return;
}
} else if (prior[0] === "delete" && code[0] === "insert" && code[4] - code[3] === 1) {
prior[2] = prior[2] - 1;
code[0] = "replace";
code[1] = prior[2];
code[2] = prior[2] + 1;
} else if (prior[0] === "replace") {
if (code[0] === "delete") {
if (one[code[2] - 1] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[4] = prior[4] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2]] === two[prior[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 11;
table[one[c - 1]][0] = table[one[c - 1]][0] - 1;
}
}
} else if (code[0] === "insert") {
if (one[prior[2] - 1] === two[code[4] - 1]) {
if (prior[2] - prior[1] > 1) {
prior[2] = prior[2] - 1;
}
c = c - 1;
d = d - 1;
return;
}
if (one[code[2] - 1] === two[prior[4]]) {
if (prior[4] - prior[3] > 1) {
prior[2] = prior[2] - 1;
prior[4] = prior[4] - 1;
table[two[d - 1]][1] = table[two[d - 1]][1] - 1;
}
}
}
}
}
codes.push(code);
},
equality = function diffview__opcodes_equality() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
table[one[c]][1] = table[one[c]][1] - 1;
c = c + 1;
d = d + 1;
} while (c < lena && d < lenb && one[c] === two[d]);
fix(["equal", a, c, b, d]);
b = d - 1;
a = c - 1;
},
deletion = function diffview__opcodes_deletion() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
c = c + 1;
} while (c < lena && table[one[c]][1] < 1);
fix(["delete", a, c, -1, -1]);
a = c - 1;
b = d - 1;
},
deletionStatic = function diffview__opcodes_deletionStatic() {
table[one[a]][0] = table[one[a]][0] - 1;
fix([
"delete", a, a + 1,
-1,
-1
]);
a = c;
b = d - 1;
},
insertion = function diffview__opcodes_insertion() {
do {
table[two[d]][1] = table[two[d]][1] - 1;
d = d + 1;
} while (d < lenb && table[two[d]][0] < 1);
fix(["insert", -1, -1, b, d]);
a = c - 1;
b = d - 1;
},
insertionStatic = function diffview__opcodes_insertionStatic() {
table[two[b]][1] = table[two[b]][1] - 1;
fix([
"insert", -1, -1, b, b + 1
]);
a = c - 1;
b = d;
},
replacement = function diffview__opcodes_replacement() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
table[two[d]][1] = table[two[d]][1] - 1;
c = c + 1;
d = d + 1;
} while (c < lena && d < lenb && table[one[c]][1] > 0 && table[two[d]][0] > 0);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
},
replaceUniques = function diffview__opcodes_replaceUniques() {
do {
table[one[c]][0] = table[one[c]][0] - 1;
c = c + 1;
d = d + 1;
} while (c < lena && d < lenb && table[one[c]][1] < 1 && table[two[d]][0] < 1);
fix(["replace", a, c, b, d]);
a = c - 1;
b = d - 1;
};
// * First Pass, account for lines from first file
// * build the table from the second file
do {
if (options.diffspaceignore === true) {
two[b] = two[b].replace(/\s+/g, "");
}
if (table[two[b]] === undefined) {
table[two[b]] = [0, 1];
} else {
table[two[b]][1] = table[two[b]][1] + 1;
}
b = b + 1;
} while (b < lenb);
// * Second Pass, account for lines from second file
// * build the table from the first file
lena = one.length;
a = 0;
do {
if (options.diffspaceignore === true) {
one[a] = one[a].replace(/\s+/g, "");
}
if (table[one[a]] === undefined) {
table[one[a]] = [1, 0];
} else {
table[one[a]][0] = table[one[a]][0] + 1;
}
a = a + 1;
} while (a < lena);
a = 0;
b = 0;
// find all equality... differences are what's left over solve only for the
// second set test removing reverse test removing undefined checks for table
// refs
do {
c = a;
d = b;
if (one[a] === two[b]) {
equality();
} else if (table[one[a]][1] < 1 && table[two[b]][0] < 1) {
replaceUniques();
} else if (table[one[a]][1] < 1 && one[a + 1] !== two[b + 2]) {
deletion();
} else if (table[two[b]][0] < 1 && one[a + 2] !== two[b + 1]) {
insertion();
} else if (table[one[a]][0] - table[one[a]][1] === 1 && one[a + 1] !== two[b + 2]) {
deletionStatic();
} else if (table[two[b]][1] - table[two[b]][0] === 1 && one[a + 2] !== two[b + 1]) {
insertionStatic();
} else if (one[a + 1] === two[b]) {
deletion();
} else if (one[a] === two[b + 1]) {
insertion();
} else {
replacement();
}
a = a + 1;
b = b + 1;
} while (a < lena && b < lenb);
if (lena - a === lenb - b) {
if (one[a] === two[b]) {
fix(["equal", a, lena, b, lenb]);
} else {
fix(["replace", a, lena, b, lenb]);
}
} else if (a < lena) {
fix(["delete", a, lena, -1, -1]);
} else if (b < lenb) {
fix(["insert", -1, -1, b, lenb]);
}
return codes;
}