Understanding how to compare

It is more about the equality than the differences

A good diff algorithm will attempt to identify as much equality as possible. Everything else qualifies as differences. The metric for quality and precision is a smaller count of captured differences. The smaller this number the better, presuming differences aren't escaping undetected.

False negatives, which is allowing differences through without detection, is really bad. This is absolute failure in a diff algorithm. False positives, which is identifying more differences than there actually are is also bad, but a false positive is much better than a false negative. This means it is safer to report more differences than are actually present, which is why a higher number of reported differences is less precise. Maximum precision is reporting differences without any false negatives or false positives.

One way to achieve higher precision is to first capture as much equality as possible. For everything else that is left over prioritize unique items first and report these items as differences. Finally determine the next bit of equality or uniqueness and everything in between is either a change, an insertion, or a deletion.

Algorithm quality

The primary priorities when writing this kind of code are execution speed, precision (as previous described), and code simplicity. In most cases precision is the most important factor in code design, but in some cases speed is more important when processing large input or a batch of thousands of files. Simplicity is necessary so that other people can understand the code and modify it as necessary to improve upon the design and additional features. No algorithm is ever capable of meeting all needs for all use cases, so it is important that other people can understand the code with minimal effort.

Speed

Faster execution is the result of a couple of things. The most important consideration for making an algorithm faster is to reduce the number of passes through data where possible. After that eliminate all costly operations. In most programming languages simple arithmetic and static string comparisons are cheap to execute particularly if the examined strings aren't changing. The theoretical minimum number of data passes is two as you have to read the contents of each submitted sample. Pretty Diff achieves speed in its algorithm by only 3 complete passes through data and taking all possible effort to never repeat a step or loop iteration. The Pretty Diff approach is linear and predictable where the number of interations passing through data is computed as: number of iterations from the first sample + number of iterations from the second sample + the number of iterations from the smallest of those samples. Performance of the mentioned approach is heavily based upon key access in a hash map, which will likely vary by programming language.

The theoretical minimum number of data passes is two as you have to read the contents of each submitted sample. Until we discover a way to perform a comparison without reading from the samples we can safely assume 2 data passes is the fastest possible approach. Between that and the Pretty Diff approach there are two possibilities for increased performance. The first possibility is to make decisions and write output immediately upon reading from the second sample so as to have only two data passes. The challenge with this approach is that analysis occurs in the moment of the current loop interation without knowledge of what comes later in the second sample. A more practical second performance possibility is to write a smaller hash map. Writing a smaller hash map means moving some of the decision tree up front before a separate formal analysis phase. In order for this approach to be viable this early step in logic must be tiny, non-replicating, and a different means of iteration must be devised to account for a data store of unpredictable size.

This page blew up on Hacker News recently and many comments suggested this approach could not possibly be faster than the Myers' O(ND) approach. That may or may not be true and no evidence was provided either way (conjecture is not evidence).

In terms of experimental criteria algorithms are themselves largely irrelevant. More important is the implementation of those algorithms. If exercising the Myers' approach makes fewer decisions and has fewer total data passes, as described in the previous paragraph, then it likely is faster. I am calling out the word total because this makes all the difference. Many of the diff applications I looked at don't provide a complete third data pass. Instead they provide the minimum two complete data passes over the samples and various smaller passes as calculated by block moves and edit distances. If these various fractional passes are non-linear, which means starting and stopping in different places respectively, their performance is less predictable. If these fractional passes are non-linear and touch any given data index more than once they have repetitive logic, and likely are not as fast. To affirmatively guarantee superior performance over the Pretty Diff approach there needs to be fewer passes over data, which means no repetition and a smaller number of iterations. Preditability ensures the performance of an application scales roughtly proportionately to the size of the provided samples, where an unpredictable approach would scale disproportionately (slower over time). I safe roughtly because things in physical reality always mess this up like: solar flares, memory block limitations, CPU heat, and so forth.

I believe the approach taken here is fast. I honestly cannot say, scientifically, it is the fastest ever (or slowest) approach for its level of accuracy without also writing alternate algorithms into applications with identical application constraints. Neither can anyone else. I can safely say this approach is the fastest ever comparative algorithm for its level of predictability and precision.

Output format

The Pretty Diff algorithm inherits its output format from the application jsdifflib. The output is a big array containing child arrays at each index. The child arrays each contain 5 indexes in the format: type, start in first sample, end in first sample, start in second sample, end in second sample. There are 4 types defined in the output: "equal", "replace", "insert", "delete". The "equal" type means both array index items are identical at the indexes compared. The "replace" type means that changes are present in the indexes compared. The "insert" type means the index of the second sample is unique to the second sample. The "delete" type means the index of the first sample is unique to the first sample.

How the algorithm works at a high level

Getting by with a hash map and some counts

The Pretty Diff algorithm takes after the Paul Heckel algorithm. First thing is to transform the string input into arrays. The most primative way to do this is to split the input by lines where each line of input is an index in an arry. Once the two input samples are converted to arrays they will each have to be examined.

The first loop will walk through an array representing one of the code samples and make a decision for a hash map which I call table in the code. I simply named the two arrays one and two. Each index of the array, by default a line of code, will serve as a key in the hash map named table. If this key does not exist then we will add it. The value for each key is an object with two properites named one and two. These properties are simply a count indicating the number of times the key is encountered in the two arrays. If the key is already present then we will simply increase the value of properties one or two respective of the current array.

Here is an example of this step in code:

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  1. do {
  2. if (table[two[b]] === undefined) {
  3. table[two[b]] = {
  4. one: 0,
  5. two: 1
  6. };
  7. } else {
  8. table[two[b]].two += 1;
  9. }
  10. b += 1;
  11. } while (b < lenb);

The third and final pass through the data

Once the table is fully populated with all indexes from both arrays one and two we need a third and final loop to walk across the data and make some simple decisions. Here is this complete step in code.

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  1. do {
  2. c = a;
  3. d = b;
  4. if (one[a] === two[b]) {
  5. equality();
  6. } else if (table[one[a]].two < 1 && table[two[b]].one < 1) {
  7. replaceUniques();
  8. } else if (table[one[a]].two < 1 && one[a + 1] !== two[b + 2]) {
  9. deletion();
  10. } else if (table[two[b]].one < 1 && one[a + 2] !== two[b + 1]) {
  11. insertion();
  12. } else if (table[one[a]].one - table[one[a]].two === 1 && one[a + 1] !== two[b + 2]) {
  13. deletionStatic();
  14. } else if (table[two[b]].two - table[two[b]].one === 1 && one[a + 2] !== two[b + 1]) {
  15. insertionStatic();
  16. } else {
  17. replacement();
  18. }
  19. a += 1;
  20. b += 1;
  21. } while (a < lena && b < lenb);

Before I go any further I want to be clear this logic is fast and simple, but it isn't precise at all. We will fix this later with a child function named fix. I chose to make corrections for precision as a secondary step so as to not disrupt performance and isolate complexity into a separate single location.

In this code references a and b are simply positive integer incrementors. The a reference is always associated with the one array while the b reference is always associated with the two array. This is necessary so that each array may increment independently without collision. The c and d references are secondary incrementors used as closures in the child functions. These secondary incrementors allow for child loops to occur without mutation to the primary incrementors.

The first thing that happens in this loop is an attempt to discover equality. Everything else is less important and so happens later in the decision tree.

The second rule is to identify items that occur only one more time in each array. If the next item to occur in the arrays is not equal but occurs just once more then we know the item is a unique change.

The third rule is to determine if the current item is a deletion. If the current item, and possibly additional following items, is present in the first sample but no longer exists in the second sample then it is a dynamic deletion.

The fourth rule is the same as the third rule but in reverse to determine dynamic insertions.

The fifth rule is to determine if the current occurs exactly one more time in the first sample than in the second sample and yet is not a match for the next item in the first sample with two items up in the second sample. This is a static deletion, or rather a deletion of a fixed number of items.

The sixth rule is the same as the fifth rule but in reverse to determine static insertions.

The seventh and final rule determines that the current items in the samples are non-unique changes.

The child functions are as follows:

  1. - 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. - 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. - 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. - 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. - 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. - 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. - 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  1. equality = function diffview__opcodes_equality() {
  2. do {
  3. table[one[c]].one -= 1;
  4. table[one[c]].two -= 1;
  5. c += 1;
  6. d += 1;
  7. } while (c < lena && d < lenb && one[c] === two[d]);
  8. fix(["equal", a, c, b, d]);
  9. b = d - 1;
  10. a = c - 1;
  11. },
  12. deletion = function diffview__opcodes_deletion() {
  13. do {
  14. table[one[c]].one -= 1;
  15. c += 1;
  16. } while (c < lena && table[one[c]].two < 1);
  17. fix(["delete", a, c, -1, -1]);
  18. a = c - 1;
  19. b = d - 1;
  20. },
  21. deletionStatic = function diffview__opcodes_deletionStatic() {
  22. table[one[a]].one -= 1;
  23. fix([
  24. "delete", a, a + 1,
  25. -1,
  26. -1
  27. ]);
  28. a = c;
  29. b = d - 1;
  30. },
  31. insertion = function diffview__opcodes_insertion() {
  32. do {
  33. table[two[d]].two -= 1;
  34. d += 1;
  35. } while (d < lenb && table[two[d]].one < 1);
  36. fix(["insert", -1, -1, b, d]);
  37. a = c - 1;
  38. b = d - 1;
  39. },
  40. insertionStatic = function diffview__opcodes_insertionStatic() {
  41. table[two[b]].two -= 1;
  42. fix([
  43. "insert", -1, -1, b, b + 1
  44. ]);
  45. a = c - 1;
  46. b = d;
  47. },
  48. replacement = function diffview__opcodes_replacement() {
  49. do {
  50. table[one[c]].one -= 1;
  51. table[two[d]].two -= 1;
  52. c += 1;
  53. d += 1;
  54. } while (c < lena && d < lenb && table[one[c]].two > 0 && table[two[d]].one > 0);
  55. fix(["replace", a, c, b, d]);
  56. a = c - 1;
  57. b = d - 1;
  58. },
  59. replaceUniques = function diffview__opcodes_replaceUniques() {
  60. do {
  61. table[one[c]].one -= 1;
  62. c += 1;
  63. d += 1;
  64. } while (c < lena && d < lenb && table[one[c]].two < 1 && table[two[d]].one < 1);
  65. fix(["replace", a, c, b, d]);
  66. a = c - 1;
  67. b = d - 1;
  68. }

It must be noted that most of the these child functions do contain their own loops, but these loops are not duplicates. The primary parent loop, the so called third and final loop, is adjusted forward by the distance these smaller loops increment so that there is no repetition or lose of execution speed.

The fix function

We can see from the child function code that arrays are generated which appear to be the format described as the child arrays of the output. Instead of pushing this data to the output array directly instead we are passing it through a function named fix which serves as a sort of quality control filter. This function checks for things like:

  • A diff type immediately following the same diff type, which can be combined into a single child array for the output
  • An insertion immediately following a deletion, or vise versa, which should likely be a replacement
  • Whether false positives generated from replacements immediately following an insertion or deletion can be eliminated

The general idea is to only accept a child array of the output as an argument and the make a decision after comparing it against the previous child array of the output. Aside from two narrow edge cases there is no analysis of the original data. Think of it like an appeals court in that no new evidence is allowed, because only the laws under scrutiny.

The code for the fix function is as follows:

  1. - 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
  106. 106
  107. 107
  108. 108
  109. 109
  110. 110
  111. 111
  112. 112
  113. 113
  114. 114
  115. 115
  116. 116
  117. 117
  118. 118
  119. 119
  120. 120
  121. 121
  122. 122
  123. 123
  124. 124
  125. 125
  126. 126
  127. 127
  128. 128
  129. 129
  130. 130
  131. 131
  132. 132
  133. 133
  134. 134
  135. 135
  136. 136
  137. 137
  138. 138
  139. 139
  140. 140
  141. 141
  142. 142
  143. 143
  144. 144
  145. 145
  146. 146
  147. 147
  148. 148
  149. 149
  150. 150
  151. 151
  152. 152
  153. 153
  154. 154
  155. 155
  156. 156
  157. 157
  1. fix = function diffview__opcodes_fix(code) {
  2. var prior = codes[codes.length - 1],
  3. e = 0;
  4. if (prior !== undefined) {
  5. if (prior[0] === code[0]) {
  6. if (code[0] === "replace" || code[0] === "equal") {
  7. prior[2] = code[2];
  8. prior[4] = code[4];
  9. } else if (code[0] === "delete") {
  10. prior[2] = code[2];
  11. } else if (code[0] === "insert") {
  12. prior[4] = code[4];
  13. }
  14. return;
  15. }
  16. if (prior[0] === "insert" && prior[4] - prior[3] === 1) {
  17. if (code[2] - code[1] === 1) {
  18. if (code[0] === "replace") {
  19. prior[0] = "replace";
  20. prior[1] = code[1];
  21. prior[2] = code[2];
  22. code[0] = "insert";
  23. code[1] = -1;
  24. code[2] = -1;
  25. } else if (code[0] === "delete") {
  26. code[0] = "replace";
  27. code[3] = prior[3];
  28. code[4] = prior[4];
  29. codes.pop();
  30. prior = codes[codes.length - 1];
  31. if (prior !== undefined && prior[0] === "replace") {
  32. prior[2] = code[2];
  33. prior[4] = code[4];
  34. return;
  35. }
  36. }
  37. } else if (code[0] === "delete") {
  38. prior[0] = "replace";
  39. prior[1] = code[1];
  40. prior[2] = code[1] + 1;
  41. code[1] += 1;
  42. } else if (code[0] === "replace") {
  43. e = a;
  44. do {
  45. e += 1;
  46. } while (e < lena && one[e] !== two[b]);
  47. e = e - (a + 1);
  48. if (code[2] - code[1] >= e) {
  49. prior[0] = "replace";
  50. prior[1] = code[1];
  51. prior[2] = code[1] + 1;
  52. code[0] = "delete";
  53. code[1] = code[1] + 1;
  54. code[2] = code[1] + e;
  55. code[3] = -1;
  56. code[4] = -1;
  57. do {
  58. d -= 1;
  59. table[two[d]].two += 1;
  60. } while (d > b);
  61. }
  62. }
  63. } else if (prior[0] === "insert" && code[0] === "delete" && code[2] - code[1] === 1) {
  64. prior[4] -= 1;
  65. code[0] = "replace";
  66. code[3] = prior[4];
  67. code[4] = prior[4] + 1;
  68. } else if (prior[0] === "delete" && prior[2] - prior[1] === 1) {
  69. if (code[4] - code[3] === 1) {
  70. if (code[0] === "replace") {
  71. prior[0] = "replace";
  72. prior[3] = code[3];
  73. prior[4] = code[4];
  74. code[0] = "delete";
  75. code[3] = -1;
  76. code[4] = -1;
  77. } else if (code[0] === "insert") {
  78. code[0] = "replace";
  79. code[1] = prior[1];
  80. code[2] = prior[2];
  81. codes.pop();
  82. prior = codes[codes.length - 1];
  83. if (prior !== undefined && prior[0] === "replace") {
  84. prior[2] = code[2];
  85. prior[4] = code[4];
  86. return;
  87. }
  88. }
  89. } else if (code[0] === "insert") {
  90. prior[0] = "replace";
  91. prior[3] = code[3];
  92. prior[4] = code[3] + 1;
  93. code[3] += 1;
  94. } else if (code[0] === "replace") {
  95. e = b;
  96. do {
  97. e += 1;
  98. } while (e < lenb && two[e] !== one[a]);
  99. e = e - (b + 1);
  100. if (code[3] - code[4] >= e) {
  101. prior[0] = "replace";
  102. prior[3] = code[3];
  103. prior[4] = code[4] + 1;
  104. code[0] = "insert";
  105. code[1] = -1;
  106. code[2] = -1;
  107. code[3] = code[3] + 1;
  108. code[4] = code[4] + e;
  109. do {
  110. c -= 1;
  111. table[one[c]].one += 1;
  112. } while (c > a);
  113. }
  114. }
  115. } else if (prior[0] === "delete" && code[0] === "insert" && code[4] - code[3] === 1) {
  116. prior[2] -= 1;
  117. code[0] = "replace";
  118. code[1] = prior[2];
  119. code[2] = prior[2] + 1;
  120. } else if (prior[0] === "replace") {
  121. if (code[0] === "delete") {
  122. if (one[code[2] - 1] === two[prior[4] - 1]) {
  123. if (prior[2] - prior[1] > 1) {
  124. prior[4] -= 1;
  125. }
  126. c -= 1;
  127. d -= 1;
  128. return;
  129. }
  130. if (one[code[2]] === two[prior[4] - 1]) {
  131. if (prior[2] - prior[1] > 1) {
  132. prior[2] -= 1;
  133. prior[4] -= 1;
  134. table[one[c - 1]].one += 1;
  135. }
  136. }
  137. } else if (code[0] === "insert") {
  138. if (one[prior[2] - 1] === two[code[4] - 1]) {
  139. if (prior[2] - prior[1] > 1) {
  140. prior[2] -= 1;
  141. }
  142. c -= 1;
  143. d -= 1;
  144. return;
  145. }
  146. if (one[code[2] - 1] === two[prior[4]]) {
  147. if (prior[4] - prior[3] > 1) {
  148. prior[2] -= 1;
  149. prior[4] -= 1;
  150. table[two[d - 1]].two += 1;
  151. }
  152. }
  153. }
  154. }
  155. }
  156. codes.push(code);
  157. }

Done!

That is the entire diff algorithm. Comments, feedback, and pull requests are welcome. Here is the complete function used in the Pretty Diff application:

  1. - 1
  2. 2
  3. - 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. - 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102
  103. 103
  104. 104
  105. 105
  106. 106
  107. 107
  108. 108
  109. 109
  110. 110
  111. 111
  112. 112
  113. 113
  114. 114
  115. 115
  116. 116
  117. 117
  118. 118
  119. 119
  120. 120
  121. 121
  122. 122
  123. 123
  124. 124
  125. 125
  126. 126
  127. 127
  128. 128
  129. 129
  130. 130
  131. 131
  132. 132
  133. 133
  134. 134
  135. 135
  136. 136
  137. 137
  138. 138
  139. 139
  140. 140
  141. 141
  142. 142
  143. 143
  144. 144
  145. 145
  146. 146
  147. 147
  148. 148
  149. 149
  150. 150
  151. 151
  152. 152
  153. 153
  154. 154
  155. 155
  156. 156
  157. 157
  158. 158
  159. 159
  160. 160
  161. 161
  162. 162
  163. 163
  164. 164
  165. 165
  166. 166
  167. 167
  168. 168
  169. 169
  170. 170
  171. 171
  172. 172
  173. 173
  174. 174
  175. 175
  176. 176
  177. 177
  178. 178
  179. - 179
  180. 180
  181. 181
  182. 182
  183. 183
  184. 184
  185. 185
  186. 186
  187. 187
  188. 188
  189. 189
  190. - 190
  191. 191
  192. 192
  193. 193
  194. 194
  195. 195
  196. 196
  197. 197
  198. 198
  199. - 199
  200. 200
  201. 201
  202. 202
  203. 203
  204. 204
  205. 205
  206. 206
  207. 207
  208. 208
  209. - 209
  210. 210
  211. 211
  212. 212
  213. 213
  214. 214
  215. 215
  216. 216
  217. 217
  218. - 218
  219. 219
  220. 220
  221. 221
  222. 222
  223. 223
  224. 224
  225. 225
  226. - 226
  227. 227
  228. 228
  229. 229
  230. 230
  231. 231
  232. 232
  233. 233
  234. 234
  235. 235
  236. 236
  237. - 237
  238. 238
  239. 239
  240. 240
  241. 241
  242. 242
  243. 243
  244. 244
  245. 245
  246. 246
  247. 247
  248. 248
  249. 249
  250. 250
  251. 251
  252. 252
  253. 253
  254. 254
  255. 255
  256. 256
  257. 257
  258. 258
  259. 259
  260. 260
  261. 261
  262. 262
  263. 263
  264. 264
  265. 265
  266. 266
  267. 267
  268. 268
  269. 269
  270. 270
  271. 271
  272. 272
  273. 273
  274. 274
  275. 275
  276. 276
  277. 277
  278. 278
  279. 279
  280. 280
  281. 281
  282. 282
  283. 283
  284. 284
  285. 285
  286. 286
  287. 287
  288. 288
  289. 289
  290. 290
  291. 291
  292. 292
  293. 293
  294. 294
  295. 295
  296. 296
  297. 297
  298. 298
  299. 299
  300. 300
  301. 301
  302. 302
  303. 303
  304. 304
  305. 305
  306. 306
  307. 307
  308. 308
  309. 309
  310. 310
  311. 311
  312. 312
  313. 313
  314. 314
  315. 315
  316. 316
  317. 317
  318. 318
  1. function diffview__opcodes() {
  2. var table = {},
  3. lines = function diff_opcodes_lines(str) {
  4. str = str
  5. .replace(/\r\n/g, "\n")
  6. .replace(/\r/g, "\n");
  7. return str.split("\n");
  8. },
  9. one = (typeof options.source === "string")
  10. ? lines(options.source)
  11. : options.source,
  12. two = (typeof options.diff === "string")
  13. ? lines(options.diff)
  14. : options.diff,
  15. lena = one.length,
  16. lenb = two.length,
  17. a = 0,
  18. b = 0,
  19. c = 0,
  20. d = 0,
  21. codes = [],
  22. fix = function diffview__opcodes_fix(code) {
  23. var prior = codes[codes.length - 1],
  24. e = 0;
  25. if (prior !== undefined) {
  26. if (prior[0] === code[0]) {
  27. if (code[0] === "replace" || code[0] === "equal") {
  28. prior[2] = code[2];
  29. prior[4] = code[4];
  30. } else if (code[0] === "delete") {
  31. prior[2] = code[2];
  32. } else if (code[0] === "insert") {
  33. prior[4] = code[4];
  34. }
  35. return;
  36. }
  37. if (prior[0] === "insert" && prior[4] - prior[3] === 1) {
  38. if (code[2] - code[1] === 1) {
  39. if (code[0] === "replace") {
  40. prior[0] = "replace";
  41. prior[1] = code[1];
  42. prior[2] = code[2];
  43. code[0] = "insert";
  44. code[1] = -1;
  45. code[2] = -1;
  46. } else if (code[0] === "delete") {
  47. code[0] = "replace";
  48. code[3] = prior[3];
  49. code[4] = prior[4];
  50. codes.pop();
  51. prior = codes[codes.length - 1];
  52. if (prior !== undefined && prior[0] === "replace") {
  53. prior[2] = code[2];
  54. prior[4] = code[4];
  55. return;
  56. }
  57. }
  58. } else if (code[0] === "delete") {
  59. prior[0] = "replace";
  60. prior[1] = code[1];
  61. prior[2] = code[1] + 1;
  62. code[1] += 1;
  63. } else if (code[0] === "replace") {
  64. e = a;
  65. do {
  66. e += 1;
  67. } while (e < lena && one[e] !== two[b]);
  68. e = e - (a + 1);
  69. if (code[2] - code[1] >= e) {
  70. prior[0] = "replace";
  71. prior[1] = code[1];
  72. prior[2] = code[1] + 1;
  73. code[0] = "delete";
  74. code[1] = code[1] + 1;
  75. code[2] = code[1] + e;
  76. code[3] = -1;
  77. code[4] = -1;
  78. do {
  79. d -= 1;
  80. table[two[d]].two += 1;
  81. } while (d > b);
  82. }
  83. }
  84. } else if (prior[0] === "insert" && code[0] === "delete" && code[2] - code[1] === 1) {
  85. prior[4] -= 1;
  86. code[0] = "replace";
  87. code[3] = prior[4];
  88. code[4] = prior[4] + 1;
  89. } else if (prior[0] === "delete" && prior[2] - prior[1] === 1) {
  90. if (code[4] - code[3] === 1) {
  91. if (code[0] === "replace") {
  92. prior[0] = "replace";
  93. prior[3] = code[3];
  94. prior[4] = code[4];
  95. code[0] = "delete";
  96. code[3] = -1;
  97. code[4] = -1;
  98. } else if (code[0] === "insert") {
  99. code[0] = "replace";
  100. code[1] = prior[1];
  101. code[2] = prior[2];
  102. codes.pop();
  103. prior = codes[codes.length - 1];
  104. if (prior !== undefined && prior[0] === "replace") {
  105. prior[2] = code[2];
  106. prior[4] = code[4];
  107. return;
  108. }
  109. }
  110. } else if (code[0] === "insert") {
  111. prior[0] = "replace";
  112. prior[3] = code[3];
  113. prior[4] = code[3] + 1;
  114. code[3] += 1;
  115. } else if (code[0] === "replace") {
  116. e = b;
  117. do {
  118. e += 1;
  119. } while (e < lenb && two[e] !== one[a]);
  120. e = e - (b + 1);
  121. if (code[3] - code[4] >= e) {
  122. prior[0] = "replace";
  123. prior[3] = code[3];
  124. prior[4] = code[4] + 1;
  125. code[0] = "insert";
  126. code[1] = -1;
  127. code[2] = -1;
  128. code[3] = code[3] + 1;
  129. code[4] = code[4] + e;
  130. do {
  131. c -= 1;
  132. table[one[c]].one += 1;
  133. } while (c > a);
  134. }
  135. }
  136. } else if (prior[0] === "delete" && code[0] === "insert" && code[4] - code[3] === 1) {
  137. prior[2] -= 1;
  138. code[0] = "replace";
  139. code[1] = prior[2];
  140. code[2] = prior[2] + 1;
  141. } else if (prior[0] === "replace") {
  142. if (code[0] === "delete") {
  143. if (one[code[2] - 1] === two[prior[4] - 1]) {
  144. if (prior[2] - prior[1] > 1) {
  145. prior[4] -= 1;
  146. }
  147. c -= 1;
  148. d -= 1;
  149. return;
  150. }
  151. if (one[code[2]] === two[prior[4] - 1]) {
  152. if (prior[2] - prior[1] > 1) {
  153. prior[2] -= 1;
  154. prior[4] -= 1;
  155. table[one[c - 1]].one += 1;
  156. }
  157. }
  158. } else if (code[0] === "insert") {
  159. if (one[prior[2] - 1] === two[code[4] - 1]) {
  160. if (prior[2] - prior[1] > 1) {
  161. prior[2] -= 1;
  162. }
  163. c -= 1;
  164. d -= 1;
  165. return;
  166. }
  167. if (one[code[2] - 1] === two[prior[4]]) {
  168. if (prior[4] - prior[3] > 1) {
  169. prior[2] -= 1;
  170. prior[4] -= 1;
  171. table[two[d - 1]].two += 1;
  172. }
  173. }
  174. }
  175. }
  176. }
  177. codes.push(code);
  178. },
  179. equality = function diffview__opcodes_equality() {
  180. do {
  181. table[one[c]].one -= 1;
  182. table[one[c]].two -= 1;
  183. c += 1;
  184. d += 1;
  185. } while (c < lena && one[c] === two[d]);
  186. fix(["equal", a, c, b, d]);
  187. b = d - 1;
  188. a = c - 1;
  189. },
  190. deletion = function diffview__opcodes_deletion() {
  191. do {
  192. table[one[c]].one -= 1;
  193. c += 1;
  194. } while (c < lena && table[one[c]].two < 1);
  195. fix(["delete", a, c, -1, -1]);
  196. a = c - 1;
  197. b = d - 1;
  198. },
  199. deletionStatic = function diffview__opcodes_deletionStatic() {
  200. table[one[a]].one -= 1;
  201. fix([
  202. "delete", a, a + 1,
  203. -1,
  204. -1
  205. ]);
  206. a = c;
  207. b = d - 1;
  208. },
  209. insertion = function diffview__opcodes_insertion() {
  210. do {
  211. table[two[d]].two -= 1;
  212. d += 1;
  213. } while (d < lenb && table[two[d]].one < 1);
  214. fix(["insert", -1, -1, b, d]);
  215. a = c - 1;
  216. b = d - 1;
  217. },
  218. insertionStatic = function diffview__opcodes_insertionStatic() {
  219. table[two[b]].two -= 1;
  220. fix([
  221. "insert", -1, -1, b, b + 1
  222. ]);
  223. a = c - 1;
  224. b = d;
  225. },
  226. replacement = function diffview__opcodes_replacement() {
  227. do {
  228. table[one[c]].one -= 1;
  229. table[two[d]].two -= 1;
  230. c += 1;
  231. d += 1;
  232. } while (c < lena && d < lenb && table[one[c]].two > 0 && table[two[d]].one > 0);
  233. fix(["replace", a, c, b, d]);
  234. a = c - 1;
  235. b = d - 1;
  236. },
  237. replaceUniques = function diffview__opcodes_replaceUniques() {
  238. do {
  239. table[one[c]].one -= 1;
  240. c += 1;
  241. d += 1;
  242. } while (c < lena && d < lenb && table[one[c]].two < 1 && table[two[d]].one < 1);
  243. fix(["replace", a, c, b, d]);
  244. a = c - 1;
  245. b = d - 1;
  246. };
  247. do {
  248. if (options.diffspaceignore === true) {
  249. two[b] = two[b].replace(/\s+/g, "");
  250. }
  251. if (table[two[b]] === undefined) {
  252. table[two[b]] = {
  253. one: 0,
  254. two: 1
  255. };
  256. } else {
  257. table[two[b]].two += 1;
  258. }
  259. b += 1;
  260. } while (b < lenb);
  261. lena = one.length;
  262. a = 0;
  263. do {
  264. if (options.diffspaceignore === true) {
  265. one[a] = one[a].replace(/\s+/g, "");
  266. }
  267. if (table[one[a]] === undefined) {
  268. table[one[a]] = {
  269. one: 1,
  270. two: 0
  271. };
  272. } else {
  273. table[one[a]].one += 1;
  274. }
  275. a += 1;
  276. } while (a < lena);
  277. a = 0;
  278. b = 0;
  279. do {
  280. c = a;
  281. d = b;
  282. if (one[a] === two[b]) {
  283. equality();
  284. } else if (table[one[a]].two < 1 && table[two[b]].one < 1) {
  285. replaceUniques();
  286. } else if (table[one[a]].two < 1 && one[a + 1] !== two[b + 2]) {
  287. deletion();
  288. } else if (table[two[b]].one < 1 && one[a + 2] !== two[b + 1]) {
  289. insertion();
  290. } else if (table[one[a]].one - table[one[a]].two === 1 && one[a + 1] !== two[b + 2]) {
  291. deletionStatic();
  292. } else if (table[two[b]].two - table[two[b]].one === 1 && one[a + 2] !== two[b + 1]) {
  293. insertionStatic();
  294. } else {
  295. replacement();
  296. }
  297. a += 1;
  298. b += 1;
  299. } while (a < lena && b < lenb);
  300. if (lena - a === lenb - b) {
  301. if (one[a] === two[b]) {
  302. fix(["equal", a, lena, b, lenb]);
  303. } else {
  304. fix(["replace", a, lena, b, lenb]);
  305. }
  306. } else if (a < lena) {
  307. fix(["delete", a, lena, -1, -1]);
  308. } else if (b < lenb) {
  309. fix(["insert", -1, -1, b, lenb]);
  310. }
  311. return codes;
  312. };