How to write a parser

Purpose

This guide introduces basic terminology and techniques for describing parsing applications and then follows with a simple step-by-step walkthrough for writing one. The scope of this guide is limited to formal techniques for interpreting computer languages and datasets. It does not cover compilation processes or tasks related to application execution.

Introduction

A parsing application takes some input and transforms it into a form that computers can understand. The ability to write parsers is to programming as the ability to write is to literacy, which is to say that learning to write a simple parsing application opens new worlds of capabilities. Some quick examples are a search engine spider, quickly scraping relevant data out of a spreadsheet, analyzing financial data, or quickly making sense of human language. The ability to understand and write parsing applications will quickly transform any mediocre programmer into a near rockstar.

The concepts, opinions, and theories expressed in this guide are universal to programming. The methods and approaches are expressed in the JavaScript language.

Definition of terms

  • abstract syntax tree (AST)

    An AST is a single organized body of output that describes tokens using the descriptors provided by the parser, arranged so that the relationships between tokens reflect the grammar(s) of the consuming application. It is typically one large object where each data facet is a token with its descriptions, and child objects nest beneath each token as organized by a language grammar. This is the most common output format produced by parsers.

    Here is an example of an AST generated by the Esprima parser for the code: var answer = 6 * 7;

    {
        "body": [
            {
                "declarations": [
                    {
                        "id": {
                            "name": "answer",
                            "type": "Identifier"
                        },
                        "init": {
                            "left": {
                                "raw": "6",
                                "type": "Literal",
                                "value": 6
                            },
                            "operator": "*",
                            "right": {
                                "raw": "7",
                                "type": "Literal",
                                "value": 7
                            },
                            "type": "BinaryExpression"
                        },
                        "type": "VariableDeclarator"
                    }
                ],
                "kind": "var",
                "type": "VariableDeclaration"
            }
        ],
        "sourceType": "script",
        "type": "Program"
    }
  • grammar

    A second set of rules, applied after syntax, that determines how parsed tokens come together to form output that is intelligible to the consuming application. For example, syntax determines how to form words and sentences in human written language, but grammar determines how those words and sentences come together in a way that can be understood.

  • lexer

    A lexer is a utility that scans input looking for syntax rules in order to organize the input into tokens. The lexer will determine when a part of the input forms a code comment, for example, and where that comment ends so as to begin something else.

  • parallel array

    The term parallel array is a recognized computer science term used in this guide, but it is not terminology commonly associated with parsing. Parallel arrays are a technique for quickly populating data: several arrays of equal length are maintained so that a single index describes one record across all of them, much like a table where each array is a column and each index is a row.

  • parser

    A parser is an analysis scheme that examines the list of tokens produced by a lexer to determine a variety of qualities, such as data types, token relationships, validity, and other higher-order characteristics.

  • parse table

    An alternative output format to the AST. Instead of arranging data in a single giant object representing a tree of tokens, a parse table outputs data in a table. The parse tables formed by Pretty Diff and demonstrated in this guide are built using a technique called parallel arrays.

    This term should not be confused with the identical term used in left-factorization compilation schemes. Please observe the following example of a parse table generated by the Pretty Diff application for the code: <a><b>sample text</b></a>

    index  attrs  begin  daddy   jscom  linen  lines  presv  token        types
    0      []     0      "root"  false  1      0      false  <a>          "start"
    1      []     0      "a"     false  1      0      false  <b>          "start"
    2      []     1      "b"     false  2      1      false  sample text  "content"
    3      []     1      "b"     false  3      1      false  </b>         "end"
    4      []     1      "b"     false  3      1      false  </a>         "end"
  • syntax

    A term that refers to the defining rule set that determines how valid input is formed. Syntax most commonly refers to the rules necessary to form a programming language, but it represents the bounds within which any lexer operates even when the input is not a programming language.

  • token

    A token is an element of output formed from examining the input against a set of rules.

  • type

    A type, commonly referred to as a language data type, is a categorical label for a given token. Some common examples of data types are: string, comment, object, null, and number.

Preferences, parsing the Pretty Diff way

In many parsing applications the lexer and parser are well separated subjects. In the common approach a lexer scans an input and produces tokens for the parser to analyze and the parser produces an AST for output. When the applications are well optimized they can execute more than twice as fast as the Pretty Diff approach, but there are limitations to this approach.

The parsers used in the Pretty Diff application combine the lexer and parser into a single operation. The lexer scans the input looking for syntax, and once a token is identified it is immediately parsed before the lexer progresses forward. This allows advanced decisions, such as code correction and grammar analysis, to occur immediately. These advanced features make the total execution time of the parsing operation slower, which can make the Pretty Diff approach to parsing appear more than twice as slow as other comparable parsers. Despite that, the Pretty Diff approach is substantially faster and simpler than attempting to apply any such advanced analysis as a separate process outside the parser.

The Pretty Diff approach produces output in the form of parallel arrays instead of an AST format. The idea is that an AST can be created from a parse table approach provided one of the categories of data is structure and placement information, but a parse table cannot be created from an AST without running another parsing operation. The parse table approach also allows for sorting and analysis by selectively targeting various areas and data types without consideration for the output as a whole.
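
To make the parallel array idea concrete, here is a minimal sketch (illustrative only, not Pretty Diff's actual output) of how the earlier markup example could be represented, where a single index describes one token across every array:

    // a minimal sketch of parallel arrays, mirroring the parse table shown earlier
    var token = ["<a>", "<b>", "sample text", "</b>", "</a>"],
        types = ["start", "start", "content", "end", "end"],
        begin = [0, 0, 1, 1, 1];
    // token[2], types[2], and begin[2] together describe one record:
    // the text "sample text", of type "content", whose containing structure
    // starts at token index 1 (the <b> tag)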

How to...

Setup and get started

One of the primary reasons I prefer to write in JavaScript is because lambda expressions are a native quality not hidden behind a convention. To get started I prefer to write a function against a single global reference that contains everything I need.

    var myParser = function (options) {
        var token = [],
            types = [],
            parse = function () {};
    };

In the above code sample we have a single global variable, myParser, that contains some declared variables. The references token and types will store data, while parse is a child function that will hold all the lexer/parser instructions. Declaring token and types outside the parse function keeps the data available beyond the parse step while maintaining separation of concerns, without having to pass things around. Now let's jump into the parse child function, where we are going to write a simple lexer/parser.

Writing a lexer

When writing a lexer I prefer to be as explicit as possible. The more explicit the code the lower the risk of unexpected results. This means the code will follow an imperative coding style. First, let's convert the input into a string, if it isn't a string already, and then into an array that we can loop through.

    parse = function () {
        // convert the input to a string, then to an array of characters
        var data = options.input
                .toString()
                .split(""),
            len = data.length,
            a = 0;
        for (a = 0; a < len; a += 1) {}
    };

Converting the input into an array is not required, but it makes things much easier and faster to manipulate. Looping through a binary buffer, for instance, can be more challenging to think through, as humans don't read binary or hexadecimal as quickly as they read code and strings. Arrays are also substantially faster and more expressive to evaluate than large strings.

Using a for loop to iterate through an array is now considered an anti-pattern in JavaScript, since the ECMAScript 5 version of the language provides a forEach method. I deliberately choose a for loop because there are times when it is necessary to jump around to various indexes or iterate across the input differently.

Now that we have the basic lexer written, let's evaluate some syntax. We are going to evaluate a C language styled block comment. A block comment begins with /* and ends with */. To impose additional separation of concerns we will put this rule behind another child function. First, let's create the rule and the child function:

    parse = function () {
        var data = options.input
                .toString()
                .split(""),
            len = data.length,
            a = 0,
            commentBlock = function () {};
        for (a = 0; a < len; a += 1) {
            // a block comment starts when "/" is immediately followed by "*"
            if (data[a] === "/" && data[a + 1] === "*") {
                commentBlock();
            }
        }
    };

Now let's define the lexical analysis for a block comment:

    commentBlock = function () {
        var comment = [],
            b = 0;
        // gather characters from the current position until the closing "*/"
        for (b = a; b < len; b += 1) {
            comment.push(data[b]);
            if (data[b] === "/" && data[b - 1] === "*") {
                break;
            }
        }
        a = b;
        token.push(comment.join(""));
        types.push("comment-block");
    };

A couple of things happened. We created a new reference, b, as a separate iterator. In this case the secondary iterator isn't strictly needed and is present only to show that additional iterators can be used. Additional iterators allow the freedom to traverse the data independently of the primary iterator a. If you choose to create an additional iterator, be sure to reassign the value of the primary iterator before exiting the current function to avoid duplicated effort.

At the end of this function we push a token, the entire block comment, and a type. In this case we know the data type simply by knowing the syntax. This isn't always the case. We won't know whether a word is a language keyword or a user-defined reference without some additional effort. It is my opinion that overall efficiency increases by supplying these additional evaluations directly in the parser. The additional tasks will always make the parser slower, which is wasted effort if the results aren't needed. Unfortunately, it is extremely difficult to know at parse time what is or isn't needed, and performing these additional evaluations later, after the parsing completes, is far more expensive still.
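
As an aside, and separate from the running example, here is one hedged sketch of such an additional evaluation: a word rule that gathers letters and then classifies the result against a keyword list. The rule name and the keyword subset are hypothetical, and the surrounding variables (data, len, a, token, types) are assumed to be in scope as in the commentBlock example:

    // a minimal sketch: classify a gathered word as a keyword or a reference
    wordRule = function () {
        var word = [],
            keywords = ["var", "function", "return", "if", "else"], // hypothetical subset
            b = 0;
        for (b = a; b < len; b += 1) {
            if ((/[a-zA-Z_$]/).test(data[b]) === false) {
                break;
            }
            word.push(data[b]);
        }
        a = b - 1;
        token.push(word.join(""));
        if (keywords.indexOf(word.join("")) > -1) {
            types.push("keyword");
        } else {
            types.push("reference");
        }
    };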

Here is what the combined code looks like:

    var myParser = function (options) {
        var token = [],
            types = [],
            parse = function () {
                var data = options.input
                        .toString()
                        .split(""),
                    len = data.length,
                    a = 0,
                    commentBlock = function () {
                        var comment = [],
                            b = 0;
                        for (b = a; b < len; b += 1) {
                            comment.push(data[b]);
                            if (data[b] === "/" && data[b - 1] === "*") {
                                break;
                            }
                        }
                        a = b;
                        token.push(comment.join(""));
                        types.push("comment-block");
                    };
                for (a = 0; a < len; a += 1) {
                    if (data[a] === "/" && data[a + 1] === "*") {
                        commentBlock();
                    }
                }
            };
        parse();
        return {token: token, types: types};
    };
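
As a quick check, calling the parser against a small sample (a hypothetical usage, assuming the options argument carries the raw text in an input property as shown above) yields matching parallel arrays:

    // hypothetical usage of the combined example
    var output = myParser({input: "var a = 1; /* a block comment */ var b = 2;"});
    // output.token -> ["/* a block comment */"]
    // output.types -> ["comment-block"]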

Define structures

In addition to describing tokens by data type it is also frequently necessary to describe them by the current structure in the application code. In markup based languages there are parent elements and child elements, so a given element's structure can be described by a parent element's tag name. In many programming languages this distinction is less clear.

For languages that use a C language based syntax, an easy way to think about structure is in terms of matching start and end delimiters: (, {, [, < and ), }, ], > respectively. The C language syntax family includes JavaScript, Java, C#, TypeScript, and many other languages. A structure is defined by the delimiter character and by context rules in a given language. For instance, [ and ] typically describe a list or array structure, but in C# they can also describe a data type attribute given the context of [, a word, optional white space, and : (a colon). The easiest way to think about it is that you always know where you are in parsing the code, what has already been parsed, and have some idea of what is coming. Always keeping that simple rule in mind helps prevent elaborate conventions from distracting from the simple idea of parsing structures with precision, without guessing or rework.
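
One common way to track such structures while lexing (a minimal sketch of the general idea, not how the Pretty Diff parsers are organized) is to push a descriptive name onto a stack when a start delimiter appears and pop it when the matching end delimiter appears:

    // a minimal sketch: track the current structure with a stack of delimiter names
    var structure = [],
        starts = {"(": "paren", "[": "array", "{": "block"},
        ends = {")": "paren", "]": "array", "}": "block"};
    if (starts[data[a]] !== undefined) {
        structure.push(starts[data[a]]);
    } else if (ends[data[a]] !== undefined && structure[structure.length - 1] === ends[data[a]]) {
        structure.pop();
    }
    // structure[structure.length - 1] names the structure containing the current token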

Some structures allow context rules and conventions not permitted in other structures. It is helpful to describe structures during the initial lexing and parsing process so that decisions can be made later without introducing an additional parsing step. For example, it may be desirable to identify references, values, and their assignment with distinction, which can vary in syntax by the containing structure even within a single language.

Pretty Diff uses two arrays to define structures in its parsers. This keeps the approach to structure consistent between the different parsers and allows descriptions of hierarchy without need for an AST. The Pretty Diff parsers use arrays named depth and daddy (in the case of markup) to describe the name of the current structure and an array named begin that defines the token index where this structure started. Common values, in C based syntax languages, for the depth array include but are not limited to: array, function, method, class, map, object, for, if, switch. In the markup parser the depth array values take the tag name of the parent element. From these arrays an AST or a DOM tree could be formed without additional parsing considerations.

Use the parsed output

Once the parser completes the parallel arrays will be populated with data. It is important that these parallel arrays always have the same number of indexes and each index describes the data at that index of the token array. Knowing this can save a lot of effort debugging failures later.

Since the output is merely a couple of arrays the data can be easily iterated over. The data can also be examined by specific data types through examination of the types array. Pretty Diff includes several additional parallel arrays for describing qualities such as white space, markup attributes, markup parent element, JavaScript code structure, and more. These additional arrays provide data immediately without need for additional parsing steps.

One example use is navigating data structures. The jspretty and markuppretty parsers both produce an array named begin. Use this data to walk up the code structures or walk up the markup element hierarchy. The begin array stores the start index for a given code structure, so begin[a] will return a number less than or equal to "a" indicating the start index of the current structure, where "a" is the index of any parsed token. The number at begin[begin[a] - 1] would then contain the index for the next higher structure's start point. A small loop could traverse this data to create an AST or walk up the DOM.
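
A hedged sketch of that small loop, assuming token and begin are parallel arrays as described above and leaving root-level handling to the conventions of the particular parser, might look like this:

    // a minimal sketch: collect the start token of each enclosing structure
    var walkUp = function (begin, token, a) {
        var ancestry = [],
            index = begin[a];
        while (index > 0) {
            ancestry.push(token[index]);
            index = begin[index - 1];
        }
        return ancestry;
    };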

Extend the parser (in theory)

A well written parser is extensible, which means additional supplemental rules can be added later to allow new features with minimal collision with existing features. Before a parser can be extensible it has to be sturdy and confident in its current target language and conventions. The addition of new rules, features, or even languages dramatically increases the risk of errors and malformed output from the parser. Seek out diverse code examples and integrate them into some form of test automation. Constantly seek to discover input that will break the parser or produce corrupted results. Once a corrupting code sample is discovered ensure it is added to the test automation as a test unit.

A general rule is to define data type names as specifically and precisely as possible in the parser, perhaps even more specifically than the current need demands. When writing a parser all of the edge cases and collisions between various features of the target language likely aren't known yet. Unfortunately, many of these edge cases will be discovered in production once error reports surface from users. This is true even for the most widely used parsers. Extreme care for precision, even when currently unnecessary, helps mitigate some of these unknown errors.

Another general rule is to perceive the code in negatives, as in what can't or shouldn't be done. Thinking about the code's decisions in a more pessimistic view is a key way to limit risk by continuously focusing on predictable operations in the face of unknown and diverse input. This form of thinking reinforces the idea that rules must be specific and must not allow deviations from a desired predictable state of output. As new features are added, think about them in terms of what they shouldn't do. The rules that define a new feature will likely be simple, but ensuring their presence does not open regressions or complicate maintenance with extended complexity demands some special care and discipline.

Since code parsing consists of a series of simple and mundane tasks, with the primary goals of sturdiness and performance plus a distant secondary goal of extensibility, the code tends to be rather boring. Imperative coding styles appear to work best for this kind of work. For the code author the end result is all about ensuring the code is as predictable as possible and contains the fewest instructions possible. Fewer instructions means less code to maintain or test (less developer time) and fewer things for the computer to evaluate (performance).

Extend the parser (in practice)

I recently extended Pretty Diff's JavaScript parser to support the languages Java, C#, and TypeScript. This extension required some additional work in the beautifier, some additional structure descriptions, limiting the use of automatic semicolon insertion, support for new operators, and support for type annotations (type generics). The big challenges were updating the beautifier to support new code structures not present in JavaScript and parsing the type generics. For the purposes of this discussion I will focus only upon adding the type generics feature.

Type generics are angle brace delimited tokens similar in appearance to XML tags. A type generic may contain a single reference or a comma separated list of references, and may contain a spread operator. They may also be nested so that one type generic token contains another. Unlike XML tags, type generics do not contain attributes, end tags, or child nodes.
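
For illustration, here is a hypothetical TypeScript-flavored sample (input to be parsed, not parser code) showing a comma separated list of references with one type generic nested inside another:

    let pairs: Map<string, Array<number>> = new Map<string, Array<number>>();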

The complexity in this enhancement is ensuring it does not cause regression. The JavaScript parser already provides support for the < character as a less than operator and also as the start of an XML tag in the case of React JSX language. The Pretty Diff parsers also don't limit support of XML tags embedded within each other, which is required for the supported JSP template language and could cause additional conflict. This enhancement would allow a third and completely unrelated use for the less than character and it must be added in a way that does not break support for the existing two uses.

This enhancement became easier when I realized that, aside from a TypeScript only edge case, type generics would never exist in the same contexts as JSX's XML tags. This reduced the complexity of the effort so that I only had to tell whether the less than character represents a less than operator or the start of a type generic token.

At this point in the lexing process you know exactly where you are and you can look up the parsed tokens to see what has come before, but you don't yet know what comes next in the code. Keeping this in mind I chose to presume the token would be a type generic element, as opposed to a less than character, until I found reason to believe otherwise. If I did find evidence that this could not be a type generic element I could easily return out of this logic and push a less than operator into the token array, along with appropriate descriptions into the other arrays. This is especially simple since that operator is only one character. I determined the disqualifying evidence would be a character reserved for operators or certain other syntax, or a greater number of > characters than < characters.

The edge case collision I hinted at earlier between TypeScript and React JSX arises when these elements immediately follow a return keyword. In TypeScript you can return a type generic element from a function. In React JSX you can return XML tags from a function. In this case I do not have enough contextual information to tell whether this should be an XML tag or a type generic element. To solve this problem I created a new option to pass into the parser. If the language is assumed to be (or chosen by the user as) TypeScript then an option named typescript is set to true, which prevents the parsing of XML tags from JavaScript-like code.

Enjoyment and fulfillment

Writing parsers isn't fun and exciting like writing the next hot video game. It probably won't unlock the keys to the next major computer science breakthrough. Writing parsers will, however, make you a better programmer. Writing parsers will infuse in you a disciplined, conservative, and humble approach to programming. It is often a thankless effort hidden beneath layers of rules and utilities not visible to the end user. When you do well few people will notice. When you break a user's code many will notice and few will forgive you. Writing parsers is not the path to heroism.

I believe writing parsers has made me a better programmer in ways many other programming disciplines could not. I believe I have learned qualities and decisions that differ from many of my software developer peers because the goals are different.

Application predictability is important. Unpredictable output often results in harm. Helpful tooling and exciting or clever new conventions and features won't make the code more predictable. The only things that make code more predictable are fewer instructions, clearer flow control paths, separation of concerns, and increased testing. The end result is boring stuff and code put on a diet. Dieting in code is often, for many developers, an unacceptable inconvenience, much like dieting in real life.

You are on your own. There might be a team you can turn to for help answering questions, or an adviser to guide you through tough decisions, or there might even be a helpful toolkit to assist with testing. None of those things will solve hard problems for you. You will be forced to make decisions. Many of these decisions could mean catastrophic failure that breaks a user's code. Many times you won't have any idea. There isn't going to be a magical framework out there to make decisions for you. There is no solution in a box to write APIs for you or manage expectations. You are on your own. You have to make hard decisions, and when you break a user's code, because it will happen, you have to own those consequences and seek resolution.

Simplicity is different from easiness. Easy means less effort for the developer writing the code, but simplicity means less effort for everybody else, including your users and the processing computer. These terms could not be more different. The job of a competent developer is to make that distinction and solve for it as directly as possible. Nobody will forgive you when you fail. Your users will, however, recognize the additional effort to compensate for the external failures around you if it means increased application dependability or user fulfillment.

Superior programmers are simply people who iterate faster. There is no way to know if you have solved a given problem until you have tested the solution. Once the problem is solved you won't know if you have created additional problems or regression issues without running additional tests. The pattern here is to attempt a solution and then attempt to verify the solution against various validations. The key to success is not some brilliant vision or the magic ability to avoid failure. The key to success is simply speed. There is clearly more to it than that, like a diversity of reliable test cases, but in the end it always comes down to speed. The best way to think about this is that a developer who works 10 times faster than the next developer is allowed to fail 10 times more frequently before releasing code into production. The way to achieve superior speed is to approach the problems as directly as possible. Each barrier between you and the problem will slow you down. Common barriers are too many tools, abstractions, build processes, compile steps, frameworks, missing documentation, unclear flow control, and so forth. Increased speed pays unforeseeable compounded dividends over time as you learn to further increase your speed with more precise forms of automation, organization, and planning.

Writing parsers is actually enjoyable and fulfilling. Over time you notice that your discipline and development capabilities increase. Your speed dramatically increases as well, which matters because each edge case can quickly consume large portions of your life and mental health. Your ability to perceive large complex problems before they occur and address them with simple elegant solutions is something only you can appreciate, but it applies to many areas of life even outside of software. Given enough practice, your ability to write software architectures seems to arise spontaneously from nothing, when really it is the result of solving many tiny edge cases paired with a modest amount of planning.