Skip to content

Deep Dive into ast-grep's Pattern

One key highlight of ast-grep is its pattern.

Pattern is a convenient way to write and read expressions that describe syntax trees. It resembles code, but with some special syntax and semantics that allow you to match parts of a syntax tree based on their structure, type or content.

While ast-grep's pattern is easy to learn, it is hard to master. It requires you to know the Tree-sitter grammar and meaning of the target language, as well as the rules and conventions of ast-grep.

In this guide, we will help you grasp the core concepts of ast-grep's pattern that are common to all languages. We will also show you how to leverage the full power of ast-grep pattern for your own usage.

What is Tree-sitter?

ast-grep is using Tree-sitter as its underlying parsing framework due to its popularity, performance and robustness.

Tree-sitter is a tool that generates parsers and provides an incremental parsing library.

A parser is a program that takes a source code file as input and produces a tree structure that describes the organization of the code. (Contrary to ast-grep's name, the tree structure is not abstract syntax tree, as we will see later).

Writing good parsers for various programming languages is a laborious task, if even possible, for one single project like ast-grep. Fortunately, Tree-sitter is a venerable and popular tool that has a wide community support. Many mainstream languages such as C, Java, JavaScript, Python, Rust, and more are supported by Tree-sitter. Using Tree-sitter as ast-grep's underlying parsing library allows it to work with any language that has a well-maintained grammar available.

Another perk of Tree-sitter is its incremental nature. An incremental parser is a parser that can update the syntax tree efficiently when the source code file is edited, without having to re-parse the entire file. It can run very fast on every code changes in ast-greps' interactive editing.

Finally, Tree-sitter also handles syntax errors gracefully, and it can parse multiple languages within the same file. This makes pattern code more robust to parse and easier to write. In future we can also support multi-language source code like Vue.

Textual vs Structural

When you use ast-grep to search for patterns in source code, you need to understand the difference between textual and structural matching.

Source code input is text, a sequence of characters that follows certain syntax rules. You can use common search tools like silver-searcher or ripgrep to search for text patterns in source code.

However, ast-grep does not match patterns against the text directly. Instead, it parses the text into a tree structure that represents the syntax of the code. This allows ast-grep to match patterns based on the semantic meaning of the code, not just its surface appearance. This is known as structural search, which searches for code with a specific structure, not just a specific text.

Therefore, the patterns you write must also be of valid syntax that can be compared with the code tree.

Textual Search in ast-grep

Though pattern structurally matches code, you can use the atomic rule regex to matches the text of a node by specifying a regular expression. This way, it is possible to combine textual and structural matching in ast-grep.

AST vs CST

To represent the syntax and semantics of code, we have two types of tree structures: AST and CST.

AST stands for Abstract Syntax Tree, which is a simplified representation of the code that omits some details like punctuation and whitespaces. CST stands for Concrete Syntax Tree, which is a more faithful representation of the code that includes all the details.

Tree-sitter is a library that can parse code into CSTs for many programming languages. Thusly, ast-grep, contrary to its name, searches and rewrites code based on CST patterns, instead of AST.

Let's walk through an example to see why CST makes more sense. Consider the JavaScript snippet 1 + 1. Its AST representation looks like this:

binary_expression
  number
  number

An astute reader should notice the important operator + is not encoded in AST. Meanwhile, its CST faithfully represents all critical information.

binary_expression
  number
  +                # note this + operator!
  number

You might wonder if using CST will make trivial whitespaces affect your search results. Fortunately, ast-grep uses a smart matching algorithm that can skip trivial nodes in CST when appropriate, which saves you a lot of trouble.

Named vs Unnamed

It is possible to convert CST to AST if we don't care about punctuation and whitespaces. Tree-sitter has two types of nodes: named nodes and unnamed nodes(anonymous nodes).

The more important named nodes are defined with a regular name in the grammar rules, such as binary_expression or identifier. The less important unnamed nodes are defined with literal strings such as "," or "+".

Named nodes are more important for understanding the code's structure and meaning, while unnamed nodes are less important and can be sometimes skipped by ast-grep's matching algorithms.

The following example, adapted from Tree-sitter's official guide, shows the difference in grammar definition.

javascript
rules: {
  // named nodes are defined with the format `kind: parseRule`
  identifier: $ => /[a-z]+/,
  // binary_expression is also a named node,
  // the `+` operator is defined with a string literal, so it is an unnamed node
  binary_expression: $ => seq($.identifier, '+', $.identifier),
                                          // ↑ unnamed node
}

Practically, named nodes have a property called kind that indicates their names. You can use ast-grep's atomic rule kind to find the specific AST node. Playground link for the example below.

yaml
rule:
  kind: binary_expression
# matches `1 + 1`

Further more, ast-grep's meta variable matches only named nodes by default. return $A matches only the first statement below. Playground link.

js
return 123 // `123` is named `number` and matched.
return;    // `;` is unnamed and not matched.

We can use double dollar $$VAR to include unnamed nodes in the pattern result. return $$A will match both statement above. Playground link.

Kind vs Field

Sometimes, using kind alone is not enough to find the nodes we want. A node may have several children with the same kind, but different roles in the code. For example, in JavaScript, an object may have multiple keys and values, all with the string kind.

To distinguish them, we can use field to specify the relation between a node and its parent. In ast-grep, field can be specified in two relational rules: has and inside.

has and inside accept a special configuration item called field. The value of field is the field name of the parent-child relation. For example, the key-value pair in JavaScript object has two children: one with field key and the other with field value. We can use this rule to match the key node of kind string.

yaml
rule:
  kind: string
  inside:
    field: key
    kind: pair

field can help us to narrow down the search scope and make the pattern more precise.

We can also use has to rewrite the rule above, searching the key-value pair with string key. Playground link.

yaml
rule:
  kind: pair
  has:
    field: key
    kind: string

Key Difference between kind and field

  • kind is the property of the node itself. Only named nodes have kinds.
  • field is the property of the relation between parent and child. Unnamed nodes can also have fields.

It might be confusing to new users that a node has both kind and field. kind belongs to the node itself, represented by blue text in ast-grep's playground. Child node has a field only relative to its parent, and vice-versa. field is represented by dark yellow text in the playground. Since field is a property of a node relation, unnamed nodes can also have field. For example, the + in the binary expression 1 + 1 has the field operator.

Significant vs Trivial

ast-grep goes further beyond Tree-sitter. It has a concept about the "significance" of a node.

  • If a node is a named node or has a field relative to its parent, it is a significant node.
  • Otherwise, the node is a trivial node.

Even significance is not enough

Most Tree-sitter languages do not encode all critical semantics in AST, the tree with named nodes only. Even significant nodes are not sufficient to represent the meaning of code. We have to preserve some trivial nodes for precise matching.

Tree-sitter parsers do not encode all semantics with named nodes. For example, class A { get method() {} } and class A { method() {} } are equivalent in Tree-sitter's AST. The critical token get is not named nor has a field name. It is a trivial node!

If you do not care about if the method is a getter method, a static method or an instance method, you can use class $A { method() {} } to match all the three methods at once. Alternatively, you can fully spell out the method modifier if you need to tell getter method from normal method.

Summary

Thank you for reading until here! There are many concepts in this article. Let's summarize them in one paragraph.

ast-grep uses Tree-sitter to parse textual source code into a detailed tree structure called CST. We can get AST from CST by only keeping named nodes, which have kinds. To search nodes in a syntax tree, you can use both node kind and node field, which is a special role of a child node relative to its parent node. A node with either a kind or a field is a significant node.

Made with ❤️ with Rust