Deep Dive into ast-grep's Pattern Syntax
ast-grep's pattern syntax is easy to learn but hard to master. Getting started takes little effort, yet understanding its nuances can greatly enhance your code searching capabilities.
This article aims to provide you with a deep understanding of how ast-grep's patterns are parsed, created, and effectively used in code matching.
Steps to Create a Pattern
Parsing a pattern in ast-grep involves these key steps:
- Preprocess the pattern text, e.g. replacing $ with expando_char.
- Parse the preprocessed pattern text into AST.
- Extract effective AST nodes based on builtin heuristics or a user-provided selector.
- Detect AST nodes with wildcard text and convert them into meta variables.
Let's dive deep into each of these steps.
Pattern is AST based
First and foremost, pattern is AST based.
ast-grep's pattern code will be converted into the Abstract Syntax Tree (AST) format, which is a tree structure that represents the code snippet you want to match.
Therefore, a pattern cannot be arbitrary text. It must be valid code with meta variables as placeholders. If the pattern cannot be parsed by the underlying parser, tree-sitter, ast-grep won't be able to find valid matches for it.
There are several common pitfalls to avoid when creating patterns.
Invalid Pattern Code
ast-grep pattern must be parsable valid code. While this may seem obvious, newcomers sometimes make mistakes when creating patterns with meta-variables.
A meta-variable is usually parsed as an identifier in most languages.
When using meta-variables, make sure they are placed in a valid context and not used as a keyword or an operator. For example, you may want to use $OP to match binary expressions like a + b. The pattern below will not work because parsers see it as three consecutive identifiers separated by spaces.
$LEFT $OP $RIGHT
You can instead use the atomic rule kind: binary_expression to match binary expressions.
Similarly, in JavaScript you may want to match object accessors like { get foo() {}, set bar() { } }. The pattern below will not work because a meta-variable is not parsed as the keywords get and set.
obj = { $KIND foo() { } }
Again, a rule is more suitable for this scenario.
rule:
  kind: method_definition
  regex: '^(get|set)\s'
Incomplete Pattern Code
It is very common and even tempting to write incomplete code snippets in patterns. However, incomplete code does not always work.
Consider the following JSON code snippet as pattern:
"a": 123
While the intention here is clearly to match a key-value pair, tree-sitter does not treat it as valid JSON code because it is missing the enclosing {}. Consequently ast-grep will not be able to parse it.
The solution here is to use a pattern object to provide a complete code snippet.
pattern:
  context: '{ "a": 123 }'
  selector: pair
You can use either the pattern tab or the rule tab of the ast-grep playground to verify it.
Incomplete pattern code sometimes works fine due to error-tolerance.
For a better user experience, ast-grep parses pattern code as leniently as possible. ast-grep's parsers will try to recover from parsing errors and ignore missing language constructs.
For example, the pattern foo(bar) in Java cannot be parsed as valid code. However, ast-grep recovers from the parsing error, ignores the missing semicolon and treats it as a method call. So the pattern still works.
Ambiguous Pattern Code
Just as programming languages have ambiguous grammar, so ast-grep patterns can be ambiguous.
Let's consider the JavaScript code snippet below:
a: 123
It can be interpreted as an object key-value pair or a labeled statement.
Without other hints, ast-grep will parse it as a labeled statement by default. To match an object key-value pair, we need to provide more context by using a pattern object.
pattern:
  context: '{ a: 123 }'
  selector: pair
Other examples of ambiguous patterns include:
- Match function call in Golang and C
- Match class field in JavaScript
How ast-grep Handles Pattern Code?
ast-grep makes a best effort to parse pattern code for a better user experience.
Here are some strategies ast-grep uses to handle code snippets:
- Replace $ with expando_char: some languages use $ as a special character, so ast-grep replaces it with expando_char in order to make the pattern code parsable.
- Ignore missing nodes: ast-grep will ignore missing nodes in the pattern, like a trailing semicolon in Java/C/C++.
- Treat root error as normal node: if the parser error has no siblings, ast-grep will treat it as a normal node.
If all of the above fail, users should provide more code via a pattern object.
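The first strategy can be illustrated with a small sketch. This is not ast-grep's implementation: the function name preprocess, the regular expression and the choice of µ as the expando character are all assumptions made for illustration, and the sketch only rewrites dollar signs that look like part of a meta variable.

```python
import re

# Hypothetical expando character (MICRO SIGN). The real value depends on the
# target language; this one is only an assumption for the sketch.
EXPANDO_CHAR = "\u00b5"

# One to three dollar signs followed by an upper-case letter or underscore
# ($A, $$OP, $$$ARGS), or exactly three dollar signs on their own ($$$).
META_VAR_SIGIL = re.compile(r"\${1,3}(?=[A-Z_])|\${3}")


def preprocess(pattern: str, expando: str = EXPANDO_CHAR) -> str:
    """Replace meta variable sigils so the text parses as ordinary identifiers."""
    return META_VAR_SIGIL.sub(lambda m: expando * len(m.group(0)), pattern)
```

With this sketch, preprocess("print($A)", expando="#") gives "print(#A)", while a dollar sign that does not start a meta variable, as in '$5', is left untouched.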
Pattern Error Recovery is useful, but not guaranteed
ast-grep's recovery mechanism heavily depends on tree-sitter's behavior. We cannot guarantee that invalid patterns will be parsed consistently between different versions, so using an invalid pattern may lead to unexpected results after upgrading ast-grep.
When in doubt, always use valid code snippets with pattern object.
Extract Effective AST for Pattern
After parsing the pattern code, ast-grep needs to extract AST nodes to make the actual pattern.
Normally, the tree that tree-sitter generates for a code snippet is a full AST. Yet it is unlikely that the entire tree will be used as a pattern. The code 123 will produce a tree like program -> expression_statement -> number in many languages. But we want to match a number literal in the code, not a program containing just a number.
ast-grep uses two strategies to extract effective AST nodes that will be used to match code.
Builtin Heuristic
By default, ast-grep extracts the leaf node or the innermost node with more than one child.
This heuristic extracts the most specific node while still keeping all structural information in the pattern. If a node has only one child, it is atomic and cannot be further decomposed. We can safely assume the node contains no structural information for matching. In contrast, a node with more than one child contains a structure that we want to search.
Examples:
- 123 will be extracted as number because it is the leaf node.
program
  expression_statement
    number <--- effective node
See Playground.
- foo(bar) will be extracted as call_expression because it is the innermost node that has more than one child.
program
  expression_statement
    call_expression <--- effective node
      identifier
      arguments
        identifier
See Playground.
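The builtin heuristic can be sketched as a short loop that walks down from the root while the current node has exactly one child. The Node class and the effective_node function below are hypothetical names for a toy tree, not ast-grep's or tree-sitter's API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy AST node: a kind plus an ordered list of children."""
    kind: str
    children: list = field(default_factory=list)


def effective_node(root: Node) -> Node:
    """Skip single-child wrappers; stop at a leaf or a node with several children."""
    node = root
    while len(node.children) == 1:
        node = node.children[0]
    return node


# Toy tree for the pattern `123`
number_tree = Node("program", [Node("expression_statement", [Node("number")])])

# Toy tree for the pattern `foo(bar)`
call_tree = Node("program", [
    Node("expression_statement", [
        Node("call_expression", [
            Node("identifier"),
            Node("arguments", [Node("identifier")]),
        ]),
    ]),
])
```

Applied to the two toy trees, effective_node returns the number node and the call_expression node respectively, mirroring the two examples above.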
User Defined Selector
Sometimes the effective node extracted by the builtin heuristic may not be what you want. You can explicitly specify the node to extract using the selector field in the rule configuration.
For example, you may want to match the whole console.log statement in JavaScript code. The effective node extracted by the builtin heuristic is call_expression, but you want to match the whole expression_statement.
Using console.log($$$) directly will not include the trailing ; in the pattern, see Playground.
console.log("Hello")
console.log("World");
You can use a pattern object to explicitly specify the effective node to be expression_statement. See Playground.
pattern:
  context: console.log($$$)
  selector: expression_statement
Using selector is especially helpful when you are also using relational rules like follows and precedes. You want to match the statement instead of the default inner expression node, and match other statements around it.
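As a rough mental model, a selector replaces the builtin heuristic with a lookup by node kind inside the parsed context. The sketch below assumes a depth-first search that returns the first node of the requested kind. The search order, the Node class and the select function are illustrative assumptions rather than ast-grep's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Toy AST node: a kind plus an ordered list of children."""
    kind: str
    children: List["Node"] = field(default_factory=list)


def select(root: Node, kind: str) -> Optional[Node]:
    """Return the first node of the given kind in depth-first order, if any."""
    if root.kind == kind:
        return root
    for child in root.children:
        found = select(child, kind)
        if found is not None:
            return found
    return None


# Toy tree for the context `console.log($$$)`
context_tree = Node("program", [
    Node("expression_statement", [
        Node("call_expression", [
            Node("member_expression"),
            Node("arguments"),
        ]),
    ]),
])
```

Here select(context_tree, "expression_statement") picks the statement node, whereas the builtin heuristic would have stopped at call_expression. A kind that does not occur in the context, such as pair, yields None.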
TIP
When in doubt, try pattern object first.
Meta Variable Deep Dive
ast-grep's meta variables are also AST based and are detected in the effective nodes extracted from the pattern code.
Meta Variable Detection in Pattern
Not all $ prefixed strings will be detected as meta variables.
Only AST nodes that match meta variable syntax will be detected. If meta variable text is not the only text in the node or it spans multiple nodes, it will not be detected as a meta variable.
Working meta variable examples:
- $A works
  - $A is one single identifier
- $A.$B works
  - $A is identifier inside member_expression
  - $B is the property_identifier
- $A.method($B) works
  - $A is identifier inside member_expression
  - $B is identifier inside arguments
Non working meta variable examples:
- obj.on$EVENT does not work
  - on$EVENT is property_identifier but $EVENT is not the only text
- "Hello $WORLD" does not work
  - $WORLD is inside string_content and is not the only text
- a $OP b does not work
  - the whole pattern does not parse
- $jq does not work
  - meta variable does not accept lower case letters
See all examples in Playground.
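The detection rule can be approximated by checking whether the entire text of a node, and nothing else, has the shape of a meta variable. The sketch below is a simplified model: the regular expression assumes names made of upper-case letters, digits and underscores that do not start with a digit, and is_meta_variable is a hypothetical helper, not part of ast-grep.

```python
import re

# Assumed shape of a single meta variable: `$` followed by an upper-case
# letter or underscore, then upper-case letters, digits or underscores.
META_VARIABLE = re.compile(r"\$[A-Z_][A-Z0-9_]*")


def is_meta_variable(node_text: str) -> bool:
    """True only when the whole node text is a meta variable."""
    return META_VARIABLE.fullmatch(node_text) is not None
```

Under this model, the identifier text $A is detected, while on$EVENT, Hello $WORLD and $jq are not, because the meta variable is either not the only text in the node or uses lower-case letters.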
Matching Unnamed Nodes
A meta variable pattern $META will capture named nodes by default. To capture unnamed nodes, you can use double dollar sign $$VAR.
Let's go back to the binary expression example. It is impossible to match an arbitrary binary expression in one single pattern. But we can combine kind and has to match the operator in binary expressions.
Note, $OP cannot match the operator because the operator is not a named node. We need to use $$OP instead.
rule:
  kind: binary_expression
  has:
    field: operator
    pattern: $$OP
    # pattern: $OP
See the above rule to match all arithmetic expressions in action.
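The difference between $OP and $$OP can be pictured with a toy model in which every node carries a flag saying whether it is named. The Node class and can_capture function are hypothetical and only illustrate the rule described in this section.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy AST node. Operators and punctuation are modeled as unnamed nodes."""
    kind: str
    is_named: bool


def can_capture(meta_var: str, node: Node) -> bool:
    """Return True if a single-node meta variable may capture `node`."""
    if meta_var.startswith("$$$"):
        raise ValueError("multi meta variables are out of scope for this sketch")
    if meta_var.startswith("$$"):
        return True       # $$VAR accepts named and unnamed nodes
    return node.is_named  # $VAR accepts named nodes only


plus = Node("+", is_named=False)
operand = Node("identifier", is_named=True)
```

In this model, can_capture("$OP", plus) is False, while can_capture("$$OP", plus) is True, which is why the rule above needs the double dollar sign.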
How Multi Meta Variables Match Code
Multiple meta variables like $$$ARGS have special matching behavior. They will match multiple nodes in the AST.
$$$ARGS will match multiple nodes in source code when the meta variable starts to match. It will match as many nodes as possible until the first AST node after the meta var in pattern is matched.
The behavior is like non-greedy matching in regex and template string literal infer in TypeScript.
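The non-greedy behavior can be sketched on flat lists of sibling nodes, with plain strings standing in for AST nodes. This is a simplified model without backtracking, and match_children and matches_one are hypothetical names. The real matching algorithm works on trees and is more involved.

```python
from typing import Dict, List, Optional, Union

Captures = Dict[str, Union[str, List[str]]]


def matches_one(item: str, node: str) -> bool:
    """A single meta variable matches any one node; literals must be equal."""
    if item.startswith("$$$"):
        return False
    return item.startswith("$") or item == node


def match_children(pattern: List[str], nodes: List[str]) -> Optional[Captures]:
    """Match sibling nodes against pattern items, left to right."""
    captures: Captures = {}
    pos = 0
    for idx, item in enumerate(pattern):
        if item.startswith("$$$"):
            following = pattern[idx + 1] if idx + 1 < len(pattern) else None
            taken: List[str] = []
            # Consume nodes until the item after the meta variable matches.
            while pos < len(nodes) and not (
                following is not None and matches_one(following, nodes[pos])
            ):
                taken.append(nodes[pos])
                pos += 1
            captures[item[3:]] = taken
        elif pos < len(nodes) and matches_one(item, nodes[pos]):
            if item.startswith("$"):
                captures[item[1:]] = nodes[pos]
            pos += 1
        else:
            return None
    # Every node must be consumed for the match to succeed.
    return captures if pos == len(nodes) else None
```

For the pattern items ["$$$A", "x", "$$$B"] and the nodes ["a", "x", "b", "x"], the sketch stops $$$A at the first x, so A captures only ["a"] and B captures the rest, ["b", "x"].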
Use ast-grep playground to debug pattern
ast-grep playground is a great tool to debug pattern code. The pattern tab and pattern panel can help you visualize the AST tree, effective nodes and meta variables.
In the next article, we will explain how ast-grep's pattern is used to match code, that is, the pattern matching algorithm.