Search Multi-language Documents in ast-grep
Introduction
ast-grep works well when searching files written in a single language, but extracting a sub-language embedded inside a document is harder.
However, in modern development, it's common to encounter multi-language documents. These are source files containing code written in multiple different languages. Notable examples include:
- HTML files: These can contain JavaScript inside <script> tags and CSS inside <style> tags.
- JavaScript files: These often contain regular expressions, CSS styles and query languages like GraphQL.
- Ruby files: These can contain snippets of code inside heredoc literals, where the heredoc delimiter often indicates the language.
These multi-language documents can be modeled in terms of a parent syntax tree with one or more injected syntax trees residing inside certain nodes of the parent tree.
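As a rough illustration of this model, here is a minimal Python sketch of a parent tree whose nodes may carry an injected tree. The class names, node kinds, and offsets are hypothetical and chosen only for this example; they are not ast-grep's or tree-sitter's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A syntax tree node covering source[start:end] of its document."""
    kind: str
    start: int
    end: int
    children: list["Node"] = field(default_factory=list)
    # Set only when this node hosts code written in another language.
    injected: "Tree | None" = None


@dataclass
class Tree:
    language: str
    root: Node


def injected_languages(tree: Tree) -> list[str]:
    """Collect the languages of all trees injected below `tree`, depth first."""
    found = []
    stack = [tree.root]
    while stack:
        node = stack.pop()
        if node.injected is not None:
            found.append(node.injected.language)
            found.extend(injected_languages(node.injected))
        stack.extend(reversed(node.children))
    return found


# Hypothetical HTML document with one <style> and one <script> element.
html = Tree("html", Node("document", 0, 100, [
    Node("style_element", 0, 30, injected=Tree("css", Node("stylesheet", 8, 22))),
    Node("script_element", 60, 100, injected=Tree("javascript", Node("program", 69, 90))),
]))
print(injected_languages(html))
```

Keeping the injected tree attached to a node of the parent tree means every injected region still knows where it sits in the host document.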
ast-grep now supports a feature to handle language injection, allowing you to search for code written in one language within documents of another language.
This concept and terminology come from tree-sitter's language injection, which implies you can inject another language into a language document. (BTW, neovim also embraces this terminology.)
Example: Search JS/CSS in the CLI
Let's start with a simple example of searching for JavaScript and CSS within HTML files using ast-grep's command-line interface (CLI). ast-grep has built-in support for searching JavaScript and CSS inside HTML files.
Using sg run: find patterns of CSS in an HTML file
Suppose we have an HTML file like below:
<style>
  h1 { color: red; }
</style>
<h1>
  Hello World!
</h1>
<script>
  alert('hello world!')
</script>
Running this ast-grep command will extract the matching CSS style code out of the HTML file!
sg run -p 'color: $COLOR'
ast-grep outputs this beautiful CLI report.
test.html
2│ h1 { color: red; }
ast-grep works well even if you provide just the pattern without specifying the pattern language!
Using sg scan: find JavaScript in HTML with rule files
You can also use ast-grep's rule file to search injected languages.
For example, we can warn about the use of alert in JavaScript, even if it is inside an HTML file.
id: no-alert
language: JavaScript
severity: warning
rule:
  pattern: alert($MSG)
message: Prefer use appropriate custom UI instead of obtrusive alert call.
The rule above will detect usage of alert in JavaScript. Run the rule via sg scan:
sg scan --rule no-alert.yml
The command leverages built-in behaviors in ast-grep to handle language injection seamlessly. It will produce the following warning message for the HTML file above.
warning[no-alert]: Prefer use appropriate custom UI instead of obtrusive alert call.
┌─ test.html:8:3
│
8 │ alert('hello world!')
│ ^^^^^^^^^^^^^^^^^^^^^
How do language injections work?
ast-grep employs a multi-step process to handle language injections effectively. Here's a detailed breakdown of the workflow:
File Discovery: The CLI first discovers files on the disk via the venerable ignore crate, the same library under ripgrep's hood.
Language Inference: ast-grep infers the language of each discovered file based on file extensions.
Injection Extraction: For documents that contain code written in multiple languages (e.g., HTML with embedded JS), ast-grep extracts the injected language sub-regions. At the moment, ast-grep handles HTML/JS/CSS natively.
Code Matching: ast-grep matches the specified patterns or rules against these regions. Pattern code will be interpreted according to the injected language (e.g. JS/CSS), instead of the parent document language (e.g. HTML).
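The last three steps can be approximated in a few lines of Python. This is only a sketch under loud assumptions: the extension table is made up for the example, regular expressions stand in for both the parser and the ast-grep pattern, and file discovery is left out. ast-grep itself works on syntax trees, not regexes.

```python
import re
from pathlib import Path

# Assumed extension table for this sketch; not ast-grep's real table.
EXTENSIONS = {".html": "html", ".js": "javascript", ".css": "css"}

# Regex stand-ins for the HTML injections. A real parser finds these regions
# in ast-grep; a regex is only good enough for simple, well-formed input.
HTML_INJECTIONS = {
    "javascript": re.compile(r"<script[^>]*>(.*?)</script>", re.S),
    "css": re.compile(r"<style[^>]*>(.*?)</style>", re.S),
}


def infer_language(path):
    """Language inference: pick the document language from the extension."""
    return EXTENSIONS.get(Path(path).suffix)


def extract_regions(host_language, source, wanted):
    """Injection extraction: yield (offset, text) for regions in `wanted`."""
    if host_language == wanted:
        yield 0, source  # the whole file is already in the wanted language
    elif host_language == "html" and wanted in HTML_INJECTIONS:
        for m in HTML_INJECTIONS[wanted].finditer(source):
            yield m.start(1), m.group(1)


def search(path, source, wanted, needle):
    """Code matching: `needle` is a regex here, not an ast-grep pattern."""
    hits = []
    for offset, text in extract_regions(infer_language(path), source, wanted):
        for m in re.finditer(needle, text):
            pos = offset + m.start()  # position in the host document
            hits.append((source.count("\n", 0, pos) + 1, m.group(0)))
    return hits


page = "<style>\n  h1 { color: red; }\n</style>\n<script>\n  alert('hi')\n</script>\n"
print(search("test.html", page, "css", r"color: \w+"))
print(search("test.html", page, "javascript", r"alert\(.*?\)"))
```

Note that matches are reported with line numbers of the host HTML file, because each region remembers its offset in the parent document.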
Customize Language Injection: styled-components in JavaScript
You can customize language injection via the sgconfig.yml configuration file. This allows you to specify how ast-grep handles multi-language documents based on your specific needs, without modifying ast-grep's built-in behaviors.
Let's see an example of searching CSS code in JavaScript. styled-components is a library for styling React applications using CSS-in-JS. It allows you to write CSS directly within your JavaScript via tagged template literals, creating styled elements as React components.
The example will configure ast-grep to detect styled-components' CSS.
Injection Configuration
You can add the languageInjections section in the project configuration file sgconfig.yml.
languageInjections:
- hostLanguage: js
  rule:
    pattern: styled.$TAG`$CONTENT`
  injected: css
Let's break the configuration down.
- hostLanguage: Specifies the main language of the document. In this example, it is set to js (JavaScript).
- rule: Defines the ast-grep rule to identify the injected language region within the host language.
  - pattern: The pattern matches styled components syntax where styled is followed by a tag (e.g., button, div) and a template literal containing CSS.
  - The rule should have a meta variable $CONTENT to specify the subregion of injected language. In this case, it is the content inside the template string.
- injected: Specifies the injected language within the identified regions. In this case, it is css.
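To make the role of $CONTENT concrete, here is a small Python sketch that mimics the configuration above. It is an approximation only: a regular expression with named groups stands in for the pattern and its meta variables, whereas ast-grep matches syntax nodes. The function name and sample code are hypothetical.

```python
import re

# Regex stand-in for the pattern styled.$TAG`$CONTENT`. The named groups play
# the role of the meta variables; this is not how ast-grep matches code.
STYLED = re.compile(r"styled\.(?P<TAG>\w+)`(?P<CONTENT>[^`]*)`")


def injected_regions(source, injected="css"):
    """Return (language, tag, start, end) for each styled-components template."""
    return [
        (injected, m.group("TAG"), m.start("CONTENT"), m.end("CONTENT"))
        for m in STYLED.finditer(source)
    ]


code = "const Title = styled.h1`color: red;`\nconst Box = styled.div`margin: 0;`\n"
for language, tag, start, end in injected_regions(code):
    # Only the span captured by CONTENT is handed to the injected language.
    print(language, tag, repr(code[start:end]))
```

Only the text between the backticks is treated as CSS; the surrounding styled.h1 call stays part of the JavaScript host.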
Example Match
Consider a JSX file using styled components:
import styled from 'styled-components';

const Button = styled.button`
  background: red;
  color: white;
  padding: 10px 20px;
  border-radius: 3px;
`
export default function App() {
  return <Button>Click Me</Button>
}
With the above languageInjections configuration, ast-grep will:
- Identify the styled.button block as a CSS region.
- Extract the CSS code inside the template literal.
- Apply any CSS-specific pattern searches within this extracted region.
You can search the CSS inside JavaScript in the project configuration folder using this command:
sg -p 'background: $COLOR' -C 2
It will produce the match result:
styled.js
2│
3│const Button = styled.button`
4│ background: red;
5│ color: white;
6│ padding: 10px 20px;
Using Custom Language with Injection
Finally, let's look at an example of searching for GraphQL within JavaScript files. This demonstrates ast-grep's flexibility in handling custom language injections.
Define graphql custom language in sgconfig.yml.
First, we need to register graphql as a custom language in ast-grep. See the custom language reference for more details.
customLanguages:
  graphql:
    libraryPath: graphql.so # the graphql tree-sitter parser dynamic library
    extensions: [graphql] # graphql file extension
    expandoChar: $ # see reference above for explanation
Define graphql injection in sgconfig.yml.
Next, we need to specify which regions of a JavaScript file should be parsed as GraphQL. This is similar to the styled-components example above.
languageInjections:
- hostLanguage: js
  rule:
    pattern: graphql`$CONTENT`
  injected: graphql
Search GraphQL in JavaScript
Suppose we have this JavaScript file from Relay, a GraphQL client framework.
import React from "react"
import { graphql } from "react-relay"

const artistsQuery = graphql`
  query ArtistQuery($artistID: String!) {
    artist(id: $artistID) {
      name
      ...ArtistDescription_artist
    }
  }
`
We can search the GraphQL fragment via this --inline-rules scan.
sg scan --inline-rules="{id: test, language: graphql, rule: {kind: fragment_spread}}"
Output
help[test]:
┌─ relay.js:8:7
│
8 │ ...ArtistDescription_artist
│ ^^^^^^^^^^^^^^^^^^^^^^^^^^^
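The reported location, 8:7, is a position in the host JavaScript file rather than inside the GraphQL string. The Python sketch below shows how such a position could be derived. It rests on assumptions: the file layout (a blank line after the imports, two-space indentation) is inferred from the 8:7 location, and regular expressions stand in for the graphql`$CONTENT` pattern and the fragment_spread node kind.

```python
import re

# Assumed layout of relay.js: a blank line after the imports and two-space
# indentation inside the query, chosen to be consistent with the 8:7 location.
RELAY_JS = '''import React from "react"
import { graphql } from "react-relay"

const artistsQuery = graphql`
  query ArtistQuery($artistID: String!) {
    artist(id: $artistID) {
      name
      ...ArtistDescription_artist
    }
  }
`
'''


def line_col(source, offset):
    """Convert a 0-based character offset to a 1-based (line, column) pair."""
    line_start = source.rfind("\n", 0, offset) + 1
    return source.count("\n", 0, offset) + 1, offset - line_start + 1


def fragment_spreads(source):
    """Find `...Name` spreads inside graphql`...` templates (regex stand-in)."""
    hits = []
    for region in re.finditer(r"graphql`([^`]*)`", source):
        for m in re.finditer(r"\.\.\.\w+", region.group(1)):
            offset = region.start(1) + m.start()  # shift into host coordinates
            hits.append((*line_col(source, offset), m.group(0)))
    return hits


print(fragment_spreads(RELAY_JS))
```

Adding the region's start offset before converting to a line and column is what keeps the report aligned with the file you actually open in your editor.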
More Possibilities to be Unlocked...
By following these steps, you can effectively use ast-grep to search and analyze code across multiple languages within the same document, enhancing your ability to manage and understand complex codebases.
This feature extends to various frameworks like Vue and Svelte, enables searching for SQL in React server actions, and supports new patterns like Vue-Vine.
Hope you enjoy the feature! Happy ast-grepping!