Beware of char offset when iterating over a string (Has Fix)

Description

A common pitfall in Rust is that a character offset is not the same as a byte offset when iterating through a string. A Rust string is stored as a UTF-8 byte array, and UTF-8 is a variable-length encoding: a single char takes between one and four bytes.

chars().enumerate() will yield the character offset, while char_indices() will yield the byte offset.

rs
let yes = "y̆es";
let mut char_indices = yes.char_indices();
assert_eq!(Some((0, 'y')), char_indices.next()); // not (0, 'y̆')
assert_eq!(Some((1, '\u{0306}')), char_indices.next());
// note the 3 here - the previous character (U+0306) took up two bytes
assert_eq!(Some((3, 'e')), char_indices.next());
assert_eq!(Some((4, 's')), char_indices.next());

Depending on your use case, you may want to use char_indices() instead of chars().enumerate(). String slicing such as &s[i..] expects byte offsets, which is what char_indices() yields.
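To make the difference concrete, here is a minimal sketch that collects the indices produced by both iterators for the same input. The helper names `char_offsets` and `byte_offsets` are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: pairs each char with its character offset,
/// i.e. the running count produced by `chars().enumerate()`.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.chars().enumerate().collect()
}

/// Hypothetical helper: pairs each char with its byte offset,
/// i.e. the position in the UTF-8 buffer where that char starts.
fn byte_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For "y\u{0306}es", the two helpers agree on the first two entries and diverge from 'e' onward, because the combining breve U+0306 occupies two bytes.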

Pattern

shell
ast-grep -p '$A.chars().enumerate()' \
   -r '$A.char_indices()' \
   -l rs

Example

rs
for (i, char) in source.chars().enumerate() {
    println!("Boshen is angry :)");
}

Diff

rs
- for (i, char) in source.chars().enumerate() {
+ for (i, char) in source.char_indices() {
    println!("Boshen is angry :)");
}
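The example above never uses `i`, so the rewrite has no visible effect there. The following sketch shows a case where the choice matters, assuming the index is used to slice the string. The function name `tail_from` is hypothetical and not part of the rule.

```rust
/// Hypothetical helper: returns the part of `source` that starts at the
/// first occurrence of `target`, or `None` if `target` does not occur.
fn tail_from(source: &str, target: char) -> Option<&str> {
    for (i, c) in source.char_indices() {
        if c == target {
            // `i` is a byte offset on a char boundary, so slicing is valid.
            return Some(&source[i..]);
        }
    }
    None
}
```

With `chars().enumerate()` in place of `char_indices()`, looking up 'e' in "y\u{0306}es" would produce the index 2, which falls inside the two-byte U+0306, and `&source[2..]` would panic because that byte is not a char boundary.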

Contributed by

Inspired by Boshen's Tweet
