regress/
lib.rs

/*!

# regress - REGex in Rust with EcmaScript Syntax

This crate provides a regular expression engine which targets EcmaScript (aka JavaScript) regular expression syntax.

# Example: test if a string contains a match

```rust
use regress::Regex;
let re = Regex::new(r"\d{4}").unwrap();
let matched = re.find("2020-20-05").is_some();
assert!(matched);
```

# Example: iterating over matches

Here we use a backreference to find doubled characters:

```rust
use regress::Regex;
let re = Regex::new(r"(\w)\1").unwrap();
let text = "Frankly, Miss Piggy, I don't give a hoot!";
for m in re.find_iter(text) {
    println!("{}", &text[m.range()]);
}
// Output: ss
// Output: gg
// Output: oo
```
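
To make the meaning of the `(\w)\1` backreference concrete, here is a hand-written equivalent that uses only the standard library. It is a sketch for illustration, not how regress executes the pattern; `doubled_word_chars` is a hypothetical helper, and it assumes `\w` means the ASCII word characters `[A-Za-z0-9_]`.

```rust
/// Hypothetical helper: collects non-overlapping pairs of identical,
/// adjacent ASCII word characters, scanning left to right.
fn doubled_word_chars(text: &str) -> Vec<&str> {
    let chars: Vec<(usize, char)> = text.char_indices().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i + 1 < chars.len() {
        let (start, a) = chars[i];
        let (_, b) = chars[i + 1];
        // The "capture" is `a`; the "backreference" requires `b` to equal it.
        if a == b && (a.is_ascii_alphanumeric() || a == '_') {
            let end = start + a.len_utf8() + b.len_utf8();
            out.push(&text[start..end]);
            i += 2; // resume after the match, as a match iterator would
        } else {
            i += 1;
        }
    }
    out
}
```

Called on the sentence above, this helper yields the same three substrings as the regex example.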

# Example: using capture groups

Capture groups are available in the `Match` object produced by a successful match.
A capture group is a range of byte indexes into the original string.

```rust
use regress::Regex;
let re = Regex::new(r"(\d{4})").unwrap();
let text = "Today is 2020-20-05";
let m = re.find(text).unwrap();
let group = m.group(1).unwrap();
println!("Year: {}", &text[group]);
// Output: Year: 2020
```
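
Because groups are byte ranges rather than character indexes, they can be used to slice the original `&str` directly, even when the text contains multi-byte characters. The following standard-library sketch shows the idea; `first_digits` is a hypothetical stand-in for a `\d{4}`-style search and is not part of this crate.

```rust
use std::ops::Range;

/// Hypothetical helper: byte range of the first run of `len` consecutive
/// ASCII digits in `text`, or `None` if there is no such run.
fn first_digits(text: &str, len: usize) -> Option<Range<usize>> {
    let bytes = text.as_bytes();
    if len == 0 || bytes.len() < len {
        return None;
    }
    (0..=bytes.len() - len)
        .find(|&i| bytes[i..i + len].iter().all(u8::is_ascii_digit))
        .map(|i| i..i + len)
}
```

For the text `"Année 2020"` the returned range is `7..11`, not `6..10`, because `é` occupies two bytes in UTF-8. Slicing the text with that range gives `"2020"`.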

# Supported Syntax

regress targets ES 2018 syntax. You can refer to the many resources about JavaScript regex syntax.

There are some features which have yet to be implemented:

- Named character classes like `[[:alpha:]]`
- Unicode property escapes like `\p{Sc}`

Note that the parser assumes the `u` (Unicode) flag, because the non-Unicode path is tied to JS's UCS-2 string encoding, and its semantics cannot be usefully expressed in Rust.

# Unicode remarks

regress supports Unicode case folding. For example:

```rust
use regress::Regex;
let re = Regex::with_flags("\u{00B5}", "i").unwrap();
assert!(re.find("\u{03BC}").is_some());
```

Here the U+00B5 (micro sign) was case-insensitively matched against U+03BC (small letter mu).

regress does NOT perform normalization. For example, e-with-acute-accent can be precomposed or decomposed, and these are treated as not equivalent:

```rust
use regress::Regex;
let re = Regex::new("\u{00E9}").unwrap();
assert!(re.find("\u{0065}\u{0301}").is_none());
```

This agrees with JavaScript semantics. Perform any required normalization before regex matching.
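
The reason the two spellings do not match is visible at the code point level: the precomposed form is a single code point, while the decomposed form is two. A small standard-library sketch (the `code_points` helper is hypothetical and only for illustration):

```rust
/// Hypothetical helper: lists the Unicode code points of a string.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

`code_points("\u{00E9}")` is `[0xE9]`, whereas `code_points("\u{0065}\u{0301}")` is `[0x65, 0x301]`. A matcher that compares code points without normalizing sees these as different inputs.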

## ASCII matching

regress has an "ASCII mode" which treats each 8-bit quantity as a separate character.
This may provide improved performance if you do not need Unicode semantics, because it can avoid decoding UTF-8 and has simpler (ASCII-only) case-folding.

Example:

```rust
use regress::Regex;
let re = Regex::with_flags("BC", "i").unwrap();
assert!(re.find_ascii("abcd").is_some());
```
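
To illustrate why byte-wise matching with ASCII-only case folding is cheaper, here is a standard-library sketch of a case-insensitive literal search over bytes. It is not the crate's implementation; `find_ascii_icase` is a hypothetical name.

```rust
/// Hypothetical helper: byte offset of the first ASCII-case-insensitive
/// occurrence of `needle` in `haystack`. No UTF-8 decoding takes place,
/// and only the letters A-Z / a-z are folded.
fn find_ascii_icase(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0); // `windows(0)` would panic, so handle this first
    }
    haystack
        .windows(needle.len())
        .position(|window| window.eq_ignore_ascii_case(needle))
}
```

With this helper, `find_ascii_icase(b"abcd", b"BC")` returns `Some(1)`. Non-ASCII letters are not folded, so the bytes of `"\u{00C9}"` (É) do not match the bytes of `"\u{00E9}"` (é).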

# Comparison to regex crate

regress supports features that regex does not, in particular backreferences and zero-width lookaround assertions.
However, the regex crate provides linear-time matching guarantees, while regress does not. This difference is due
to the architecture: regex uses finite automata while regress uses "classical backtracking."

# Comparison to fancy-regex crate

fancy-regex wraps the regex crate and extends it with PCRE-style syntactic features. regress has more complete support for these features: backreferences may be case-insensitive, and lookbehinds may be arbitrary-width.

# Architecture

regress has a parser, an intermediate representation (IR), an optimizer which acts on the IR, a bytecode emitter, and two bytecode interpreters, referred to as "backends".

The primary interpreter is the "classical backtracking" backend, which uses an explicit backtracking stack, similar to JS implementations. There is also the "PikeVM" pseudo-toy backend, which is mainly used for testing and verification.
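
The idea of a bytecode interpreter with an explicit backtracking stack can be sketched in a few lines. The toy below is an assumption-laden illustration, not regress's actual instruction set or interpreter: `Insn`, `run`, and `a_star_b` are hypothetical names, matching is anchored at the start of the input, and there is no protection against loops that match the empty string.

```rust
/// Hypothetical toy instruction set (not the crate's real bytecode).
#[derive(Clone, Copy)]
enum Insn {
    Byte(u8),            // match one literal byte
    Split(usize, usize), // try the first target, fall back to the second
    Jmp(usize),          // unconditional jump
    Match,               // report success
}

/// Runs `prog` against `input` from offset 0 and returns the end offset
/// of the match. Each stack entry is a saved (program counter, position).
fn run(prog: &[Insn], input: &[u8]) -> Option<usize> {
    let mut stack = vec![(0usize, 0usize)];
    while let Some((mut pc, mut pos)) = stack.pop() {
        loop {
            match prog[pc] {
                Insn::Byte(b) => {
                    if input.get(pos) == Some(&b) {
                        pc += 1;
                        pos += 1;
                    } else {
                        break; // this path failed: backtrack
                    }
                }
                Insn::Split(first, second) => {
                    stack.push((second, pos)); // remember the alternative
                    pc = first;
                }
                Insn::Jmp(target) => pc = target,
                Insn::Match => return Some(pos),
            }
        }
    }
    None
}

/// Hand-assembled program for the pattern `a*b`.
fn a_star_b() -> Vec<Insn> {
    vec![
        Insn::Split(1, 3),
        Insn::Byte(b'a'),
        Insn::Jmp(0),
        Insn::Byte(b'b'),
        Insn::Match,
    ]
}
```

Running `a_star_b()` on `b"aab"` returns `Some(3)`: the greedy loop consumes both `a` bytes, the third attempt to match `a` fails, and the interpreter pops the most recently saved alternative to match `b`. Because failed paths are retried one by one, the running time depends on how many alternatives pile up, which is why backtracking engines offer no linear-time guarantee.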

# Crate features

- **utf16**. When enabled, additional APIs are made available that allow matching text formatted in UTF-16 and UCS-2 (`&[u16]`) without going through a conversion to and from UTF-8 (`&str`) first. This is particularly useful when interacting with and/or (re)implementing existing systems that use those encodings, such as JavaScript, Windows, and the JVM.
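
For context, the conversion that the `utf16` feature lets you skip looks roughly like this when done with the standard library alone. The `to_utf16` helper is hypothetical and is not one of the crate's APIs.

```rust
/// Hypothetical helper: re-encodes UTF-8 text as UTF-16 code units.
/// This allocates and walks the whole string, which is the cost that
/// matching on `&[u16]` directly avoids.
fn to_utf16(text: &str) -> Vec<u16> {
    text.encode_utf16().collect()
}
```

Code points outside the Basic Multilingual Plane become surrogate pairs: `to_utf16("a\u{1F600}")` is `[0x0061, 0xD83D, 0xDE00]`. UCS-2 treats each of those 16-bit units as an independent character.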

*/

#![cfg_attr(not(feature = "std"), no_std)]
#![warn(clippy::all)]
#![allow(clippy::upper_case_acronyms, clippy::match_like_matches_macro)]
// Clippy's manual_range_contains suggestion produces worse codegen.
#![allow(clippy::manual_range_contains)]

#[cfg(not(feature = "std"))]
#[macro_use]
extern crate alloc;

pub use crate::api::*;

#[macro_use]
mod util;

mod api;
mod bytesearch;
mod charclasses;
mod classicalbacktrack;
mod codepointset;
mod cursor;
mod emit;
mod exec;
mod indexing;
mod insn;
mod ir;
mod matchers;
mod optimizer;
mod parse;
mod position;
mod scm;
mod startpredicate;
mod types;
mod unicode;
mod unicodetables;

#[cfg(feature = "backend-pikevm")]
mod pikevm;
152
153#[cfg(feature = "backend-pikevm")]
154mod pikevm;