1.12.2 (2025-10-13)
===================
This release fixes a `cargo doc` breakage on nightly when `--cfg docsrs` is enabled. This caused documentation to fail to build on docs.rs.

Bug fixes:

* [BUG #1305](https://github.com/rust-lang/regex/issues/1305): Switches the `doc_auto_cfg` feature to `doc_cfg` on nightly for docs.rs builds.


1.12.1 (2025-10-10)
===================
This release fixes a bug in the new `regex::Captures::get_match` API introduced in `1.12.0`. There was an oversight with the lifetime parameter for the `Match` returned. This is technically a breaking change, but given that it was caught almost immediately and I've yanked the `1.12.0` release, I think this is fine.


1.12.0 (2025-10-10)
===================
This release contains a smattering of bug fixes, a fix for excessive memory consumption in some cases and a new `regex::Captures::get_match` API.

Improvements:

* [FEATURE #1146](https://github.com/rust-lang/regex/issues/1146): Add `Captures::get_match` for returning the overall match without `unwrap()`.

Bug fixes:

* [BUG #1083](https://github.com/rust-lang/regex/issues/1083): Fixes a panic in the lazy DFA (can only occur for especially large regexes).
* [BUG #1116](https://github.com/rust-lang/regex/issues/1116): Fixes a memory usage regression for large regexes (introduced in `regex 1.9`).
* [BUG #1195](https://github.com/rust-lang/regex/issues/1195): Fix universal start states in sparse DFA.
* [BUG #1295](https://github.com/rust-lang/regex/pull/1295): Fixes a panic when deserializing a corrupted dense DFA.
* [BUG 8f5d9479](https://github.com/rust-lang/regex/commit/8f5d9479d0f1da5726488a530d7fd66a73d05b80): Make `regex_automata::meta::Regex::find` consistently return `None` when `WhichCaptures::None` is used.


1.11.3 (2025-09-25)
===================
This is a small patch release with an improvement in memory usage in some cases.

Improvements:

* [BUG #1297](https://github.com/rust-lang/regex/issues/1297): Improve memory usage by trimming excess memory capacity in some spots.


1.11.2 (2025-08-24)
===================
This is a new patch release of `regex` with some minor fixes. A large number of typo and lint fix patches were merged. Also, we now finally recommend using `std::sync::LazyLock` (a sketch follows at the end of this section).

Improvements:

* [BUG #1217](https://github.com/rust-lang/regex/issues/1217): Switch recommendation from `once_cell` to `std::sync::LazyLock`.
* [BUG #1225](https://github.com/rust-lang/regex/issues/1225): Add `DFA::set_prefilter` to `regex-automata`.

Bug fixes:

* [BUG #1150](https://github.com/rust-lang/regex/pull/1150): Remove `std` dependency from `perf-literal-multisubstring` crate feature.
* [BUG #1165](https://github.com/rust-lang/regex/pull/1165): Clarify the meaning of `(?R)$` in the documentation.
* [BUG #1281](https://github.com/rust-lang/regex/pull/1281): Remove `fuzz/` and `record/` directories from published crate on crates.io.
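
For illustration, here is a minimal sketch (not part of the original release notes) of the `LazyLock` recommendation: compile a regex once and share it for the life of the program.

```rust
use std::sync::LazyLock;

use regex::Regex;

// Compiled on first use, then shared for the life of the program.
static WORD: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"\w+").unwrap());

fn main() {
    assert!(WORD.is_match("hello"));
}
```
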
1.11.1 (2024-10-24)
===================
This is a new patch release of `regex` that fixes compilation on nightly Rust when the unstable `pattern` crate feature is enabled. Users on nightly Rust without this feature enabled are unaffected.

Bug fixes:

* [BUG #1231](https://github.com/rust-lang/regex/issues/1231): Fix the `Pattern` trait implementation as a result of nightly API breakage.


1.11.0 (2024-09-29)
===================
This is a new minor release of `regex` that brings in an update to the Unicode Character Database. Specifically, this updates the Unicode data used by `regex` internally to the version 16 release.

New features:

* [FEATURE #1228](https://github.com/rust-lang/regex/pull/1228): Add new `regex::SetMatches::matched_all` method.
* [FEATURE #1229](https://github.com/rust-lang/regex/pull/1229): Update to Unicode Character Database (UCD) version 16.


1.10.6 (2024-08-02)
===================
This is a new patch release with a fix for the `unstable` crate feature that enables `std::str::Pattern` trait integration.

Bug fixes:

* [BUG #1219](https://github.com/rust-lang/regex/pull/1219): Fix the `Pattern` trait implementation as a result of nightly API breakage.


1.10.5 (2024-06-09)
===================
This is a new patch release with some minor fixes.

Bug fixes:

* [BUG #1203](https://github.com/rust-lang/regex/pull/1203): Escape invalid UTF-8 when in the `Debug` impl of `regex::bytes::Match`.


1.10.4 (2024-03-22)
===================
This is a new patch release with some minor fixes.

* [BUG #1169](https://github.com/rust-lang/regex/issues/1169): Fixes a bug with compiling a reverse NFA automaton in `regex-automata`.
* [BUG #1178](https://github.com/rust-lang/regex/pull/1178): Clarifies that when `Cow::Borrowed` is returned from replace APIs, it is equivalent to the input.


1.10.3 (2024-01-21)
===================
This is a new patch release that fixes the feature configuration of optional dependencies, and fixes an unsound use of bounds check elision.

Bug fixes:

* [BUG #1147](https://github.com/rust-lang/regex/issues/1147): Set `default-features=false` for the `memchr` and `aho-corasick` dependencies.
* [BUG #1154](https://github.com/rust-lang/regex/pull/1154): Fix unsound bounds check elision.


1.10.2 (2023-10-16)
===================
This is a new patch release that fixes a search regression where incorrect matches could be reported.

Bug fixes:

* [BUG #1110](https://github.com/rust-lang/regex/issues/1110): Revert broadening of reverse suffix literal optimization introduced in 1.10.1.


1.10.1 (2023-10-14)
===================
This is a new patch release with a minor increase in the number of valid patterns and a broadening of some literal optimizations.

New features:

* [FEATURE 04f5d7be](https://github.com/rust-lang/regex/commit/04f5d7be4efc542864cc400f5d43fbea4eb9bab6): Loosen ASCII-compatible rules such that regexes like `(?-u:☃)` are now allowed.

Performance improvements:

* [PERF 8a8d599f](https://github.com/rust-lang/regex/commit/8a8d599f9d2f2d78e9ad84e4084788c2d563afa5): Broaden the reverse suffix optimization to apply in more cases.


1.10.0 (2023-10-09)
===================
This is a new minor release of `regex` that adds support for start and end word boundary assertions. That is, `\<` and `\>`. The minimum supported Rust version has also been raised to 1.65, which was released about one year ago.

The new word boundary assertions are:

* `\<` or `\b{start}`: a Unicode start-of-word boundary (`\W|\A` on the left, `\w` on the right).
* `\>` or `\b{end}`: a Unicode end-of-word boundary (`\w` on the left, `\W|\z` on the right).
* `\b{start-half}`: half of a Unicode start-of-word boundary (`\W|\A` on the left).
* `\b{end-half}`: half of a Unicode end-of-word boundary (`\W|\z` on the right).

The `\<` and `\>` are GNU extensions to POSIX regexes. They have been added to the `regex` crate because they enjoy somewhat broad support in other regex engines as well (for example, vim). The `\b{start}` and `\b{end}` assertions are aliases for `\<` and `\>`, respectively.

The `\b{start-half}` and `\b{end-half}` assertions are not found in any other regex engine (although regex engines with general look-around support can certainly express them). They were added principally to support the implementation of word matching in grep programs, where one generally wants to be a bit more flexible in what is considered a word boundary.
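
For illustration, a minimal sketch (not part of the original release notes) of the new assertions:

```rust
use regex::Regex;

fn main() {
    // \< and \> only match at the start and end of a word.
    let re = Regex::new(r"\<cat\>").unwrap();
    assert!(re.is_match("a cat sat"));
    assert!(!re.is_match("concatenate"));
    // \b{start} is an alias for \<.
    assert!(Regex::new(r"\b{start}cat").unwrap().is_match("cat nap"));
}
```
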

New features:

* [FEATURE #469](https://github.com/rust-lang/regex/issues/469): Add support for `\<` and `\>` word boundary assertions.
* [FEATURE(regex-automata) #1031](https://github.com/rust-lang/regex/pull/1031): DFAs now have a `start_state` method that doesn't use an `Input`.

Performance improvements:

* [PERF #1051](https://github.com/rust-lang/regex/pull/1051): Unicode character class operations have been optimized in `regex-syntax`.
* [PERF #1090](https://github.com/rust-lang/regex/issues/1090): Make patterns containing lots of literal characters use less memory.

Bug fixes:

* [BUG #1046](https://github.com/rust-lang/regex/issues/1046): Fix a bug that could result in incorrect match spans when using a Unicode word boundary and searching non-ASCII strings.
* [BUG(regex-syntax) #1047](https://github.com/rust-lang/regex/issues/1047): Fix panics that can occur in `Ast->Hir` translation (not reachable from `regex` crate).
* [BUG(regex-syntax) #1088](https://github.com/rust-lang/regex/issues/1088): Remove guarantees in the API that connect the `u` flag with a specific HIR representation.

`regex-automata` breaking change release:

This release includes a `regex-automata 0.4.0` breaking change release, which was necessary in order to support the new word boundary assertions. For example, the `Look` enum has new variants and the `LookSet` type now uses `u32` instead of `u16` to represent a bitset of look-around assertions. These are overall very minor changes, and most users of `regex-automata` should be able to move to `0.4` from `0.3` without any changes at all.

`regex-syntax` breaking change release:

This release also includes a `regex-syntax 0.8.0` breaking change release, which, like `regex-automata`, was necessary in order to support the new word boundary assertions. This release also includes some changes to the `Ast` type to reduce heap usage in some cases. If you are using the `Ast` type directly, your code may require some minor modifications. Otherwise, users of `regex-syntax 0.7` should be able to migrate to `0.8` without any code changes.

`regex-lite` release:

The `regex-lite 0.1.1` release contains support for the new word boundary assertions. There are no breaking changes.


1.9.6 (2023-09-30)
==================
This is a patch release that fixes a panic that can occur when the default regex size limit is increased to a large number.

* [BUG aa4e4c71](https://github.com/rust-lang/regex/commit/aa4e4c7120b0090ce0624e3c42a2ed06dd8b918a): Fix a bug where computing the maximum haystack length for the bounded backtracker could result in an underflow and thus provoke a panic later in a search due to a broken invariant.


1.9.5 (2023-09-02)
==================
This is a patch release that hopefully mostly fixes a performance bug that occurs when sharing a regex across multiple threads. Issue [#934](https://github.com/rust-lang/regex/issues/934) explains this in more detail. It is [also noted in the crate documentation](https://docs.rs/regex/latest/regex/#sharing-a-regex-across-threads-can-result-in-contention). The bug can appear when sharing a regex across multiple threads simultaneously, as might be the case when using a regex from a `OnceLock`, `lazy_static` or similar primitive. High contention usually only results when many threads are used to execute searches on small haystacks.

One can avoid the contention problem entirely through one of two methods. The first is to use lower level APIs from `regex-automata` that require passing state explicitly, such as [`meta::Regex::search_with`](https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html#method.search_with). The second is to clone a regex and send it to other threads explicitly, as sketched at the end of this section. This will not use any additional memory compared to sharing the regex. The only downside of this approach is that it may be less convenient, for example, it won't work with things like `OnceLock` or `lazy_static` or `once_cell`.

With that said, as of this release, the contention performance problems have been greatly reduced. This was achieved by changing the free-list so that it is sharded across threads, and by ensuring each sharded mutex occupies a single cache line to mitigate false sharing. So while contention may still impact performance in some cases, it should be a lot better now.

Because of the changes to how the free-list works, please report any issues you find with this release. That not only includes search time regressions but also significant regressions in memory usage. Reports of improvements are welcome as well! If possible, provide a reproduction.

Bug fixes:

* [BUG #934](https://github.com/rust-lang/regex/issues/934): Fix a performance bug where high contention on a single regex led to massive slow-downs.
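
A minimal sketch (not part of the original release notes) of the clone-per-thread workaround described above:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"\w+").unwrap();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            // Per the notes above, a clone uses no additional memory
            // compared to sharing, and sidesteps cross-thread contention.
            let re = re.clone();
            std::thread::spawn(move || re.is_match("hello world"))
        })
        .collect();
    for handle in handles {
        assert!(handle.join().unwrap());
    }
}
```
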

1.9.4 (2023-08-26)
==================
This is a patch release that fixes a bug where `RegexSet::is_match(..)` could incorrectly return false (even when `RegexSet::matches(..).matched_any()` returns true).

Bug fixes:

* [BUG #1070](https://github.com/rust-lang/regex/issues/1070): Fix a bug where a prefilter was incorrectly configured for a `RegexSet`.
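
A sketch (not from the release notes) of the invariant this fix restores:

```rust
use regex::RegexSet;

fn main() {
    let set = RegexSet::new([r"\w+", r"\d+"]).unwrap();
    // After this fix, is_match always agrees with matched_any.
    assert_eq!(set.is_match("foo 123"), set.matches("foo 123").matched_any());
}
```
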

1.9.3 (2023-08-05)
==================
This is a patch release that fixes a bug where some searches could result in incorrect match offsets being reported. It is difficult to characterize the types of regexes susceptible to this bug. They generally involve patterns that contain no prefix or suffix literals, but have an inner literal along with a regex prefix that can conditionally match.

Bug fixes:

* [BUG #1060](https://github.com/rust-lang/regex/issues/1060): Fix a bug with the reverse inner literal optimization reporting incorrect match offsets.


1.9.2 (2023-08-05)
==================
This is a patch release that fixes another memory usage regression. This particular regression occurred only when using a `RegexSet`. In some cases, much more heap memory (by one or two orders of magnitude) was allocated than in versions prior to 1.9.0.

Bug fixes:

* [BUG #1059](https://github.com/rust-lang/regex/issues/1059): Fix a memory usage regression when using a `RegexSet`.


1.9.1 (2023-07-07)
==================
This is a patch release which fixes a memory usage regression. In the regex 1.9 release, one of the internal engines used a more aggressive allocation strategy than what was done previously. This patch release reverts to the prior on-demand strategy.

Bug fixes:

* [BUG #1027](https://github.com/rust-lang/regex/issues/1027): Change the allocation strategy for the backtracker to be less aggressive.


1.9.0 (2023-07-05)
==================
This release marks the end of a [years long rewrite of the regex crate internals](https://github.com/rust-lang/regex/issues/656). Since this is such a big release, please report any issues or regressions you find. We would also love to hear about improvements.

In addition to many internal improvements that should hopefully result in "my regex searches are faster," there have also been a few API additions:

* A new `Captures::extract` method for quickly accessing the substrings that match each capture group in a regex (a sketch follows this list).
* A new inline flag, `R`, which enables CRLF mode. This makes `.` match any Unicode scalar value except for `\r` and `\n`, and also makes `(?m:^)` and `(?m:$)` match after and before both `\r` and `\n`, respectively, but never between a `\r` and `\n`.
* `RegexBuilder::line_terminator` was added to further customize the line terminator used by `(?m:^)` and `(?m:$)` to be any arbitrary byte.
* The `std` Cargo feature is now actually optional. That is, the `regex` crate can be used without the standard library.
* Because `regex 1.9` may make binary size and compile times even worse, a new experimental crate called `regex-lite` has been published. It prioritizes binary size and compile times over functionality (like Unicode) and performance. It shares no code with the `regex` crate.
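
A minimal sketch (not part of the original release notes) of the new `Captures::extract` method:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap();
    // extract() yields the overall match plus an array with one &str
    // per capture group.
    let (full, [y, m, d]) = re.captures("2023-07-05").unwrap().extract();
    assert_eq!((full, y, m, d), ("2023-07-05", "2023", "07", "05"));
}
```
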

New features:

* [FEATURE #244](https://github.com/rust-lang/regex/issues/244): One can opt into CRLF mode via the `R` flag. e.g., `(?mR:$)` matches just before `\r\n`.
* [FEATURE #259](https://github.com/rust-lang/regex/issues/259): Multi-pattern searches with offsets can be done with `regex-automata 0.3`.
* [FEATURE #476](https://github.com/rust-lang/regex/issues/476): `std` is now an optional feature. `regex` may be used with only `alloc`.
* [FEATURE #644](https://github.com/rust-lang/regex/issues/644): `RegexBuilder::line_terminator` configures how `(?m:^)` and `(?m:$)` behave.
* [FEATURE #675](https://github.com/rust-lang/regex/issues/675): Anchored search APIs are now available in `regex-automata 0.3`.
* [FEATURE #824](https://github.com/rust-lang/regex/issues/824): Add new `Captures::extract` method for easier capture group access.
* [FEATURE #961](https://github.com/rust-lang/regex/issues/961): Add `regex-lite` crate with smaller binary sizes and faster compile times.
* [FEATURE #1022](https://github.com/rust-lang/regex/pull/1022): Add `TryFrom` implementations for the `Regex` type.

Performance improvements:

* [PERF #68](https://github.com/rust-lang/regex/issues/68): Added a one-pass DFA engine for faster capture group matching.
* [PERF #510](https://github.com/rust-lang/regex/issues/510): Inner literals are now used to accelerate searches, e.g., `\w+@\w+` will scan for `@`.
* [PERF #787](https://github.com/rust-lang/regex/issues/787), [PERF #891](https://github.com/rust-lang/regex/issues/891): Makes literal optimizations apply to regexes of the form `\b(foo|bar|quux)\b`.

(There are many more performance improvements as well, but not all of them have specific issues devoted to them.)

Bug fixes:

* [BUG #429](https://github.com/rust-lang/regex/issues/429): Fix matching bugs related to `\B` and inconsistencies across internal engines.
* [BUG #517](https://github.com/rust-lang/regex/issues/517): Fix matching bug with capture groups.
* [BUG #579](https://github.com/rust-lang/regex/issues/579): Fix matching bug with word boundaries.
* [BUG #779](https://github.com/rust-lang/regex/issues/779): Fix bug where some regexes like `(re)+` were not equivalent to `(re)(re)*`.
* [BUG #850](https://github.com/rust-lang/regex/issues/850): Fix matching bug inconsistency between NFA and DFA engines.
* [BUG #921](https://github.com/rust-lang/regex/issues/921): Fix matching bug where literal extraction got confused by `$`.
* [BUG #976](https://github.com/rust-lang/regex/issues/976): Add documentation to replacement routines about dealing with fallibility.
* [BUG #1002](https://github.com/rust-lang/regex/issues/1002): Use corpus rejection in fuzz testing.


1.8.4 (2023-06-05)
==================
This is a patch release that fixes a bug where `(?-u:\B)` was allowed in Unicode regexes, despite the fact that the current matching engines can report match offsets between the code units of a single UTF-8 encoded codepoint. That in turn means that match offsets that split a codepoint could be reported, which in turn results in panicking when one uses them to slice a `&str`.

This bug occurred in the transition to `regex 1.8` because the underlying syntactical error that prevented this regex from compiling was intentionally removed. That's because `(?-u:\B)` will be permitted in Unicode regexes in `regex 1.9`, but the matching engines will guarantee to never report match offsets that split a codepoint. When the underlying syntactical error was removed, no code was added to ensure that `(?-u:\B)` didn't compile in the `regex 1.8` transition release. This release, `regex 1.8.4`, adds that code such that `Regex::new(r"(?-u:\B)")` returns to the `regex <1.8` behavior of not compiling. (A `bytes::Regex` can still of course compile it.)

Bug fixes:

* [BUG #1006](https://github.com/rust-lang/regex/issues/1006): Fix a bug where `(?-u:\B)` was allowed in Unicode regexes, and in turn could lead to match offsets that split a codepoint in `&str`.


1.8.3 (2023-05-25)
==================
This is a patch release that fixes a bug where the regex would report a match at every position even when it shouldn't. This could occur in a very small subset of regexes, usually an alternation of simple literals that have particular properties. (See the issue linked below for a more precise description.)

Bug fixes:

* [BUG #999](https://github.com/rust-lang/regex/issues/999): Fix a bug where a match at every position is erroneously reported.


1.8.2 (2023-05-22)
==================
This is a patch release that fixes a bug where regex compilation could panic in debug mode for regexes with large counted repetitions. For example, `a{2147483516}{2147483416}{5}` resulted in an integer overflow that wrapped in release mode but panicked in debug mode. Despite the unintended wrapping arithmetic in release mode, it didn't cause any other logical bugs since the errant code was for new analysis that wasn't used yet.

Bug fixes:

* [BUG #995](https://github.com/rust-lang/regex/issues/995): Fix a bug where regex compilation with large counted repetitions could panic.


1.8.1 (2023-04-21)
==================
This is a patch release that fixes a bug where a regex match could be reported where none was found. Specifically, the bug occurs when a pattern contains some literal prefixes that could be extracted _and_ an optional word boundary in the prefix.

Bug fixes:

* [BUG #981](https://github.com/rust-lang/regex/issues/981): Fix a bug where a word boundary could interact with prefix literal optimizations and lead to a false positive match.


1.8.0 (2023-04-20)
==================
This is a sizeable release that will be soon followed by another sizeable release. Combined, they will close over 40 existing issues and PRs. This first release, despite its size, essentially represents preparatory work for the second release, which will be even bigger. Namely, this release:

* Increases the MSRV to Rust 1.60.0, which was released about 1 year ago.
* Upgrades its dependency on `aho-corasick` to the recently released 1.0 version.
* Upgrades its dependency on `regex-syntax` to the simultaneously released `0.7` version. The changes to `regex-syntax` principally revolve around a rewrite of its literal extraction code and a number of simplifications and optimizations to its high-level intermediate representation (HIR).

The second release, which will follow ~shortly after the release above, will contain a soup-to-nuts rewrite of every regex engine. This will be done by bringing [`regex-automata`](https://github.com/BurntSushi/regex-automata) into this repository, and then changing the `regex` crate to be nothing but an API shim layer on top of `regex-automata`'s API.

These tandem releases are the culmination of about 3 years of on-and-off work that [began in earnest in March 2020](https://github.com/rust-lang/regex/issues/656).

Because of the scale of changes involved in these releases, I would love to hear about your experience. Especially if you notice undocumented changes in behavior or performance changes (positive *or* negative).

Most changes in the first release are listed below. For more details, please see the commit log, which reflects a linear and decently documented history of all changes.

New features:

* [FEATURE #501](https://github.com/rust-lang/regex/issues/501): Permit many more characters to be escaped, even if they have no significance. More specifically, any ASCII character except for `[0-9A-Za-z<>]` can now be escaped. Also, a new routine, `is_escapeable_character`, has been added to `regex-syntax` to query whether a character is escapable or not.
* [FEATURE #547](https://github.com/rust-lang/regex/issues/547): Add `Regex::captures_at`. This fills a hole in the API, but doesn't otherwise introduce any new expressive power.
* [FEATURE #595](https://github.com/rust-lang/regex/issues/595): Capture group names are now Unicode-aware. They can now begin with either a `_` or any "alphabetic" codepoint. After the first codepoint, subsequent codepoints can be any sequence of alphanumeric codepoints, along with `_`, `.`, `[` and `]`. Note that replacement syntax has not changed.
* [FEATURE #810](https://github.com/rust-lang/regex/issues/810): Add `Match::is_empty` and `Match::len` APIs.
* [FEATURE #905](https://github.com/rust-lang/regex/issues/905): Add an `impl Default for RegexSet`, with the default being the empty set.
* [FEATURE #908](https://github.com/rust-lang/regex/issues/908): A new method, `Regex::static_captures_len`, has been added which returns the number of capture groups in the pattern if and only if every possible match always contains the same number of matching groups.
* [FEATURE #955](https://github.com/rust-lang/regex/issues/955): Named captures can now be written as `(?<name>re)` in addition to `(?P<name>re)` (a sketch follows this list).
* FEATURE: `regex-syntax` now supports empty character classes.
* FEATURE: `regex-syntax` now has an optional `std` feature. (This will come to `regex` in the second release.)
* FEATURE: The `Hir` type in `regex-syntax` has had a number of simplifications made to it.
* FEATURE: `regex-syntax` has support for a new `R` flag for enabling CRLF mode. This will be supported in `regex` proper in the second release.
* FEATURE: `regex-syntax` now has proper support for "regex that never matches" via `Hir::fail()`.
* FEATURE: The `hir::literal` module of `regex-syntax` has been completely re-worked. It now has more documentation, examples and advice.
* FEATURE: The `allow_invalid_utf8` option in `regex-syntax` has been renamed to `utf8`, and the meaning of the boolean has been flipped.
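
A minimal sketch (not part of the original release notes) of the two named-capture spellings:

```rust
use regex::Regex;

fn main() {
    // Both spellings create a group named "y".
    let old = Regex::new(r"(?P<y>\d{4})").unwrap();
    let new = Regex::new(r"(?<y>\d{4})").unwrap();
    assert_eq!(&old.captures("1973").unwrap()["y"], "1973");
    assert_eq!(&new.captures("1973").unwrap()["y"], "1973");
}
```
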

Performance improvements:

* PERF: The upgrade to `aho-corasick 1.0` may improve performance in some cases. It's difficult to characterize exactly which patterns this might impact, but if there are a small number of longish (>= 4 bytes) prefix literals, then it might be faster than before.

Bug fixes:

* [BUG #514](https://github.com/rust-lang/regex/issues/514): Improve `Debug` impl for `Match` so that it doesn't show the entire haystack.
* BUGS [#516](https://github.com/rust-lang/regex/issues/516), [#731](https://github.com/rust-lang/regex/issues/731): Fix a number of issues with printing `Hir` values as regex patterns.
* [BUG #610](https://github.com/rust-lang/regex/issues/610): Add explicit example of `foo|bar` in the regex syntax docs.
* [BUG #625](https://github.com/rust-lang/regex/issues/625): Clarify that `SetMatches::len` does not (regrettably) refer to the number of matches in the set.
* [BUG #660](https://github.com/rust-lang/regex/issues/660): Clarify "verbose mode" in regex syntax documentation.
* BUG [#738](https://github.com/rust-lang/regex/issues/738), [#950](https://github.com/rust-lang/regex/issues/950): Fix `CaptureLocations::get` so that it never panics.
* [BUG #747](https://github.com/rust-lang/regex/issues/747): Clarify documentation for `Regex::shortest_match`.
* [BUG #835](https://github.com/rust-lang/regex/issues/835): Fix `\p{Sc}` so that it is equivalent to `\p{Currency_Symbol}`.
* [BUG #846](https://github.com/rust-lang/regex/issues/846): Add more clarifying documentation to the `CompiledTooBig` error variant.
* [BUG #854](https://github.com/rust-lang/regex/issues/854): Clarify that `regex::Regex` searches as if the haystack is a sequence of Unicode scalar values.
* [BUG #884](https://github.com/rust-lang/regex/issues/884): Replace `__Nonexhaustive` variants with `#[non_exhaustive]` attribute.
* [BUG #893](https://github.com/rust-lang/regex/pull/893): Optimize case folding since it can get quite slow in some pathological cases.
* [BUG #895](https://github.com/rust-lang/regex/issues/895): Reject `(?-u:\W)` in `regex::Regex` APIs.
* [BUG #942](https://github.com/rust-lang/regex/issues/942): Add a missing `void` keyword to indicate "no parameters" in C API.
* [BUG #965](https://github.com/rust-lang/regex/issues/965): Fix `\p{Lc}` so that it is equivalent to `\p{Cased_Letter}`.
* [BUG #975](https://github.com/rust-lang/regex/issues/975): Clarify documentation for `\pX` syntax.
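
A spot-check (not from the release notes) of the property alias fixes in #835 and #965, assuming current Unicode tables:

```rust
use regex::Regex;

fn main() {
    // \p{Sc} is now the same class as \p{Currency_Symbol}.
    assert!(Regex::new(r"\p{Sc}").unwrap().is_match("€"));
    assert!(Regex::new(r"\p{Currency_Symbol}").unwrap().is_match("€"));
    // \p{Lc} is now the same class as \p{Cased_Letter}.
    assert!(Regex::new(r"\p{Lc}").unwrap().is_match("a"));
    assert!(Regex::new(r"\p{Cased_Letter}").unwrap().is_match("a"));
}
```
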

1.7.3 (2023-03-24)
==================
This is a small release that fixes a bug in `Regex::shortest_match_at` that could cause it to panic, even when the offset given is valid.

Bug fixes:

* [BUG #969](https://github.com/rust-lang/regex/issues/969): Fix a bug in how the reverse DFA was called for `Regex::shortest_match_at`.


1.7.2 (2023-03-21)
==================
This is a small release that fixes a failing test on FreeBSD.

Bug fixes:

* [BUG #967](https://github.com/rust-lang/regex/issues/967): Fix "no stack overflow" test which can fail due to the small stack size.


1.7.1 (2023-01-09)
==================
This release was done principally to try and fix the docs.rs rendering for the regex crate.

Performance improvements:

* [PERF #930](https://github.com/rust-lang/regex/pull/930): Optimize `replacen`. This also applies to `replace`, but not `replace_all`.

Bug fixes:

* [BUG #945](https://github.com/rust-lang/regex/issues/945): Maybe fix rustdoc rendering by just bumping a new release?


1.7.0 (2022-11-05)
==================
This release principally includes an upgrade to Unicode 15.

New features:

* [FEATURE #916](https://github.com/rust-lang/regex/issues/916): Upgrade to Unicode 15.


1.6.0 (2022-07-05)
==================
This release principally includes an upgrade to Unicode 14.

New features:

* [FEATURE #832](https://github.com/rust-lang/regex/pull/832): Clarify that `Captures::len` includes all groups, not just matching groups.
* [FEATURE #857](https://github.com/rust-lang/regex/pull/857): Add an `ExactSizeIterator` impl for `SubCaptureMatches`.
* [FEATURE #861](https://github.com/rust-lang/regex/pull/861): Improve `RegexSet` documentation examples.
* [FEATURE #877](https://github.com/rust-lang/regex/issues/877): Upgrade to Unicode 14.

Bug fixes:

* [BUG #792](https://github.com/rust-lang/regex/issues/792): Fix error message rendering bug.


1.5.6 (2022-05-20)
==================
This release includes a few bug fixes, including a bug that produced incorrect matches when a non-greedy `?` operator was used.

* [BUG #680](https://github.com/rust-lang/regex/issues/680): Fixes a bug where `[[:alnum:][:^ascii:]]` dropped `[:alnum:]` from the class.
* [BUG #859](https://github.com/rust-lang/regex/issues/859): Fixes a bug where `Hir::is_match_empty` returned `false` for `\b`.
* [BUG #862](https://github.com/rust-lang/regex/issues/862): Fixes a bug where `ab??` matches `ab` instead of `a` in `ab`.
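
A sketch (not from the release notes) of the corrected non-greedy behavior from #862:

```rust
use regex::Regex;

fn main() {
    // The non-greedy b?? prefers the empty alternative, so the
    // leftmost-first match of ab?? in "ab" is just "a".
    let m = Regex::new(r"ab??").unwrap().find("ab").unwrap();
    assert_eq!(m.as_str(), "a");
}
```
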

1.5.5 (2022-03-08)
==================
This release fixes a security bug in the regex compiler. This bug permits a vector for a denial-of-service attack in cases where the regex being compiled is untrusted. There are no known problems where the regex is itself trusted, including in cases of untrusted haystacks.

* [SECURITY #GHSA-m5pq-gvj9-9vr8](https://github.com/rust-lang/regex/security/advisories/GHSA-m5pq-gvj9-9vr8): Fixes a bug in the regex compiler where empty sub-expressions subverted the existing mitigations in place to enforce a size limit on compiled regexes. The Rust Security Response WG published an advisory about this: https://groups.google.com/g/rustlang-security-announcements/c/NcNNL1Jq7Yw


1.5.4 (2021-05-06)
==================
This release fixes another compilation failure when building regex. This time, the fix is for when the `pattern` feature is enabled, which only works on nightly Rust. CI has been updated to test this case.

* [BUG #772](https://github.com/rust-lang/regex/pull/772): Fix build when `pattern` feature is enabled.


1.5.3 (2021-05-01)
==================
This release fixes a bug when building regex with only the `unicode-perl` feature. It turns out that while CI was building this configuration, it wasn't actually failing the overall build on a failed compilation.

* [BUG #769](https://github.com/rust-lang/regex/issues/769): Fix build in `regex-syntax` when only the `unicode-perl` feature is enabled.


1.5.2 (2021-05-01)
==================
This release fixes a performance bug when Unicode word boundaries are used. Namely, for certain regexes on certain inputs, it's possible for the lazy DFA to stop searching (causing a fallback to a slower engine) when it doesn't actually need to.

[PR #768](https://github.com/rust-lang/regex/pull/768) fixes the bug, which was originally reported in [ripgrep#1860](https://github.com/BurntSushi/ripgrep/issues/1860).


1.5.1 (2021-04-30)
==================
This is a patch release that fixes a compilation error when the `perf-literal` feature is not enabled.


1.5.0 (2021-04-30)
==================
This release primarily updates to Rust 2018 (finally) and bumps the MSRV to Rust 1.41 (from Rust 1.28). Rust 1.41 was chosen because it's still reasonably old, and is what's in Debian stable at the time of writing.

This release also drops this crate's own bespoke substring search algorithms in favor of a new [`memmem` implementation provided by the `memchr` crate](https://docs.rs/memchr/2.4.0/memchr/memmem/index.html). This will change the performance profile of some regexes, sometimes getting a little worse, and hopefully more frequently, getting a lot better. Please report any serious performance regressions if you find them.


1.4.6 (2021-04-22)
==================
This is a small patch release that fixes the compiler's size check on how much heap memory a regex uses. Previously, the compiler did not account for the heap usage of Unicode character classes. Now it does. It's possible that this may make some regexes fail to compile that previously did compile. If that happens, please file an issue.

* [BUG OSS-fuzz#33579](https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=33579): Some regexes can use more heap memory than one would expect.


1.4.5 (2021-03-14)
==================
This is a small patch release that fixes a regression in the size of a `Regex` in the 1.4.4 release. Prior to 1.4.4, a `Regex` was 552 bytes. In the 1.4.4 release, it was 856 bytes due to internal changes. In this release, a `Regex` is now 16 bytes. In general, the size of a `Regex` was never something that was on my radar, but this increased size in the 1.4.4 release seems to have crossed a threshold and resulted in stack overflows in some programs.

* [BUG #750](https://github.com/rust-lang/regex/pull/750): Fixes stack overflows seemingly caused by a large `Regex` size by decreasing its size.


1.4.4 (2021-03-11)
==================
This is a small patch release that contains some bug fixes. Notably, it also drops the `thread_local` (and `lazy_static`, via transitivity) dependencies.

Bug fixes:

* [BUG #362](https://github.com/rust-lang/regex/pull/362): Memory leaks caused by an internal caching strategy should now be fixed.
* [BUG #576](https://github.com/rust-lang/regex/pull/576): All regex types now implement `UnwindSafe` and `RefUnwindSafe`.
* [BUG #728](https://github.com/rust-lang/regex/pull/749): Add missing `Replacer` impls for `Vec<u8>`, `String`, `Cow`, etc. (a sketch follows this list).
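
A minimal sketch (not from the release notes) of passing an owned `String` as a replacement, which these impls enable:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"\d+").unwrap();
    // Replacer is implemented for &str, and with these impls also for
    // owned types like String.
    let replacement = String::from("N");
    assert_eq!(re.replace_all("a1 b22", replacement), "aN bN");
}
```
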

1.4.3 (2021-01-08)
==================
This is a small patch release that adds some missing standard trait implementations for some types in the public API.

Bug fixes:

* [BUG #734](https://github.com/rust-lang/regex/pull/734): Add `FusedIterator` and `ExactSizeIterator` impls to iterator types.
* [BUG #735](https://github.com/rust-lang/regex/pull/735): Add missing `Debug` impls to public API types.


1.4.2 (2020-11-01)
==================
This is a small bug fix release that bans `\P{any}`. We previously banned empty classes like `[^\w\W]`, but missed the `\P{any}` case. In the future, we hope to permit empty classes.

* [BUG #722](https://github.com/rust-lang/regex/issues/722): Ban `\P{any}` to avoid a panic in the regex compiler. Found by OSS-Fuzz.


1.4.1 (2020-10-13)
==================
This is a small bug fix release that makes `\p{cf}` work. Previously, it would report "property not found" even though `cf` is a valid abbreviation for the `Format` general category.

* [BUG #719](https://github.com/rust-lang/regex/issues/719): Fixes bug that prevented `\p{cf}` from working.


1.4.0 (2020-10-11)
==================
This release has a few minor documentation fixes as well as some very minor API additions. The MSRV remains at Rust 1.28 for now, but this is intended to increase to at least Rust 1.41.1 soon.

This release also adds support for OSS-Fuzz. Kudos to [@DavidKorczynski](https://github.com/DavidKorczynski) for doing the heavy lifting for that!

New features:

* [FEATURE #649](https://github.com/rust-lang/regex/issues/649): Support `[`, `]` and `.` in capture group names.
* [FEATURE #687](https://github.com/rust-lang/regex/issues/687): Add `is_empty` predicate to `RegexSet`.
* [FEATURE #689](https://github.com/rust-lang/regex/issues/689): Implement `Clone` for `SubCaptureMatches`.
* [FEATURE #715](https://github.com/rust-lang/regex/issues/715): Add `empty` constructor to `RegexSet` for convenience.

Bug fixes:

* [BUG #694](https://github.com/rust-lang/regex/issues/694): Fix doc example for `Replacer::replace_append`.
* [BUG #698](https://github.com/rust-lang/regex/issues/698): Clarify docs for `s` flag when using a `bytes::Regex`.
* [BUG #711](https://github.com/rust-lang/regex/issues/711): Clarify `is_match` docs to indicate that it can match anywhere in string.


1.3.9 (2020-05-28)
==================
This release fixes an MSRV (Minimum Supported Rust Version) regression in the 1.3.8 release. Namely, while 1.3.8 compiles on Rust 1.28, it actually does not compile on other Rust versions, such as Rust 1.39.

Bug fixes:

* [BUG #685](https://github.com/rust-lang/regex/issues/685): Remove use of `doc_comment` crate, which cannot be used before Rust 1.43.


1.3.8 (2020-05-28)
==================
This release contains a couple of important bug fixes driven by better support for empty-subexpressions in regexes. For example, regexes like `b|` are now allowed. Major thanks to [@sliquister](https://github.com/sliquister) for implementing support for this in [#677](https://github.com/rust-lang/regex/pull/677).

Bug fixes:

* [BUG #523](https://github.com/rust-lang/regex/pull/523): Add note to documentation that spaces can be escaped in `x` mode.
* [BUG #524](https://github.com/rust-lang/regex/issues/524): Add support for empty sub-expressions, including empty alternations.
* [BUG #659](https://github.com/rust-lang/regex/issues/659): Fix match bug caused by an empty sub-expression miscompilation.
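
A sketch (not from the release notes) of the newly allowed empty alternation:

```rust
use regex::Regex;

fn main() {
    // The empty branch matches the empty string, so with leftmost-first
    // semantics the first match in "abc" is empty (the `b` branch fails
    // at position 0).
    let re = Regex::new(r"b|").unwrap();
    let m = re.find("abc").unwrap();
    assert_eq!((m.start(), m.end()), (0, 0));
}
```
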

1.3.7 (2020-04-17)
==================
This release contains a small bug fix that fixes how `regex` forwards crate features to `regex-syntax`. In particular, this will reduce recompilations in some cases.

Bug fixes:

* [BUG #665](https://github.com/rust-lang/regex/pull/665): Fix feature forwarding to `regex-syntax`.


1.3.6 (2020-03-24)
==================
This release contains a sizable (~30%) performance improvement when compiling some kinds of large regular expressions.

Performance improvements:

* [PERF #657](https://github.com/rust-lang/regex/pull/657): Improve performance of compiling large regular expressions.


1.3.5 (2020-03-12)
==================
This release updates this crate to Unicode 13.

New features:

* [FEATURE #653](https://github.com/rust-lang/regex/pull/653): Update `regex-syntax` to Unicode 13.


1.3.4 (2020-01-30)
==================
This is a small bug fix release that fixes a bug related to the scoping of flags in a regex. Namely, before this fix, a regex like `((?i)a)b` would match `aB` despite the fact that `b` should not be matched case insensitively.

Bug fixes:

* [BUG #640](https://github.com/rust-lang/regex/issues/640): Fix bug related to the scoping of flags in a regex.


1.3.3 (2020-01-09)
==================
This is a small maintenance release that upgrades the dependency on `thread_local` from `0.3` to `1.0`. The minimum supported Rust version remains at Rust 1.28.


1.3.2 (2020-01-09)
==================
This is a small maintenance release with some house cleaning and bug fixes.

New features:

* [FEATURE #631](https://github.com/rust-lang/regex/issues/631): Add a `Match::range` method and a `From<Match> for Range<usize>` impl.

Bug fixes:

* [BUG #521](https://github.com/rust-lang/regex/issues/521): Corrects `/-/.splitn("a", 2)` to return `["a"]` instead of `["a", ""]` (a sketch follows this list).
* [BUG #594](https://github.com/rust-lang/regex/pull/594): Improve error reporting when writing `\p\`.
* [BUG #627](https://github.com/rust-lang/regex/issues/627): Corrects `/-/.split("a-")` to return `["a", ""]` instead of `["a"]`.
* [BUG #633](https://github.com/rust-lang/regex/pull/633): Squash deprecation warnings for the `std::error::Error::description` method.
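
A sketch (not from the release notes) of the corrected `split`/`splitn` behavior:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new("-").unwrap();
    // #521: no phantom trailing empty field when the limit exceeds the
    // number of fields.
    let fields: Vec<&str> = re.splitn("a", 2).collect();
    assert_eq!(fields, vec!["a"]);
    // #627: a trailing delimiter does yield a trailing empty field.
    let fields: Vec<&str> = re.split("a-").collect();
    assert_eq!(fields, vec!["a", ""]);
}
```
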

1.3.1 (2019-09-04)
==================
This is a maintenance release with no changes in order to try to work around a [docs.rs/Cargo issue](https://github.com/rust-lang/docs.rs/issues/400).


1.3.0 (2019-09-03)
==================
This release adds a plethora of new crate features that permit users of regex to shrink its size considerably, in exchange for giving up either functionality (such as Unicode support) or runtime performance. When all such features are disabled, the dependency tree for `regex` shrinks to exactly 1 crate (`regex-syntax`). More information about the new crate features can be [found in the docs](https://docs.rs/regex/*/#crate-features).

Note that while this is a new minor version release, the minimum supported Rust version for this crate remains at `1.28.0`.

New features:

* [FEATURE #474](https://github.com/rust-lang/regex/issues/474): The `use_std` feature has been deprecated in favor of the `std` feature. The `use_std` feature will be removed in regex 2. Until then, `use_std` will remain as an alias for the `std` feature.
* [FEATURE #583](https://github.com/rust-lang/regex/issues/583): Add a substantial number of crate features shrinking `regex`.


1.2.1 (2019-08-03)
==================
This release does a bit of house cleaning. Namely:

* This repository is now using rustfmt.
* License headers have been removed from all files, in following suit with the Rust project.
* Teddy has been removed from the `regex` crate, and is now part of the `aho-corasick` crate. [See `aho-corasick`'s new `packed` submodule for details](https://docs.rs/aho-corasick/0.7.6/aho_corasick/packed/index.html).
* The `utf8-ranges` crate has been deprecated, with its functionality moving into the [`utf8` sub-module of `regex-syntax`](https://docs.rs/regex-syntax/0.6.11/regex_syntax/utf8/index.html).
* The `ucd-util` dependency has been dropped, in favor of implementing what little we need inside of `regex-syntax` itself.

In general, this is part of an ongoing (long term) effort to make optimizations in the regex engine easier to reason about. The current code is too convoluted, and thus it is very easy to introduce new bugs. This simplification effort is the primary motivation behind re-working the `aho-corasick` crate to not only bundle algorithms like Teddy, but to also provide regex-like match semantics automatically.

Moving forward, the plan is to join up with the `bstr` and `regex-automata` crates, with the former providing more sophisticated substring search algorithms (thereby deleting existing code in `regex`) and the latter providing ahead-of-time compiled DFAs for cases where they are inexpensive to compute.


1.2.0 (2019-07-20)
==================
This release updates regex's minimum supported Rust version to 1.28, which was released almost 1 year ago. This release also updates regex's Unicode data tables to 12.1.0.


1.1.9 (2019-07-06)
==================
This release contains a bug fix that caused regex's tests to fail, due to a dependency on an unreleased behavior in regex-syntax.

* [BUG #593](https://github.com/rust-lang/regex/issues/593): Move an integration-style test on error messages into regex-syntax.


1.1.8 (2019-07-04)
==================
This release contains a few small internal refactorings. One of which fixes an instance of undefined behavior in a part of the SIMD code.

Bug fixes:

* [BUG #545](https://github.com/rust-lang/regex/issues/545): Improves error messages when a repetition operator is used without a number.
* [BUG #588](https://github.com/rust-lang/regex/issues/588): Removes use of a repr(Rust) union used for type punning in the Teddy matcher.
* [BUG #591](https://github.com/rust-lang/regex/issues/591): Update docs for running benchmarks and improve failure modes.


1.1.7 (2019-06-09)
==================
This release fixes up a few warnings as a result of recent deprecations.


1.1.6 (2019-04-16)
==================
This release fixes a regression introduced by a bug fix (for [BUG #557](https://github.com/rust-lang/regex/issues/557)) which could cause the regex engine to enter an infinite loop. This bug was originally [reported against ripgrep](https://github.com/BurntSushi/ripgrep/issues/1247).


1.1.5 (2019-04-01)
==================
This release fixes a bug in regex's dependency specification where it requires a newer version of regex-syntax, but this wasn't communicated correctly in the Cargo.toml. This would have been caught by a minimal version check, but this check was disabled because the `rand` crate itself advertises incorrect dependency specifications.

Bug fixes:

* [BUG #570](https://github.com/rust-lang/regex/pull/570): Fix regex-syntax minimal version.


1.1.4 (2019-03-31)
==================
This release fixes a backwards compatibility regression where Regex was no longer UnwindSafe.
This was caused by the upgrade to aho-corasick 0.7, whose AhoCorasick type was itself not UnwindSafe. This has been fixed in aho-corasick 0.7.4, which we now require.

Bug fixes:

* [BUG #568](https://github.com/rust-lang/regex/pull/568): Fix an API regression where Regex was no longer UnwindSafe.


1.1.3 (2019-03-30)
==================
This release fixes a few bugs and adds a performance improvement when a regex is a simple alternation of literals.

Performance improvements:

* [OPT #566](https://github.com/rust-lang/regex/pull/566): Upgrades `aho-corasick` to 0.7 and uses it for `foo|bar|...|quux` regexes.

Bug fixes:

* [BUG #527](https://github.com/rust-lang/regex/issues/527): Fix a bug where the parser would panic on patterns like `((?x))`.
* [BUG #555](https://github.com/rust-lang/regex/issues/555): Fix a bug where the parser would panic on patterns like `(?m){1,1}`.
* [BUG #557](https://github.com/rust-lang/regex/issues/557): Fix a bug where captures could lead to an incorrect match.


1.1.2 (2019-02-27)
==================
This release fixes a bug found in the fix introduced in 1.1.1.

Bug fixes:

* [BUG edf45e6f](https://github.com/rust-lang/regex/commit/edf45e6f): Fix bug introduced in reverse suffix literal matcher in the 1.1.1 release.


1.1.1 (2019-02-27)
==================
This is a small release with one fix for a bug caused by literal optimizations.

Bug fixes:

* [BUG 661bf53d](https://github.com/rust-lang/regex/commit/661bf53d): Fixes a bug in the reverse suffix literal optimization. This was originally reported [against ripgrep](https://github.com/BurntSushi/ripgrep/issues/1203).


1.1.0 (2018-11-30)
==================
This is a small release with a couple small enhancements. This release also increases the minimal supported Rust version (MSRV) to 1.24.1 (from 1.20.0). In accordance with this crate's MSRV policy, this release bumps the minor version number.

Performance improvements:

* [OPT #511](https://github.com/rust-lang/regex/pull/511), [OPT #540](https://github.com/rust-lang/regex/pull/540): Improve lazy DFA construction for large regex sets.

New features:

* [FEATURE #538](https://github.com/rust-lang/regex/pull/538): Add Emoji and "break" Unicode properties. See [UNICODE.md](UNICODE.md).

Bug fixes:

* [BUG #530](https://github.com/rust-lang/regex/pull/530): Add Unicode license (for data tables).
* Various typo/doc fixups.


1.0.6 (2018-11-06)
==================
This is a small release.

Performance improvements:

* [OPT #513](https://github.com/rust-lang/regex/pull/513): Improve performance of compiling large Unicode classes by 8-10%.

Bug fixes:

* [BUG #533](https://github.com/rust-lang/regex/issues/533): Fix definition of `[[:blank:]]` class that regressed in `regex-syntax 0.5`.


1.0.5 (2018-09-06)
==================
This is a small release with an API enhancement.

New features:

* [FEATURE #509](https://github.com/rust-lang/regex/pull/509): Generalize impls of the `Replacer` trait.


1.0.4 (2018-08-25)
==================
This is a small release that bumps the quickcheck dependency.


1.0.3 (2018-08-24)
==================
This is a small bug fix release.

Bug fixes:

* [BUG #504](https://github.com/rust-lang/regex/pull/504): Fix for Cargo's "minimal version" support.
* [BUG 1e39165f](https://github.com/rust-lang/regex/commit/1e39165f): Fix doc examples for byte regexes.


1.0.2 (2018-07-18)
==================
This release exposes some new lower level APIs on `Regex` that permit amortizing allocation and controlling the location at which a search is performed in a more granular way. Most users of the regex crate will not need or want to use these APIs.

New features:

* [FEATURE #493](https://github.com/rust-lang/regex/pull/493): Add a few lower level APIs for amortizing allocation and more fine-grained searching.

Bug fixes:

* [BUG 3981d2ad](https://github.com/rust-lang/regex/commit/3981d2ad): Correct outdated documentation on `RegexBuilder::dot_matches_new_line`.
* [BUG 7ebe4ae0](https://github.com/rust-lang/regex/commit/7ebe4ae0): Correct outdated documentation on `Parser::allow_invalid_utf8` in the `regex-syntax` crate.
* [BUG 24c7770b](https://github.com/rust-lang/regex/commit/24c7770b): Fix a bug in the HIR printer where it wouldn't correctly escape meta characters in character classes.
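
A sketch (not from the release notes) of this kind of lower-level searching, written against the current API surface; whether these exact names landed in #493 is an assumption:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"\d+").unwrap();
    // Start the search at byte offset 2 without slicing the haystack.
    let m = re.find_at("a1 b22", 2).unwrap();
    assert_eq!((m.start(), m.end()), (4, 6));
    // Reuse a single CaptureLocations allocation across many searches.
    let mut locs = re.capture_locations();
    assert!(re.captures_read(&mut locs, "x9").is_some());
    assert_eq!(locs.get(0), Some((1, 2)));
}
```
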

1.0.1 (2018-06-19)
==================
This release upgrades regex's Unicode tables to Unicode 11, and enables SIMD optimizations automatically on Rust stable (1.27 or newer).

New features:

* [FEATURE #486](https://github.com/rust-lang/regex/pull/486): Implement `size_hint` on `RegexSet` match iterators.
* [FEATURE #488](https://github.com/rust-lang/regex/pull/488): Update Unicode tables for Unicode 11.
* [FEATURE #490](https://github.com/rust-lang/regex/pull/490): SIMD optimizations are now enabled automatically in Rust stable, for versions 1.27 and up. No compilation flags or features need to be set. CPU support for SIMD is detected automatically at runtime.

Bug fixes:

* [BUG #482](https://github.com/rust-lang/regex/pull/482): Present a better compilation error when the `use_std` feature isn't used.


1.0.0 (2018-05-01)
==================
This release marks the 1.0 release of regex. While this release includes some breaking changes, most users of older versions of the regex library should be able to migrate to 1.0 by simply bumping the version number. The important changes are as follows:

* We adopt Rust 1.20 as the new minimum supported version of Rust for regex. We also tentatively adopt a policy that permits bumping the minimum supported version of Rust in minor version releases of regex, but no patch releases. That is, with respect to semver, we do not strictly consider bumping the minimum version of Rust to be a breaking change, but adopt a conservative stance as a compromise.
* Octal syntax in regular expressions has been disabled by default. This permits better error messages that inform users that backreferences aren't available. Octal syntax can be re-enabled via the corresponding option on `RegexBuilder` (a sketch follows this section).
* `(?-u:\B)` is no longer allowed in Unicode regexes since it can match at invalid UTF-8 code unit boundaries. `(?-u:\b)` is still allowed in Unicode regexes.
* The `From<regex_syntax::Error>` impl has been removed. This formally removes the public dependency on `regex-syntax`.
* A new feature, `use_std`, has been added and enabled by default. Disabling the feature will result in a compilation error. In the future, this may permit us to support `no_std` environments (w/ `alloc`) in a backwards compatible way.

For more information and discussion, please see [1.0 release tracking issue](https://github.com/rust-lang/regex/issues/457).
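
A sketch (not from the release notes) of re-enabling octal syntax via the builder:

```rust
use regex::RegexBuilder;

fn main() {
    // Octal escapes are rejected by default...
    assert!(regex::Regex::new(r"\123").is_err());
    // ...but can be re-enabled through RegexBuilder.
    let re = RegexBuilder::new(r"\123").octal(true).build().unwrap();
    // 0o123 == 83 == 'S'.
    assert!(re.is_match("S"));
}
```
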

0.2.11 (2018-05-01)
===================
This release primarily contains bug fixes. Some of them resolve bugs where the parser could panic.

New features:

* [FEATURE #459](https://github.com/rust-lang/regex/pull/459): Include C++'s standard regex library and Boost's regex library in the benchmark harness. We now include D/libphobos, C++/std, C++/boost, Oniguruma, PCRE1, PCRE2, RE2 and Tcl in the harness.

Bug fixes:

* [BUG #445](https://github.com/rust-lang/regex/issues/445): Clarify order of indices returned by RegexSet match iterator.
* [BUG #461](https://github.com/rust-lang/regex/issues/461): Improve error messages for invalid regexes like `[\d-a]`.
* [BUG #464](https://github.com/rust-lang/regex/issues/464): Fix a bug in the error message pretty printer that could cause a panic when a regex contained a literal `\n` character.
* [BUG #465](https://github.com/rust-lang/regex/issues/465): Fix a panic in the parser that was caused by applying a repetition operator to `(?flags)`.
* [BUG #466](https://github.com/rust-lang/regex/issues/466): Fix a bug where `\pC` was not recognized as an alias for `\p{Other}`.
* [BUG #470](https://github.com/rust-lang/regex/pull/470): Fix a bug where literal searches did more work than necessary for anchored regexes.


0.2.10 (2018-03-16)
===================
This release primarily updates the regex crate to changes made in `std::arch` on nightly Rust.

New features:

* [FEATURE #458](https://github.com/rust-lang/regex/pull/458): The `Hir` type in `regex-syntax` now has a printer.


0.2.9 (2018-03-12)
==================
This release introduces a new nightly only feature, `unstable`, which enables SIMD optimizations for certain types of regexes. No additional compile time options are necessary, and the regex crate will automatically choose the best CPU features at run time. As a result, the `simd` (nightly only) crate dependency has been dropped.

New features:

* [FEATURE #456](https://github.com/rust-lang/regex/pull/456): The regex crate now includes AVX2 optimizations in addition to the extant SSSE3 optimization.

Bug fixes:

* [BUG #455](https://github.com/rust-lang/regex/pull/455): Fix a bug where `(?x)[ / - ]` failed to parse.


0.2.8 (2018-03-12)
==================
Bug fixes:

* [BUG #454](https://github.com/rust-lang/regex/pull/454): Fix a bug in the nest limit checker being too aggressive.


0.2.7 (2018-03-07)
==================
This release includes a ground-up rewrite of the regex-syntax crate, which has been in development for over a year.

New features:

* Error messages for invalid regexes have been greatly improved. You get these automatically; you don't need to do anything. In addition to better formatting, error messages will now explicitly call out the use of look around. When regex 1.0 is released, this will happen for backreferences as well.
* Full support for intersection, difference and symmetric difference of character classes. These can be used via the `&&`, `--` and `~~` binary operators within classes (a sketch follows this list).
* A Unicode Level 1 conformant implementation of `\p{..}` character classes. Things like `\p{scx:Hira}`, `\p{age:3.2}` or `\p{Changes_When_Casefolded}` now work. All property name and value aliases are supported, and properties are selected via loose matching. e.g., `\p{Greek}` is the same as `\p{G r E e K}`.
* A new `UNICODE.md` document has been added to this repository that exhaustively documents support for UTS#18.
* Empty sub-expressions are now permitted in most places. That is, `()+` is now a valid regex.
* Almost everything in regex-syntax now uses constant stack space, even when performing analysis that requires structural induction. This reduces the risk of a user provided regular expression causing a stack overflow.
* [FEATURE #174](https://github.com/rust-lang/regex/issues/174): The `Ast` type in `regex-syntax` now contains span information.
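* [FEATURE #424](https://github.com/rust-lang/regex/issues/424): Support `\u`, `\u{...}`, `\U` and `\U{...}` syntax for specifying code points in a regular expression.
* [FEATURE #449](https://github.com/rust-lang/regex/pull/449): Add a `Replacer::by_ref` adapter for use of a replacer without consuming it.

A sketch (not from the release notes) of the class set operations and the `\u{...}` escapes:

```rust
use regex::Regex;

fn main() {
    // Intersection: Greek letters only.
    assert!(Regex::new(r"[\p{Greek}&&\pL]").unwrap().is_match("λ"));
    // Difference: decimal digits except 4.
    let re = Regex::new(r"[0-9--4]").unwrap();
    assert!(re.is_match("7") && !re.is_match("4"));
    // Code point escape: \u{03BB} is λ.
    assert!(Regex::new(r"\u{03BB}").unwrap().is_match("λ"));
}
```
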

Bug fixes:

* [BUG #446](https://github.com/rust-lang/regex/issues/446): We re-enable the Boyer-Moore literal matcher.


0.2.6 (2018-02-08)
==================
Bug fixes:

* [BUG #446](https://github.com/rust-lang/regex/issues/446): Fixes a bug in the new Boyer-Moore searcher that results in a match failure. We fix this bug by temporarily disabling Boyer-Moore.


0.2.5 (2017-12-30)
==================
Bug fixes:

* [BUG #437](https://github.com/rust-lang/regex/issues/437): Fixes a bug in the new Boyer-Moore searcher that results in a panic.


0.2.4 (2017-12-30)
==================
New features:

* [FEATURE #348](https://github.com/rust-lang/regex/pull/348): Improve performance for capture searches on anchored regex. (Contributed by @ethanpailes. Nice work!)
* [FEATURE #419](https://github.com/rust-lang/regex/pull/419): Expand literal searching to include Tuned Boyer-Moore in some cases. (Contributed by @ethanpailes. Nice work!)

Bug fixes:

* [BUG](https://github.com/rust-lang/regex/pull/436): The regex compiler plugin has been removed.
* [BUG](https://github.com/rust-lang/regex/pull/436): `simd` has been bumped to `0.2.1`, which fixes a Rust nightly build error.
* [BUG](https://github.com/rust-lang/regex/pull/436): Bring the benchmark harness up to date.


0.2.3 (2017-11-30)
==================
New features:

* [FEATURE #374](https://github.com/rust-lang/regex/pull/374): Add `impl<'t> From<Match<'t>> for &'t str`.
* [FEATURE #380](https://github.com/rust-lang/regex/pull/380): Derive `Clone` and `PartialEq` on `Error`.
* [FEATURE #400](https://github.com/rust-lang/regex/pull/400): Update to Unicode 10.

Bug fixes:

* [BUG #375](https://github.com/rust-lang/regex/issues/375): Fix a bug that prevented the bounded backtracker from terminating.
* [BUG #393](https://github.com/rust-lang/regex/issues/393), [BUG #394](https://github.com/rust-lang/regex/issues/394): Fix bug with `replace` methods for empty matches.


0.2.2 (2017-05-21)
==================
New features:

* [FEATURE #341](https://github.com/rust-lang/regex/issues/341): Support nested character classes and intersection operation. For example, `[\p{Greek}&&\pL]` matches Greek letters and `[[0-9]&&[^4]]` matches every decimal digit except `4`. (Much thanks to @robinst, who contributed this awesome feature.)

Bug fixes:

* [BUG #321](https://github.com/rust-lang/regex/issues/321): Fix bug in literal extraction and UTF-8 decoding.
* [BUG #326](https://github.com/rust-lang/regex/issues/326): Add documentation tip about the `(?x)` flag.
* [BUG #333](https://github.com/rust-lang/regex/issues/333): Show additional replacement example using curly braces.
* [BUG #334](https://github.com/rust-lang/regex/issues/334): Fix bug when resolving captures after a match.
* [BUG #338](https://github.com/rust-lang/regex/issues/338): Add example that uses `Captures::get` to API documentation.
* [BUG #353](https://github.com/rust-lang/regex/issues/353): Fix RegexSet bug that caused match failure in some cases.
* [BUG #354](https://github.com/rust-lang/regex/pull/354): Fix panic in parser when `(?x)` is used.
* [BUG #358](https://github.com/rust-lang/regex/issues/358): Fix literal optimization bug with RegexSet.
* [BUG #359](https://github.com/rust-lang/regex/issues/359): Fix example code in README.
* [BUG #365](https://github.com/rust-lang/regex/pull/365): Fix bug in `rure_captures_len` in the C binding. * [BUG #367](https://github.com/rust-lang/regex/issues/367): Fix byte class bug that caused a panic. 0.2.1 ===== One major bug with `replace_all` has been fixed along with a couple of other touch-ups. * [BUG #312](https://github.com/rust-lang/regex/issues/312): Fix documentation for `NoExpand` to reference correct lifetime parameter. * [BUG #314](https://github.com/rust-lang/regex/issues/314): Fix a bug with `replace_all` when replacing a match with the empty string. * [BUG #316](https://github.com/rust-lang/regex/issues/316): Note a missing breaking change from the `0.2.0` CHANGELOG entry. (`RegexBuilder::compile` was renamed to `RegexBuilder::build`.) * [BUG #324](https://github.com/rust-lang/regex/issues/324): Compiling `regex` should only require one version of the `memchr` crate. 0.2.0 ===== This is a new major release of the regex crate, and is an implementation of the [regex 1.0 RFC](https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md). We are releasing a `0.2` first, and if there are no major problems, we will release a `1.0` shortly. For `0.2`, the minimum *supported* Rust version is 1.12. There are a number of **breaking changes** in `0.2`. They are split into two types. The first type corresponds to breaking changes in regular expression syntax. The second type corresponds to breaking changes in the API. Breaking changes for regex syntax: * POSIX character classes now require double bracketing. Previously, the regex `[:upper:]` would parse as the `upper` POSIX character class. Now it parses as the character class containing the characters `:upper:`. The fix to this change is to use `[[:upper:]]` instead. Note that variants like `[[:upper:][:blank:]]` continue to work. * The character `[` must always be escaped inside a character class. * The characters `&`, `-` and `~` must be escaped if any one of them is repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all equivalent while `[&&]` is illegal. (The motivation for this and the prior change is to provide a backwards compatible path for adding character class set notation.) * A `bytes::Regex` now has Unicode mode enabled by default (like the main `Regex` type). This means regexes compiled with `bytes::Regex::new` that don't have the Unicode flag set should add `(?-u)` to recover the original behavior. Breaking changes for the regex API: * `find` and `find_iter` now **return `Match` values instead of `(usize, usize)`.** `Match` values have `start` and `end` methods, which return the match offsets. `Match` values also have an `as_str` method, which returns the text of the match itself. * The `Captures` type now only provides a single iterator over all capturing matches, which should replace uses of `iter` and `iter_pos`. Uses of `iter_named` should use the `capture_names` method on `Regex`. * The `at` method on the `Captures` type has been renamed to `get`, and it now returns a `Match`. Similarly, the `name` method on `Captures` now returns a `Match`. * The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant is returned when no replacements are made. * The `Replacer` trait has been completely overhauled. This should only impact clients that implement this trait explicitly. Standard uses of the `replace` methods should continue to work unchanged. If you implement the `Replacer` trait, please consult the new documentation. 
* The `quote` free function has been renamed to `escape`. * The `Regex::with_size_limit` method has been removed. It is replaced by `RegexBuilder::size_limit`. * The `RegexBuilder` type has switched from owned `self` method receivers to `&mut self` method receivers. Most uses will continue to work unchanged, but some code may require naming an intermediate variable to hold the builder. * The `compile` method on `RegexBuilder` has been renamed to `build`. * The free `is_match` function has been removed. It is replaced by compiling a `Regex` and calling its `is_match` method. * The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied on these impls, the fix is to define a wrapper type around `Regex`, impl `Deref` on it and provide the necessary impls. * The `is_empty` method on `Captures` has been removed. This always returns `false`, so its use is superfluous. * The `Syntax` variant of the `Error` type now contains a string instead of a `regex_syntax::Error`. If you were examining syntax errors more closely, you'll need to explicitly use the `regex_syntax` crate to re-parse the regex. * The `InvalidSet` variant of the `Error` type has been removed since it is no longer used. * Most of the iterator types have been renamed to match conventions. If you were using these iterator types explicitly, please consult the documentation for their new names. For example, `RegexSplits` has been renamed to `Split`. A number of bugs have been fixed: * [BUG #151](https://github.com/rust-lang/regex/issues/151): The `Replacer` trait has been changed to permit the caller to control allocation. * [BUG #165](https://github.com/rust-lang/regex/issues/165): Remove the free `is_match` function. * [BUG #166](https://github.com/rust-lang/regex/issues/166): Expose more knobs (available in `0.1`) and remove `with_size_limit`. * [BUG #168](https://github.com/rust-lang/regex/issues/168): Iterators produced by `Captures` now have the correct lifetime parameters. * [BUG #175](https://github.com/rust-lang/regex/issues/175): Fix a corner case in the parsing of POSIX character classes. * [BUG #178](https://github.com/rust-lang/regex/issues/178): Drop the `PartialEq` and `Eq` impls on `Regex`. * [BUG #179](https://github.com/rust-lang/regex/issues/179): Remove `is_empty` from `Captures` since it always returns false. * [BUG #276](https://github.com/rust-lang/regex/issues/276): Position of named capture can now be retrieved from a `Captures`. * [BUG #296](https://github.com/rust-lang/regex/issues/296): Remove winapi/kernel32-sys dependency on UNIX. * [BUG #307](https://github.com/rust-lang/regex/issues/307): Fix error on emscripten. 0.1.80 ====== * [PR #292](https://github.com/rust-lang/regex/pull/292): Fixes bug #291, which was introduced by PR #290. 0.1.79 ====== * Require regex-syntax 0.3.8. 0.1.78 ====== * [PR #290](https://github.com/rust-lang/regex/pull/290): Fixes bug #289, which caused some regexes with a certain combination of literals to match incorrectly. 0.1.77 ====== * [PR #281](https://github.com/rust-lang/regex/pull/281): Fixes bug #280 by disabling all literal optimizations when a pattern is partially anchored. 0.1.76 ====== * Tweak criteria for using the Teddy literal matcher. 0.1.75 ====== * [PR #275](https://github.com/rust-lang/regex/pull/275): Improves match verification performance in the Teddy SIMD searcher. * [PR #278](https://github.com/rust-lang/regex/pull/278): Replaces slow substring loop in the Teddy SIMD searcher with Aho-Corasick. 
* Implemented DoubleEndedIterator on regex set match iterators. 0.1.74 ====== * Release regex-syntax 0.3.5 with a minor bug fix. * Fix bug #272. * Fix bug #277. * [PR #270](https://github.com/rust-lang/regex/pull/270): Fixes bugs #264, #268 and an unreported bug where the DFA cache size could be drastically underestimated in some cases (leading to high unexpected memory usage). 0.1.73 ====== * Release `regex-syntax 0.3.4`. * Bump `regex-syntax` dependency version for `regex` to `0.3.4`. 0.1.72 ====== * [PR #262](https://github.com/rust-lang/regex/pull/262): Fixes a number of small bugs caught by fuzz testing (AFL). 0.1.71 ====== * [PR #236](https://github.com/rust-lang/regex/pull/236): Fix a bug in how suffix literals were extracted, which could lead to invalid match behavior in some cases. 0.1.70 ====== * [PR #231](https://github.com/rust-lang/regex/pull/231): Add SIMD accelerated multiple pattern search. * [PR #228](https://github.com/rust-lang/regex/pull/228): Reintroduce the reverse suffix literal optimization. * [PR #226](https://github.com/rust-lang/regex/pull/226): Implements NFA state compression in the lazy DFA. * [PR #223](https://github.com/rust-lang/regex/pull/223): A fully anchored RegexSet can now short-circuit. 0.1.69 ====== * [PR #216](https://github.com/rust-lang/regex/pull/216): Tweak the threshold for running backtracking. * [PR #217](https://github.com/rust-lang/regex/pull/217): Add upper limit (from the DFA) to capture search (for the NFA). * [PR #218](https://github.com/rust-lang/regex/pull/218): Add rure, a C API. 0.1.68 ====== * [PR #210](https://github.com/rust-lang/regex/pull/210): Fixed a performance bug in `bytes::Regex::replace` where `extend` was used instead of `extend_from_slice`. * [PR #211](https://github.com/rust-lang/regex/pull/211): Fixed a bug in the handling of word boundaries in the DFA. * [PR #213](https://github.com/rust-lang/regex/pull/213): Added RE2 and Tcl to the benchmark harness. Also added a CLI utility for running regexes using any of the following regex engines: PCRE1, PCRE2, Oniguruma, RE2, Tcl and of course Rust's own regexes. 0.1.67 ====== * [PR #201](https://github.com/rust-lang/regex/pull/201): Fix undefined behavior in the `regex!` compiler plugin macro. * [PR #205](https://github.com/rust-lang/regex/pull/205): More improvements to DFA performance. Competitive with RE2. See PR for benchmarks. * [PR #209](https://github.com/rust-lang/regex/pull/209): Release 0.1.66 was semver incompatible since it required a newer version of Rust than previous releases. This PR fixes that. (And `0.1.66` was yanked.) 0.1.66 ====== * Speculative support for Unicode word boundaries was added to the DFA. This should remove the last common case that disqualified use of the DFA. * An optimization that scanned for suffix literals and then matched the regular expression in reverse was removed because it had worst case quadratic time complexity. It was replaced with a more limited optimization where, given any regex of the form `re$`, it will be matched in reverse from the end of the haystack. * [PR #202](https://github.com/rust-lang/regex/pull/202): The inner loop of the DFA was heavily optimized to improve cache locality and reduce the overall number of instructions run on each iteration. This represents the first use of `unsafe` in `regex` (to elide bounds checks). 
* [PR #200](https://github.com/rust-lang/regex/pull/200): Use of the `mempool` crate (which used thread local storage) was replaced with a faster version of a similar API in @Amanieu's `thread_local` crate. It should reduce contention when using a regex from multiple threads simultaneously. * PCRE2 JIT benchmarks were added. A benchmark comparison can be found [here](https://gist.github.com/anonymous/14683c01993e91689f7206a18675901b). (Includes a comparison with PCRE1's JIT and Oniguruma.) * A bug where word boundaries weren't being matched correctly in the DFA was fixed. This only affected use of `bytes::Regex`. * [#160](https://github.com/rust-lang/regex/issues/160): `Captures` now has a `Debug` impl. regex-1.12.2/Cargo.lock0000644000000230520000000000100102060ustar # This file is automatically @generated by Cargo. # It is not intended for manual editing. version = 3 [[package]] name = "aho-corasick" version = "1.1.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "8e60d3430d3a69478ad0993f19238d2df97c507009a52b3c10addcd7f6bcb916" dependencies = [ "log", "memchr", ] [[package]] name = "anyhow" version = "1.0.100" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a23eb6b1614318a8071c9b2521f36b424b2c83db5eb3a0fead4a6c0809af6e61" [[package]] name = "atty" version = "0.2.14" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d9b39be18770d11421cdb1b9947a45dd3f37e93092cbf377614828a319d5fee8" dependencies = [ "hermit-abi", "libc", "winapi", ] [[package]] name = "bstr" version = "1.12.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "234113d19d0d7d613b40e86fb654acf958910802bcceab913a4f9e7cda03b1a4" dependencies = [ "memchr", "serde", ] [[package]] name = "cfg-if" version = "1.0.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2fd1289c04a9ea8cb22300a459a72a385d7c73d3259e2ed7dcb2af674838cfa9" [[package]] name = "doc-comment" version = "0.3.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "fea41bba32d969b513997752735605054bc0dfa92b4c56bf1189f2e174be7a10" [[package]] name = "env_logger" version = "0.9.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a12e6657c4c97ebab115a42dcee77225f7f482cdd841cf7088c657a42e9e00e7" dependencies = [ "atty", "humantime", "log", "termcolor", ] [[package]] name = "equivalent" version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f" [[package]] name = "getrandom" version = "0.2.16" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "335ff9f135e4384c8150d6f27c6daed433577f86b4750418338c01a1a2528592" dependencies = [ "cfg-if", "libc", "wasi", ] [[package]] name = "hashbrown" version = "0.16.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5419bdc4f6a9207fbeba6d11b604d481addf78ecd10c11ad51e76c2f6482748d" [[package]] name = "hermit-abi" version = "0.1.19" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33" dependencies = [ "libc", ] [[package]] name = "humantime" version = "2.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "135b12329e5e3ce057a9f972339ea52bc954fe1e9358ef27f95e89716fbc5424" [[package]] name = "indexmap" version = "2.11.4" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "4b0f83760fb341a774ed326568e19f5a863af4a952def8c39f9ab92fd95b88e5" dependencies = [ "equivalent", "hashbrown", ] [[package]] name = "libc" version = "0.2.177" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2874a2af47a2325c2001a6e6fad9b16a53b802102b528163885171cf92b15976" [[package]] name = "log" version = "0.4.28" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "34080505efa8e45a4b816c349525ebe327ceaa8559756f0356cba97ef3bf7432" [[package]] name = "memchr" version = "2.7.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f52b00d39961fc5b2736ea853c9cc86238e165017a493d1d5c8eac6bdc4cc273" dependencies = [ "log", ] [[package]] name = "proc-macro2" version = "1.0.101" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "89ae43fd86e4158d6db51ad8e2b80f313af9cc74f5c0e03ccb87de09998732de" dependencies = [ "unicode-ident", ] [[package]] name = "quickcheck" version = "1.0.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "588f6378e4dd99458b60ec275b4477add41ce4fa9f64dcba6f15adccb19b50d6" dependencies = [ "rand", ] [[package]] name = "quote" version = "1.0.41" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ce25767e7b499d1b604768e7cde645d14cc8584231ea6b295e9c9eb22c02e1d1" dependencies = [ "proc-macro2", ] [[package]] name = "rand" version = "0.8.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404" dependencies = [ "rand_core", ] [[package]] name = "rand_core" version = "0.6.4" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c" dependencies = [ "getrandom", ] [[package]] name = "regex" version = "1.12.2" dependencies = [ "aho-corasick", "anyhow", "doc-comment", "env_logger", "memchr", "quickcheck", "regex-automata", "regex-syntax", "regex-test", ] [[package]] name = "regex-automata" version = "0.4.13" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5276caf25ac86c8d810222b3dbb938e512c55c6831a10f3e6ed1c93b84041f1c" dependencies = [ "aho-corasick", "log", "memchr", "regex-syntax", ] [[package]] name = "regex-syntax" version = "0.8.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7a2d987857b319362043e95f5353c0535c1f58eec5336fdfcf626430af7def58" [[package]] name = "regex-test" version = "0.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "da40f0939bc4c598b4326abdbb363a8987aa43d0526e5624aefcf3ed90344e62" dependencies = [ "anyhow", "bstr", "serde", "toml", ] [[package]] name = "serde" version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e" dependencies = [ "serde_core", "serde_derive", ] [[package]] name = "serde_core" version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad" dependencies = [ "serde_derive", ] [[package]] name = "serde_derive" version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79" dependencies = [ "proc-macro2", "quote", "syn", ] [[package]] name = "serde_spanned" version = 
"0.6.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "bf41e0cfaf7226dca15e8197172c295a782857fcb97fad1808a166870dee75a3" dependencies = [ "serde", ] [[package]] name = "syn" version = "2.0.106" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ede7c438028d4436d71104916910f5bb611972c5cfd7f89b8300a8186e6fada6" dependencies = [ "proc-macro2", "quote", "unicode-ident", ] [[package]] name = "termcolor" version = "1.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "06794f8f6c5c898b3275aebefa6b8a1cb24cd2c6c79397ab15774837a0bc5755" dependencies = [ "winapi-util", ] [[package]] name = "toml" version = "0.8.23" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dc1beb996b9d83529a9e75c17a1686767d148d70663143c7854d8b4a09ced362" dependencies = [ "serde", "serde_spanned", "toml_datetime", "toml_edit", ] [[package]] name = "toml_datetime" version = "0.6.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "22cddaf88f4fbc13c51aebbf5f8eceb5c7c5a9da2ac40a13519eb5b0a0e8f11c" dependencies = [ "serde", ] [[package]] name = "toml_edit" version = "0.22.27" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "41fe8c660ae4257887cf66394862d21dbca4a6ddd26f04a3560410406a2f819a" dependencies = [ "indexmap", "serde", "serde_spanned", "toml_datetime", "winnow", ] [[package]] name = "unicode-ident" version = "1.0.19" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f63a545481291138910575129486daeaf8ac54aee4387fe7906919f7830c7d9d" [[package]] name = "wasi" version = "0.11.1+wasi-snapshot-preview1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b" [[package]] name = "winapi" version = "0.3.9" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" dependencies = [ "winapi-i686-pc-windows-gnu", "winapi-x86_64-pc-windows-gnu", ] [[package]] name = "winapi-i686-pc-windows-gnu" version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" [[package]] name = "winapi-util" version = "0.1.11" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" dependencies = [ "windows-sys", ] [[package]] name = "winapi-x86_64-pc-windows-gnu" version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" [[package]] name = "windows-link" version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5" [[package]] name = "windows-sys" version = "0.61.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc" dependencies = [ "windows-link", ] [[package]] name = "winnow" version = "0.7.13" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "21a0236b59786fed61e2a80582dd500fe61f18b5dca67a4a067d0bc9039339cf" dependencies = [ "memchr", ] regex-1.12.2/Cargo.toml0000644000000100500000000000100102230ustar # THIS FILE IS AUTOMATICALLY GENERATED BY CARGO # # When uploading crates to the registry Cargo 
will automatically # "normalize" Cargo.toml files for maximal compatibility # with all versions of Cargo and also rewrite `path` dependencies # to registry (e.g., crates.io) dependencies. # # If you are reading this file be aware that the original Cargo.toml # will likely look very different (and much more reasonable). # See Cargo.toml.orig for the original contents. [package] edition = "2021" rust-version = "1.65" name = "regex" version = "1.12.2" authors = [ "The Rust Project Developers", "Andrew Gallant ", ] build = false exclude = [ "/fuzz/*", "/record/*", "/scripts/*", "tests/fuzz/*", "/.github/*", ] autolib = false autobins = false autoexamples = false autotests = false autobenches = false description = """ An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs. """ homepage = "https://github.com/rust-lang/regex" documentation = "https://docs.rs/regex" readme = "README.md" categories = ["text-processing"] license = "MIT OR Apache-2.0" repository = "https://github.com/rust-lang/regex" [package.metadata.docs.rs] all-features = true rustdoc-args = [ "--cfg", "docsrs_regex", ] [features] default = [ "std", "perf", "unicode", "regex-syntax/default", ] logging = [ "aho-corasick?/logging", "memchr?/logging", "regex-automata/logging", ] pattern = [] perf = [ "perf-cache", "perf-dfa", "perf-onepass", "perf-backtrack", "perf-inline", "perf-literal", ] perf-backtrack = ["regex-automata/nfa-backtrack"] perf-cache = [] perf-dfa = ["regex-automata/hybrid"] perf-dfa-full = [ "regex-automata/dfa-build", "regex-automata/dfa-search", ] perf-inline = ["regex-automata/perf-inline"] perf-literal = [ "dep:aho-corasick", "dep:memchr", "regex-automata/perf-literal", ] perf-onepass = ["regex-automata/dfa-onepass"] std = [ "aho-corasick?/std", "memchr?/std", "regex-automata/std", "regex-syntax/std", ] unicode = [ "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment", "regex-automata/unicode", "regex-syntax/unicode", ] unicode-age = [ "regex-automata/unicode-age", "regex-syntax/unicode-age", ] unicode-bool = [ "regex-automata/unicode-bool", "regex-syntax/unicode-bool", ] unicode-case = [ "regex-automata/unicode-case", "regex-syntax/unicode-case", ] unicode-gencat = [ "regex-automata/unicode-gencat", "regex-syntax/unicode-gencat", ] unicode-perl = [ "regex-automata/unicode-perl", "regex-automata/unicode-word-boundary", "regex-syntax/unicode-perl", ] unicode-script = [ "regex-automata/unicode-script", "regex-syntax/unicode-script", ] unicode-segment = [ "regex-automata/unicode-segment", "regex-syntax/unicode-segment", ] unstable = ["pattern"] use_std = ["std"] [lib] name = "regex" path = "src/lib.rs" [[test]] name = "integration" path = "tests/lib.rs" [dependencies.aho-corasick] version = "1.0.0" optional = true default-features = false [dependencies.memchr] version = "2.6.0" optional = true default-features = false [dependencies.regex-automata] version = "0.4.12" features = [ "alloc", "syntax", "meta", "nfa-pikevm", ] default-features = false [dependencies.regex-syntax] version = "0.8.5" default-features = false [dev-dependencies.anyhow] version = "1.0.69" [dev-dependencies.doc-comment] version = "0.3" [dev-dependencies.env_logger] version = "0.9.3" features = [ "atty", "humantime", "termcolor", ] default-features = false [dev-dependencies.quickcheck] version = "1.0.3" default-features = false [dev-dependencies.regex-test] version = "0.1.0" 
[lints.rust.unexpected_cfgs] level = "allow" priority = 0 check-cfg = ["cfg(docsrs_regex)"] [profile.bench] debug = 2 [profile.dev] opt-level = 3 debug = 2 [profile.release] debug = 2 [profile.test] opt-level = 3 debug = 2 regex-1.12.2/Cargo.toml.orig000064400000000000000000000211461046102023000137140ustar 00000000000000[package] name = "regex" version = "1.12.2" #:version authors = ["The Rust Project Developers", "Andrew Gallant "] license = "MIT OR Apache-2.0" readme = "README.md" repository = "https://github.com/rust-lang/regex" documentation = "https://docs.rs/regex" homepage = "https://github.com/rust-lang/regex" description = """ An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs. """ categories = ["text-processing"] autotests = false exclude = ["/fuzz/*", "/record/*", "/scripts/*", "tests/fuzz/*", "/.github/*"] edition = "2021" rust-version = "1.65" [workspace] members = [ "regex-automata", "regex-capi", "regex-cli", "regex-lite", "regex-syntax", "regex-test", ] # Features are documented in the "Crate features" section of the crate docs: # https://docs.rs/regex/*/#crate-features [features] default = ["std", "perf", "unicode", "regex-syntax/default"] # ECOSYSTEM FEATURES # The 'std' feature permits the regex crate to use the standard library. This # is intended to support future use cases where the regex crate may be able # to compile without std, and instead just rely on 'core' and 'alloc' (for # example). Currently, this isn't supported, and removing the 'std' feature # will prevent regex from compiling. std = [ "aho-corasick?/std", "memchr?/std", "regex-automata/std", "regex-syntax/std", ] # This feature enables the 'log' crate to emit messages. This is usually # only useful for folks working on the regex crate itself, but can be useful # if you're trying hard to do some performance hacking on regex patterns # themselves. Note that you'll need to pair this with a crate like 'env_logger' # to actually emit the log messages somewhere. logging = [ "aho-corasick?/logging", "memchr?/logging", "regex-automata/logging", ] # The 'use_std' feature is DEPRECATED. It will be removed in regex 2. Until # then, it is an alias for the 'std' feature. use_std = ["std"] # PERFORMANCE FEATURES # Enables all default performance features. Note that this specifically does # not include perf-dfa-full, because it leads to higher compile times and # bigger binaries, and the runtime performance improvement is not obviously # worth it. perf = [ "perf-cache", "perf-dfa", "perf-onepass", "perf-backtrack", "perf-inline", "perf-literal", ] # Enables use of a lazy DFA when possible. perf-dfa = ["regex-automata/hybrid"] # Enables use of a fully compiled DFA when possible. perf-dfa-full = ["regex-automata/dfa-build", "regex-automata/dfa-search"] # Enables use of the one-pass regex matcher, which speeds up capture searches # even beyond the backtracker. perf-onepass = ["regex-automata/dfa-onepass"] # Enables use of a bounded backtracker, which speeds up capture searches. perf-backtrack = ["regex-automata/nfa-backtrack"] # Enables aggressive use of inlining. perf-inline = ["regex-automata/perf-inline"] # Enables literal optimizations. perf-literal = [ "dep:aho-corasick", "dep:memchr", "regex-automata/perf-literal", ] # Enables fast caching. (If disabled, caching is still used, but is slower.) # Currently, this feature has no effect. 
It used to remove the thread_local # dependency and use a slower internal cache, but now the default cache has # been improved and thread_local is no longer a dependency at all. perf-cache = [] # UNICODE DATA FEATURES # Enables all Unicode features. This expands if new Unicode features are added. unicode = [ "unicode-age", "unicode-bool", "unicode-case", "unicode-gencat", "unicode-perl", "unicode-script", "unicode-segment", "regex-automata/unicode", "regex-syntax/unicode", ] # Enables use of the `Age` property, e.g., `\p{Age:3.0}`. unicode-age = [ "regex-automata/unicode-age", "regex-syntax/unicode-age", ] # Enables use of a smattering of boolean properties, e.g., `\p{Emoji}`. unicode-bool = [ "regex-automata/unicode-bool", "regex-syntax/unicode-bool", ] # Enables Unicode-aware case insensitive matching, e.g., `(?i)ฮฒ`. unicode-case = [ "regex-automata/unicode-case", "regex-syntax/unicode-case", ] # Enables Unicode general categories, e.g., `\p{Letter}` or `\pL`. unicode-gencat = [ "regex-automata/unicode-gencat", "regex-syntax/unicode-gencat", ] # Enables Unicode-aware Perl classes corresponding to `\w`, `\s` and `\d`. unicode-perl = [ "regex-automata/unicode-perl", "regex-automata/unicode-word-boundary", "regex-syntax/unicode-perl", ] # Enables Unicode scripts and script extensions, e.g., `\p{Greek}`. unicode-script = [ "regex-automata/unicode-script", "regex-syntax/unicode-script", ] # Enables Unicode segmentation properties, e.g., `\p{gcb=Extend}`. unicode-segment = [ "regex-automata/unicode-segment", "regex-syntax/unicode-segment", ] # UNSTABLE FEATURES (requires Rust nightly) # A blanket feature that governs whether unstable features are enabled or not. # Unstable features are disabled by default, and typically rely on unstable # features in rustc itself. unstable = ["pattern"] # Enable to use the unstable pattern traits defined in std. This is enabled # by default if the unstable feature is enabled. pattern = [] # For very fast multi-prefix literal matching. [dependencies.aho-corasick] version = "1.0.0" optional = true default-features = false # For skipping along search text quickly when a leading byte is known. [dependencies.memchr] version = "2.6.0" optional = true default-features = false # For the actual regex engines. [dependencies.regex-automata] path = "regex-automata" version = "0.4.12" default-features = false features = ["alloc", "syntax", "meta", "nfa-pikevm"] # For parsing regular expressions. [dependencies.regex-syntax] path = "regex-syntax" version = "0.8.5" default-features = false [dev-dependencies] # For property based tests. quickcheck = { version = "1.0.3", default-features = false } # To check README's example doc-comment = "0.3" # For easy error handling in integration tests. anyhow = "1.0.69" # A library for testing regex engines. regex-test = { path = "regex-test", version = "0.1.0" } [dev-dependencies.env_logger] # Note that this is currently using an older version because of the dependency # tree explosion that happened in 0.10. version = "0.9.3" default-features = false features = ["atty", "humantime", "termcolor"] # This test suite reads a whole boatload of tests from the top-level testdata # directory, and then runs them against the regex crate API. # # regex-automata has its own version of them, and runs them against each # internal regex engine individually. 
# # This means that if you're seeing a failure in this test suite, you should # try running regex-automata's tests: # # cargo test --manifest-path regex-automata/Cargo.toml --test integration # # That *might* give you a more targeted test failure. i.e., "only the # PikeVM fails this test." Which gives you a narrower place to search. If # regex-automata's test suite passes, then the bug might be in the integration # of the regex crate and regex-automata. But generally speaking, a failure # in this test suite *should* mean there is a corresponding failure in # regex-automata's test suite. [[test]] path = "tests/lib.rs" name = "integration" [package.metadata.docs.rs] # We want to document all features. all-features = true # Since this crate's feature setup is pretty complicated, it is worth opting # into a nightly unstable option to show the features that need to be enabled # for public API items. To do that, we set 'docsrs_regex', and when that's # enabled, we enable the 'doc_cfg' feature. # # To test this locally, run: # # RUSTDOCFLAGS="--cfg docsrs_regex" cargo +nightly doc --all-features # # Note that we use `docsrs_regex` instead of the more standard `docsrs` because # other crates use that same `cfg` knob. And since we are enabling a nightly # feature, they sometimes break. By using our "own" `cfg` knob, we are closer # to being masters of our own destiny. rustdoc-args = ["--cfg", "docsrs_regex"] # This squashes the (AFAIK) erroneous warning that `docsrs_regex` is not a # valid `cfg` knob. [lints.rust] unexpected_cfgs = { level = "allow", check-cfg = ['cfg(docsrs_regex)'] } [profile.release] debug = true [profile.bench] debug = true [profile.dev] # Running tests takes too long in debug mode, so we forcefully always build # with optimizations. Unfortunate, but, ยฏ\_(ใƒ„)_/ยฏ. # # It's counter-intuitive that this needs to be set on dev *and* test, but # it's because the tests that take a long time to run are run as integration # tests in a separate crate. The test.opt-level setting won't apply there, so # we need to set the opt-level across the entire build. opt-level = 3 debug = true [profile.test] opt-level = 3 debug = true regex-1.12.2/Cross.toml000064400000000000000000000001601046102023000130040ustar 00000000000000[build.env] passthrough = [ "RUST_BACKTRACE", "RUST_LOG", "REGEX_TEST", "REGEX_TEST_VERBOSE", ] regex-1.12.2/LICENSE-APACHE000064400000000000000000000251371046102023000127550ustar 00000000000000 Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. 
"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. regex-1.12.2/LICENSE-MIT000064400000000000000000000020571046102023000124610ustar 00000000000000Copyright (c) 2014 The Rust Project Developers Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. regex-1.12.2/README.md000064400000000000000000000275741046102023000123110ustar 00000000000000regex ===== This crate provides routines for searching strings for matches of a [regular expression] (aka "regex"). The regex syntax supported by this crate is similar to other regex engines, but it lacks several features that are not known how to implement efficiently. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case `O(m * n)` time complexity, where `m` is proportional to the size of the regex and `n` is proportional to the size of the string being searched. [regular expression]: https://en.wikipedia.org/wiki/Regular_expression [![Build status](https://github.com/rust-lang/regex/workflows/ci/badge.svg)](https://github.com/rust-lang/regex/actions) [![Crates.io](https://img.shields.io/crates/v/regex.svg)](https://crates.io/crates/regex) ### Documentation [Module documentation with examples](https://docs.rs/regex). The module documentation also includes a comprehensive description of the syntax supported. Documentation with examples for the various matching functions and iterators can be found on the [`Regex` type](https://docs.rs/regex/*/regex/struct.Regex.html). ### Usage To bring this crate into your repository, either add `regex` to your `Cargo.toml`, or run `cargo add regex`. Here's a simple example that matches a date in YYYY-MM-DD format and prints the year, month and day: ```rust use regex::Regex; fn main() { let re = Regex::new(r"(?x) (?P<year>\d{4}) # the year - (?P<month>\d{2}) # the month - (?P<day>\d{2}) # the day ").unwrap(); let caps = re.captures("2010-03-14").unwrap(); assert_eq!("2010", &caps["year"]); assert_eq!("03", &caps["month"]); assert_eq!("14", &caps["day"]); } ``` If you have lots of dates in text that you'd like to iterate over, then it's easy to adapt the above example with an iterator: ```rust use regex::Regex; fn main() { let re = Regex::new(r"(\d{4})-(\d{2})-(\d{2})").unwrap(); let hay = "On 2010-03-14, foo happened. On 2014-10-14, bar happened."; let mut dates = vec![]; for (_, [year, month, day]) in re.captures_iter(hay).map(|c| c.extract()) { dates.push((year, month, day)); } assert_eq!(dates, vec![ ("2010", "03", "14"), ("2014", "10", "14"), ]); } ``` ### Usage: Avoid compiling the same regex in a loop It is an anti-pattern to compile the same regular expression in a loop since compilation is typically expensive. (It takes anywhere from a few microseconds to a few **milliseconds** depending on the size of the regex.) Not only is compilation itself expensive, but this also prevents optimizations that reuse allocations internally to the matching engines. In Rust, it can sometimes be a pain to pass regular expressions around if they're used from inside a helper function. Instead, we recommend using [`std::sync::LazyLock`], or the [`once_cell`] crate, if you can't use the standard library. 
This example shows how to use `std::sync::LazyLock`: ```rust use std::sync::LazyLock; use regex::Regex; fn some_helper_function(haystack: &str) -> bool { static RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"...").unwrap()); RE.is_match(haystack) } fn main() { assert!(some_helper_function("abc")); assert!(!some_helper_function("ac")); } ``` Specifically, in this example, the regex will be compiled when it is used for the first time. On subsequent uses, it will reuse the previous compilation. [`std::sync::LazyLock`]: https://doc.rust-lang.org/std/sync/struct.LazyLock.html [`once_cell`]: https://crates.io/crates/once_cell ### Usage: match regular expressions on `&[u8]` The main API of this crate (`regex::Regex`) requires the caller to pass a `&str` for searching. In Rust, an `&str` is required to be valid UTF-8, which means the main API can't be used for searching arbitrary bytes. To match on arbitrary bytes, use the `regex::bytes::Regex` API. The API is identical to the main API, except that it takes an `&[u8]` to search on instead of an `&str`. The `&[u8]` APIs also permit disabling Unicode mode in the regex even when the pattern would match invalid UTF-8. For example, `(?-u:.)` is not allowed in `regex::Regex` but is allowed in `regex::bytes::Regex` since `(?-u:.)` matches any byte except for `\n`. Conversely, `.` will match the UTF-8 encoding of any Unicode scalar value except for `\n`. This example shows how to find all null-terminated strings in a slice of bytes: ```rust use regex::bytes::Regex; let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap(); let text = b"foo\xFFbar\x00baz\x00"; // Extract all of the strings without the null terminator from each match. // The unwrap is OK here since a match requires the `cstr` capture to match. let cstrs: Vec<&[u8]> = re.captures_iter(text) .map(|c| c.name("cstr").unwrap().as_bytes()) .collect(); assert_eq!(vec![&b"foo\xFFbar"[..], &b"baz"[..]], cstrs); ``` Notice here that the `[^\x00]+` will match any *byte* except for `NUL`, including bytes like `\xFF` which are not valid UTF-8. When using the main API, `[^\x00]+` would instead match any valid UTF-8 sequence except for `NUL`. ### Usage: match multiple regular expressions simultaneously This demonstrates how to use a `RegexSet` to match multiple (possibly overlapping) regular expressions in a single scan of the search text: ```rust use regex::RegexSet; let set = RegexSet::new(&[ r"\w+", r"\d+", r"\pL+", r"foo", r"bar", r"barfoo", r"foobar", ]).unwrap(); // Iterate over and collect all of the matches. let matches: Vec<_> = set.matches("foobar").into_iter().collect(); assert_eq!(matches, vec![0, 2, 3, 4, 6]); // You can also test whether a particular regex matched: let matches = set.matches("foobar"); assert!(!matches.matched(5)); assert!(matches.matched(6)); ``` ### Usage: regex internals as a library The [`regex-automata` directory](./regex-automata/) contains a crate that exposes all the internal matching engines used by the `regex` crate. The idea is that the `regex` crate exposes a simple API for 99% of use cases, but `regex-automata` exposes oodles of customizable behaviors. [Documentation for `regex-automata`.](https://docs.rs/regex-automata) ### Usage: a regular expression parser This repository contains a crate that provides a well tested regular expression parser, abstract syntax and a high-level intermediate representation for convenient analysis. It provides no facilities for compilation or execution. 
This may be useful if you're implementing your own regex engine or otherwise need to do analysis on the syntax of a regular expression. It is otherwise not recommended for general use. [Documentation for `regex-syntax`.](https://docs.rs/regex-syntax) ### Crate features This crate comes with several features that permit tweaking the trade-off between binary size, compilation time and runtime performance. Users of this crate can selectively disable Unicode tables, or choose from a variety of optimizations performed by this crate to disable. When all of these features are disabled, runtime match performance may be much worse, but if you're matching on short strings, or if high performance isn't necessary, then such a configuration is perfectly serviceable. To disable all such features, use the following `Cargo.toml` dependency configuration: ```toml [dependencies.regex] version = "1.3" default-features = false # Unless you have a specific reason not to, it's good sense to enable standard # library support. It enables several optimizations and avoids spin locks. It # also shouldn't meaningfully impact compile times or binary size. features = ["std"] ``` This will reduce the dependency tree of `regex` down to two crates: `regex-syntax` and `regex-automata`. The full set of features one can disable is [in the "Crate features" section of the documentation](https://docs.rs/regex/1.*/#crate-features). ### Performance One of the goals of this crate is for the regex engine to be "fast." While that is a somewhat nebulous goal, it is usually interpreted in one of two ways. First, it means that all searches take worst case `O(m * n)` time, where `m` is proportional to `len(regex)` and `n` is proportional to `len(haystack)`. Second, it means that even aside from the time complexity constraint, regex searches are "fast" in practice. While the first interpretation is pretty unambiguous, the second one remains nebulous. While nebulous, it guides this crate's architecture and the sorts of trade-offs it makes. For example, here are some general architectural statements that follow as a result of the goal to be "fast": * When given the choice between faster regex searches and faster _Rust compile times_, this crate will generally choose faster regex searches. * When given the choice between faster regex searches and faster _regex compile times_, this crate will generally choose faster regex searches. That is, it is generally acceptable for `Regex::new` to get a little slower if it means that searches get faster. (This is a somewhat delicate balance to strike, because the speed of `Regex::new` needs to remain somewhat reasonable. But this is why one should avoid re-compiling the same regex over and over again.) * When given the choice between faster regex searches and simpler API design, this crate will generally choose faster regex searches. For example, if one didn't care about performance, we could likely get rid of both of the `Regex::is_match` and `Regex::find` APIs and instead just rely on `Regex::captures`. (A sketch below makes this distinction concrete.) There are perhaps more ways that being "fast" influences things. While this repository used to provide its own benchmark suite, it has since been moved to [rebar](https://github.com/BurntSushi/rebar). The benchmarks are quite extensive, and there are many more than what is shown in rebar's README (which is just limited to a "curated" set meant to compare performance between regex engines). 
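To make the `is_match`/`find`/`captures` distinction above concrete, here is a minimal sketch; the pattern and haystack are illustrative inventions, not taken from this crate's documentation:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"(\w+)@(\w+)").unwrap();
    let hay = "send mail to user@example";

    // Cheapest question: is there a match at all? The engine can stop
    // as soon as a match is known to exist.
    assert!(re.is_match(hay));

    // Next: where is the match? This resolves offsets, but no groups.
    let m = re.find(hay).unwrap();
    assert_eq!((m.start(), m.end()), (13, 25));

    // Most expensive: resolve every capture group in the match.
    let caps = re.captures(hay).unwrap();
    assert_eq!(&caps[1], "user");
    assert_eq!(&caps[2], "example");
}
```

When a yes/no answer suffices, preferring `is_match` gives the engine the most latitude to take shortcuts internally.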
To run all of this crate's benchmarks, first start by cloning and installing `rebar`: ```text $ git clone https://github.com/BurntSushi/rebar $ cd rebar $ cargo install --path ./ ``` Then build the benchmark harness for just this crate: ```text $ rebar build -e '^rust/regex$' ``` Run all benchmarks for this crate as tests (each benchmark is executed once to ensure it works): ```text $ rebar measure -e '^rust/regex$' -t ``` Record measurements for all benchmarks and save them to a CSV file: ```text $ rebar measure -e '^rust/regex$' | tee results.csv ``` Explore benchmark timings: ```text $ rebar cmp results.csv ``` See the `rebar` documentation for more details on how it works and how to compare results with other regex engines. ### Hacking The `regex` crate is, for the most part, a pretty thin wrapper around the [`meta::Regex`](https://docs.rs/regex-automata/latest/regex_automata/meta/struct.Regex.html) from the [`regex-automata` crate](https://docs.rs/regex-automata/latest/regex_automata/). Therefore, if you're looking to work on the internals of this crate, you'll likely either want to look in `regex-syntax` (for parsing) or `regex-automata` (for construction of finite automata and the search routines). My [blog on regex internals](https://burntsushi.net/regex-internals/) goes into more depth. ### Minimum Rust version policy This crate's minimum supported `rustc` version is `1.65.0`. The policy is that the minimum Rust version required to use this crate can be increased in minor version updates. For example, if regex 1.0 requires Rust 1.20.0, then regex 1.0.z for all values of `z` will also require Rust 1.20.0 or newer. However, regex 1.y for `y > 0` may require a newer minimum version of Rust. ### License This project is licensed under either of * Apache License, Version 2.0, ([LICENSE-APACHE](LICENSE-APACHE) or https://www.apache.org/licenses/LICENSE-2.0) * MIT license ([LICENSE-MIT](LICENSE-MIT) or https://opensource.org/licenses/MIT) at your option. The data in `regex-syntax/src/unicode_tables/` is licensed under the Unicode License Agreement ([LICENSE-UNICODE](https://www.unicode.org/copyright.html#License)). regex-1.12.2/UNICODE.md000064400000000000000000000243151046102023000124360ustar 00000000000000# Unicode conformance This document describes the regex crate's conformance to Unicode's [UTS#18](https://unicode.org/reports/tr18/) report, which lays out 3 levels of support: Basic, Extended and Tailored. Full support for Level 1 ("Basic Unicode Support") is provided with two exceptions: 1. Line boundaries are not Unicode aware. Namely, only the `\n` (`END OF LINE`) character is recognized as a line boundary by default. One can opt into `\r\n|\r|\n` being a line boundary via CRLF mode. 2. The compatibility properties specified by [RL1.2a](https://unicode.org/reports/tr18/#RL1.2a) are ASCII-only definitions. Little to no support is provided for either Level 2 or Level 3. For the most part, this is because the features are either complex/hard to implement, or at the very least, very difficult to implement without sacrificing performance. For example, tackling canonical equivalence such that matching worked as one would expect regardless of normalization form would be a significant undertaking. This is at least partially a result of the fact that this regex engine is based on finite automata, which admits less flexibility normally associated with backtracking implementations. 
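As a brief illustration of the first exception above, the sketch below shows CRLF mode in action (assuming a `regex` release new enough to support the `R` flag; the pattern and haystack are illustrative):

```rust
use regex::Regex;

fn main() {
    let hay = "abc\r\ndef\r\n";

    // By default, only `\n` is a line terminator, so in multi-line mode
    // `$` cannot match just before the `\r` and neither line matches.
    let re = Regex::new(r"(?m)^[a-z]+$").unwrap();
    assert!(!re.is_match(hay));

    // With CRLF mode enabled via the `R` flag, both `\r` and `\n` act
    // as line terminators, and `\r\n` is never split down the middle.
    let re = Regex::new(r"(?mR)^[a-z]+$").unwrap();
    let lines: Vec<&str> = re.find_iter(hay).map(|m| m.as_str()).collect();
    assert_eq!(lines, vec!["abc", "def"]);
}
```

The same mode can also be enabled programmatically via `RegexBuilder::crlf`.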
## RL1.1 Hex Notation

[UTS#18 RL1.1](https://unicode.org/reports/tr18/#Hex_notation)

Hex Notation refers to the ability to specify a Unicode code point in a
regular expression via its hexadecimal code point representation. This is
useful in environments that have poor Unicode font rendering or if you need to
express a code point that is not normally displayable. All forms of
hexadecimal notation are supported:

    \x7F        hex character code (exactly two digits)
    \x{10FFFF}  any hex character code corresponding to a Unicode code point
    \u007F      hex character code (exactly four digits)
    \u{7F}      any hex character code corresponding to a Unicode code point
    \U0000007F  hex character code (exactly eight digits)
    \U{7F}      any hex character code corresponding to a Unicode code point

Briefly, the `\x{...}`, `\u{...}` and `\U{...}` are all exactly equivalent
ways of expressing hexadecimal code points. Any number of digits can be
written within the brackets. In contrast, `\xNN`, `\uNNNN`, `\UNNNNNNNN` are
all fixed-width variants of the same idea.

Note that when Unicode mode is disabled, any non-ASCII Unicode codepoint is
banned. Additionally, the `\xNN` syntax represents arbitrary bytes when
Unicode mode is disabled. That is, the regex `\xFF` matches the Unicode
codepoint U+00FF (encoded as `\xC3\xBF` in UTF-8) while the regex `(?-u)\xFF`
matches the literal byte `\xFF`.

## RL1.2 Properties

[UTS#18 RL1.2](https://unicode.org/reports/tr18/#Categories)

Full support for Unicode property syntax is provided. Unicode properties
provide a convenient way to construct character classes of groups of code
points specified by Unicode. The regex crate does not provide exhaustive
support, but covers a useful subset. In particular:

* [General categories](https://unicode.org/reports/tr18/#General_Category_Property)
* [Scripts and Script Extensions](https://unicode.org/reports/tr18/#Script_Property)
* [Age](https://unicode.org/reports/tr18/#Age)
* A smattering of boolean properties, including all of those specified by
  [RL1.2](https://unicode.org/reports/tr18/#RL1.2) explicitly.

In all cases, property name and value abbreviations are supported, and all
names/values are matched loosely without regard for case, whitespace or
underscores. Property name aliases can be found in Unicode's
[`PropertyAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt)
file, while property value aliases can be found in Unicode's
[`PropertyValueAliases.txt`](https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt)
file.

The syntax supported is also consistent with the UTS#18 recommendation:

* `\p{Greek}` selects the `Greek` script. Equivalent expressions follow:
  `\p{sc:Greek}`, `\p{Script:Greek}`, `\p{Sc=Greek}`, `\p{script=Greek}`,
  `\P{sc!=Greek}`. Similarly for `General_Category` (or `gc` for short) and
  `Script_Extensions` (or `scx` for short).
* `\p{age:3.2}` selects all code points in Unicode 3.2.
* `\p{Alphabetic}` selects the "alphabetic" property and can be abbreviated
  via `\p{alpha}` (for example).
* Single letter variants for properties with single letter abbreviations. For
  example, `\p{Letter}` can be equivalently written as `\pL`.
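As an illustrative sketch, the equivalent spellings above can be exercised
like so (the choice of `Greek` and the haystacks are just for demonstration):

```rust
use regex::Regex;

fn main() {
    // Equivalent ways of naming the `Greek` value of the `Script`
    // property, matched loosely.
    for pat in [r"\p{Greek}", r"\p{sc:Greek}", r"\p{script=Greek}"] {
        let re = Regex::new(pat).unwrap();
        assert!(re.is_match("ฮฒ"));
        assert!(!re.is_match("b"));
    }

    // `\p{Letter}` has the single letter abbreviation `L`, so `\pL` is
    // an equivalent spelling.
    assert!(Regex::new(r"\pL").unwrap().is_match("b"));
}
```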
The following is a list of all properties supported by the regex crate
(starred properties correspond to properties required by RL1.2):

* `General_Category` \* (including `Any`, `ASCII` and `Assigned`)
* `Script` \*
* `Script_Extensions` \*
* `Age`
* `ASCII_Hex_Digit`
* `Alphabetic` \*
* `Bidi_Control`
* `Case_Ignorable`
* `Cased`
* `Changes_When_Casefolded`
* `Changes_When_Casemapped`
* `Changes_When_Lowercased`
* `Changes_When_Titlecased`
* `Changes_When_Uppercased`
* `Dash`
* `Default_Ignorable_Code_Point` \*
* `Deprecated`
* `Diacritic`
* `Emoji`
* `Emoji_Presentation`
* `Emoji_Modifier`
* `Emoji_Modifier_Base`
* `Emoji_Component`
* `Extended_Pictographic`
* `Extender`
* `Grapheme_Base`
* `Grapheme_Cluster_Break`
* `Grapheme_Extend`
* `Hex_Digit`
* `IDS_Binary_Operator`
* `IDS_Trinary_Operator`
* `ID_Continue`
* `ID_Start`
* `Join_Control`
* `Logical_Order_Exception`
* `Lowercase` \*
* `Math`
* `Noncharacter_Code_Point` \*
* `Pattern_Syntax`
* `Pattern_White_Space`
* `Prepended_Concatenation_Mark`
* `Quotation_Mark`
* `Radical`
* `Regional_Indicator`
* `Sentence_Break`
* `Sentence_Terminal`
* `Soft_Dotted`
* `Terminal_Punctuation`
* `Unified_Ideograph`
* `Uppercase` \*
* `Variation_Selector`
* `White_Space` \*
* `Word_Break`
* `XID_Continue`
* `XID_Start`

## RL1.2a Compatibility Properties

[UTS#18 RL1.2a](https://unicode.org/reports/tr18/#RL1.2a)

The regex crate only provides ASCII definitions of the [compatibility
properties documented in UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties)
(sans the `\X` class, for matching grapheme clusters, which isn't provided at
all). This is because it seems to be consistent with most other regular
expression engines, and in particular, because these are often referred to as
"ASCII" or "POSIX" character classes.

Note that the `\w`, `\s` and `\d` character classes **are** Unicode aware.
Their traditional ASCII definition can be used by disabling Unicode. That is,
`[[:word:]]` and `(?-u)\w` are equivalent.

## RL1.3 Subtraction and Intersection

[UTS#18 RL1.3](https://unicode.org/reports/tr18/#Subtraction_and_Intersection)

The regex crate provides full support for nested character classes, along with
union, intersection (`&&`), difference (`--`) and symmetric difference (`~~`)
operations on arbitrary character classes. For example, to match all non-ASCII
letters, you could use either `[\p{Letter}--\p{Ascii}]` (difference) or
`[\p{Letter}&&[^\p{Ascii}]]` (intersecting the negation).

## RL1.4 Simple Word Boundaries

[UTS#18 RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries)

The regex crate provides basic Unicode aware word boundary assertions. A word
boundary assertion can be written as `\b`, or `\B` as its negation. A word
boundary assertion corresponds to a zero-width match, where its adjacent
characters correspond to word and non-word, or non-word and word characters;
the negation `\B` matches precisely where `\b` does not. Conformance in this
case chooses to define word character in the same way that the `\w` character
class is defined: a code point that is a member of one of the following
classes:

* `\p{Alphabetic}`
* `\p{Join_Control}`
* `\p{gc:Mark}`
* `\p{gc:Decimal_Number}`
* `\p{gc:Connector_Punctuation}`

In particular, this differs slightly from the [prescription given in
RL1.4](https://unicode.org/reports/tr18/#Simple_Word_Boundaries) but is
permissible according to [UTS#18 Annex C](https://unicode.org/reports/tr18/#Compatibility_Properties).
Namely, it is convenient and simpler to have `\w` and `\b` be in sync with one
another.
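Here is a short sketch exercising both the RL1.3 set operations above and the
word boundary definition described here (the specific characters are just for
illustration):

```rust
use regex::Regex;

fn main() {
    // RL1.3: difference strips the ASCII letters out of `\p{Letter}`.
    let non_ascii_letter =
        Regex::new(r"[\p{Letter}--\p{Ascii}]").unwrap();
    assert!(non_ascii_letter.is_match("ฮด"));
    assert!(!non_ascii_letter.is_match("d"));

    // RL1.4: 'ฮด' is `\p{Alphabetic}` and hence a word character, so
    // `\b` sees a boundary on either side of it.
    let word = Regex::new(r"\bฮด\b").unwrap();
    assert!(word.is_match("ฮฑ ฮด ฯ‰"));
}
```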
Finally, Unicode word boundaries can be disabled, which will cause ASCII word
boundaries to be used instead. That is, `\b` is a Unicode word boundary while
`(?-u)\b` is an ASCII-only word boundary. This can occasionally be beneficial
if performance is important, since the implementation of Unicode word
boundaries is currently suboptimal on non-ASCII text.

## RL1.5 Simple Loose Matches

[UTS#18 RL1.5](https://unicode.org/reports/tr18/#Simple_Loose_Matches)

The regex crate provides full support for case-insensitive matching in
accordance with RL1.5. That is, it uses the "simple" case folding mapping. The
"simple" mapping was chosen because of a key convenient property: every
"simple" mapping is a mapping from exactly one code point to exactly one other
code point. This makes case-insensitive matching of character classes, for
example, straight-forward to implement.

When case-insensitive mode is enabled (e.g., `(?i)[a]` is equivalent to
`a|A`), then all character classes are case folded as well.

## RL1.6 Line Boundaries

[UTS#18 RL1.6](https://unicode.org/reports/tr18/#Line_Boundaries)

The regex crate only provides support for recognizing the `\n` (`END OF
LINE`) character as a line boundary by default. One can also opt into treating
`\r\n|\r|\n` as a line boundary via CRLF mode. This choice was made mostly for
implementation convenience, and to avoid performance cliffs like the ones that
Unicode word boundaries are subject to.

## RL1.7 Code Points

[UTS#18 RL1.7](https://unicode.org/reports/tr18/#Supplementary_Characters)

The regex crate provides full support for Unicode code point matching. Namely,
the fundamental atom of any match is always a single code point.

Given Rust's strong ties to UTF-8, the following guarantees are also provided:

* All matches are reported on valid UTF-8 code unit boundaries. That is, any
  match range returned by the public regex API is guaranteed to successfully
  slice the string that was searched.
* By consequence of the above, it is impossible to match surrogate code
  points. No support for UTF-16 is provided, so this is never necessary.

Note that when Unicode mode is disabled, the fundamental atom of matching is
no longer a code point but a single byte. When Unicode mode is disabled, many
Unicode features are disabled as well. For example, `(?-u)\pL` is not a valid
regex but `\pL(?-u)\xFF` (which matches any Unicode `Letter` followed by the
literal byte `\xFF`) is.

regex-1.12.2/bench/README.md000064400000000000000000000001461046102023000133600ustar 00000000000000Benchmarks for this crate have been moved into the rebar project: https://github.com/BurntSushi/rebar

regex-1.12.2/rustfmt.toml000064400000000000000000000000541046102023000134210ustar 00000000000000max_width = 79
use_small_heuristics = "max"

regex-1.12.2/src/builders.rs000064400000000000000000003221661046102023000140010ustar 00000000000000#![allow(warnings)]

// This module defines an internal builder that encapsulates all interaction
// with meta::Regex construction, and then 4 public API builders that wrap
// around it. The docs are essentially repeated on each of the 4 public
// builders, with tweaks to the examples as needed.
//
// The reason why there are so many builders is partially because of a misstep
// in the initial API design: the builder constructor takes in the pattern
// strings instead of using the `build` method to accept the pattern strings.
// This means `new` has a different signature for each builder.
It probably
// would have been nicer to use one builder with `fn new()`, and then add
// `build(pat)` and `build_many(pats)` constructors.
//
// The other reason is because I think the `bytes` module should probably
// have its own builder type. That way, it is completely isolated from the
// top-level API.
//
// If I could do it again, I'd probably have a `regex::Builder` and a
// `regex::bytes::Builder`. Each would have `build` and `build_set` (or
// `build_many`) methods for constructing a single pattern `Regex` and a
// multi-pattern `RegexSet`, respectively.

use alloc::{
    string::{String, ToString},
    sync::Arc,
    vec,
    vec::Vec,
};

use regex_automata::{
    meta, nfa::thompson::WhichCaptures, util::syntax, MatchKind,
};

use crate::error::Error;

/// A builder for constructing a `Regex`, `bytes::Regex`, `RegexSet` or a
/// `bytes::RegexSet`.
///
/// This is essentially the implementation of the four different builder types
/// in the public API: `RegexBuilder`, `bytes::RegexBuilder`, `RegexSetBuilder`
/// and `bytes::RegexSetBuilder`.
#[derive(Clone, Debug)]
struct Builder {
    pats: Vec<String>,
    metac: meta::Config,
    syntaxc: syntax::Config,
}

impl Default for Builder {
    fn default() -> Builder {
        let metac = meta::Config::new()
            .nfa_size_limit(Some(10 * (1 << 20)))
            .hybrid_cache_capacity(2 * (1 << 20));
        Builder { pats: vec![], metac, syntaxc: syntax::Config::default() }
    }
}

impl Builder {
    fn new<I, S>(patterns: I) -> Builder
    where
        S: AsRef<str>,
        I: IntoIterator<Item = S>,
    {
        let mut b = Builder::default();
        b.pats.extend(patterns.into_iter().map(|p| p.as_ref().to_string()));
        b
    }

    fn build_one_string(&self) -> Result<crate::Regex, Error> {
        assert_eq!(1, self.pats.len());
        let metac = self
            .metac
            .clone()
            .match_kind(MatchKind::LeftmostFirst)
            .utf8_empty(true);
        let syntaxc = self.syntaxc.clone().utf8(true);
        let pattern = Arc::from(self.pats[0].as_str());
        meta::Builder::new()
            .configure(metac)
            .syntax(syntaxc)
            .build(&pattern)
            .map(|meta| crate::Regex { meta, pattern })
            .map_err(Error::from_meta_build_error)
    }

    fn build_one_bytes(&self) -> Result<crate::bytes::Regex, Error> {
        assert_eq!(1, self.pats.len());
        let metac = self
            .metac
            .clone()
            .match_kind(MatchKind::LeftmostFirst)
            .utf8_empty(false);
        let syntaxc = self.syntaxc.clone().utf8(false);
        let pattern = Arc::from(self.pats[0].as_str());
        meta::Builder::new()
            .configure(metac)
            .syntax(syntaxc)
            .build(&pattern)
            .map(|meta| crate::bytes::Regex { meta, pattern })
            .map_err(Error::from_meta_build_error)
    }

    fn build_many_string(&self) -> Result<crate::RegexSet, Error> {
        let metac = self
            .metac
            .clone()
            .match_kind(MatchKind::All)
            .utf8_empty(true)
            .which_captures(WhichCaptures::None);
        let syntaxc = self.syntaxc.clone().utf8(true);
        let patterns = Arc::from(self.pats.as_slice());
        meta::Builder::new()
            .configure(metac)
            .syntax(syntaxc)
            .build_many(&patterns)
            .map(|meta| crate::RegexSet { meta, patterns })
            .map_err(Error::from_meta_build_error)
    }

    fn build_many_bytes(&self) -> Result<crate::bytes::RegexSet, Error> {
        let metac = self
            .metac
            .clone()
            .match_kind(MatchKind::All)
            .utf8_empty(false)
            .which_captures(WhichCaptures::None);
        let syntaxc = self.syntaxc.clone().utf8(false);
        let patterns = Arc::from(self.pats.as_slice());
        meta::Builder::new()
            .configure(metac)
            .syntax(syntaxc)
            .build_many(&patterns)
            .map(|meta| crate::bytes::RegexSet { meta, patterns })
            .map_err(Error::from_meta_build_error)
    }

    fn case_insensitive(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.case_insensitive(yes);
        self
    }

    fn multi_line(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.multi_line(yes);
        self
    }

    fn dot_matches_new_line(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc =
self.syntaxc.dot_matches_new_line(yes);
        self
    }

    fn crlf(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.crlf(yes);
        self
    }

    fn line_terminator(&mut self, byte: u8) -> &mut Builder {
        self.metac = self.metac.clone().line_terminator(byte);
        self.syntaxc = self.syntaxc.line_terminator(byte);
        self
    }

    fn swap_greed(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.swap_greed(yes);
        self
    }

    fn ignore_whitespace(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.ignore_whitespace(yes);
        self
    }

    fn unicode(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.unicode(yes);
        self
    }

    fn octal(&mut self, yes: bool) -> &mut Builder {
        self.syntaxc = self.syntaxc.octal(yes);
        self
    }

    fn size_limit(&mut self, limit: usize) -> &mut Builder {
        self.metac = self.metac.clone().nfa_size_limit(Some(limit));
        self
    }

    fn dfa_size_limit(&mut self, limit: usize) -> &mut Builder {
        self.metac = self.metac.clone().hybrid_cache_capacity(limit);
        self
    }

    fn nest_limit(&mut self, limit: u32) -> &mut Builder {
        self.syntaxc = self.syntaxc.nest_limit(limit);
        self
    }
}

pub(crate) mod string {
    use crate::{error::Error, Regex, RegexSet};

    use super::Builder;

    /// A configurable builder for a [`Regex`].
    ///
    /// This builder can be used to programmatically set flags such as `i`
    /// (case insensitive) and `x` (for verbose mode). This builder can also be
    /// used to configure things like the line terminator and a size limit on
    /// the compiled regular expression.
    #[derive(Clone, Debug)]
    pub struct RegexBuilder {
        builder: Builder,
    }

    impl RegexBuilder {
        /// Create a new builder with a default configuration for the given
        /// pattern.
        ///
        /// If the pattern is invalid or exceeds the configured size limits,
        /// then an error will be returned when [`RegexBuilder::build`] is
        /// called.
        pub fn new(pattern: &str) -> RegexBuilder {
            RegexBuilder { builder: Builder::new([pattern]) }
        }

        /// Compiles the pattern given to `RegexBuilder::new` with the
        /// configuration set on this builder.
        ///
        /// If the pattern isn't a valid regex or if a configured size limit
        /// was exceeded, then an error is returned.
        pub fn build(&self) -> Result<Regex, Error> {
            self.builder.build_one_string()
        }

        /// This configures Unicode mode for the entire pattern.
        ///
        /// Enabling Unicode mode does a number of things:
        ///
        /// * Most fundamentally, it causes the fundamental atom of matching
        /// to be a single codepoint. When Unicode mode is disabled, it's a
        /// single byte. For example, when Unicode mode is enabled, `.` will
        /// match `๐Ÿ’ฉ` once, whereas it will match 4 times when Unicode mode
        /// is disabled. (Since the UTF-8 encoding of `๐Ÿ’ฉ` is 4 bytes long.)
        /// * Case insensitive matching uses Unicode simple case folding rules.
        /// * Unicode character classes like `\p{Letter}` and `\p{Greek}` are
        /// available.
        /// * Perl character classes are Unicode aware. That is, `\w`, `\s` and
        /// `\d`.
        /// * The word boundary assertions, `\b` and `\B`, use the Unicode
        /// definition of a word character.
        ///
        /// Note that if Unicode mode is disabled, then the regex will fail to
        /// compile if it could match invalid UTF-8. For example, when Unicode
        /// mode is disabled, then since `.` matches any byte (except for
        /// `\n`), then it can match invalid UTF-8 and thus building a regex
        /// from it will fail. Another example is `\w` and `\W`. Since `\w` can
        /// only match ASCII bytes when Unicode mode is disabled, it's allowed.
This restriction can be lifted only by /// using a [`bytes::Regex`](crate::bytes::Regex). /// /// For more details on the Unicode support in this crate, see the /// [Unicode section](crate#unicode) in this crate's top-level /// documentation. /// /// The default for this is `true`. /// /// # Example /// /// ``` /// use regex::RegexBuilder; /// /// let re = RegexBuilder::new(r"\w") /// .unicode(false) /// .build() /// .unwrap(); /// // Normally greek letters would be included in \w, but since /// // Unicode mode is disabled, it only matches ASCII letters. /// assert!(!re.is_match("ฮด")); /// /// let re = RegexBuilder::new(r"s") /// .case_insensitive(true) /// .unicode(false) /// .build() /// .unwrap(); /// // Normally 'ลฟ' is included when searching for 's' case /// // insensitively due to Unicode's simple case folding rules. But /// // when Unicode mode is disabled, only ASCII case insensitive rules /// // are used. /// assert!(!re.is_match("ลฟ")); /// ``` pub fn unicode(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.unicode(yes); self } /// This configures whether to enable case insensitive matching for the /// entire pattern. /// /// This setting can also be configured using the inline flag `i` /// in the pattern. For example, `(?i:foo)` matches `foo` case /// insensitively while `(?-i:foo)` matches `foo` case sensitively. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexBuilder; /// /// let re = RegexBuilder::new(r"foo(?-i:bar)quux") /// .case_insensitive(true) /// .build() /// .unwrap(); /// assert!(re.is_match("FoObarQuUx")); /// // Even though case insensitive matching is enabled in the builder, /// // it can be locally disabled within the pattern. In this case, /// // `bar` is matched case sensitively. /// assert!(!re.is_match("fooBARquux")); /// ``` pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.case_insensitive(yes); self } /// This configures multi-line mode for the entire pattern. /// /// Enabling multi-line mode changes the behavior of the `^` and `$` /// anchor assertions. Instead of only matching at the beginning and /// end of a haystack, respectively, multi-line mode causes them to /// match at the beginning and end of a line *in addition* to the /// beginning and end of a haystack. More precisely, `^` will match at /// the position immediately following a `\n` and `$` will match at the /// position immediately preceding a `\n`. /// /// The behavior of this option can be impacted by other settings too: /// /// * The [`RegexBuilder::line_terminator`] option changes `\n` above /// to any ASCII byte. /// * The [`RegexBuilder::crlf`] option changes the line terminator to /// be either `\r` or `\n`, but never at the position between a `\r` /// and `\n`. /// /// This setting can also be configured using the inline flag `m` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexBuilder; /// /// let re = RegexBuilder::new(r"^foo$") /// .multi_line(true) /// .build() /// .unwrap(); /// assert_eq!(Some(1..4), re.find("\nfoo\n").map(|m| m.range())); /// ``` pub fn multi_line(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.multi_line(yes); self } /// This configures dot-matches-new-line mode for the entire pattern. /// /// Perhaps surprisingly, the default behavior for `.` is not to match /// any character, but rather, to match any character except for the /// line terminator (which is `\n` by default). 
When this mode is /// enabled, the behavior changes such that `.` truly matches any /// character. /// /// This setting can also be configured using the inline flag `s` in /// the pattern. For example, `(?s:.)` and `\p{any}` are equivalent /// regexes. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexBuilder; /// /// let re = RegexBuilder::new(r"foo.bar") /// .dot_matches_new_line(true) /// .build() /// .unwrap(); /// let hay = "foo\nbar"; /// assert_eq!(Some("foo\nbar"), re.find(hay).map(|m| m.as_str())); /// ``` pub fn dot_matches_new_line( &mut self, yes: bool, ) -> &mut RegexBuilder { self.builder.dot_matches_new_line(yes); self } /// This configures CRLF mode for the entire pattern. /// /// When CRLF mode is enabled, both `\r` ("carriage return" or CR for /// short) and `\n` ("line feed" or LF for short) are treated as line /// terminators. This results in the following: /// /// * Unless dot-matches-new-line mode is enabled, `.` will now match /// any character except for `\n` and `\r`. /// * When multi-line mode is enabled, `^` will match immediately /// following a `\n` or a `\r`. Similarly, `$` will match immediately /// preceding a `\n` or a `\r`. Neither `^` nor `$` will ever match /// between `\r` and `\n`. /// /// This setting can also be configured using the inline flag `R` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexBuilder; /// /// let re = RegexBuilder::new(r"^foo$") /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// let hay = "\r\nfoo\r\n"; /// // If CRLF mode weren't enabled here, then '$' wouldn't match /// // immediately after 'foo', and thus no match would be found. /// assert_eq!(Some("foo"), re.find(hay).map(|m| m.as_str())); /// ``` /// /// This example demonstrates that `^` will never match at a position /// between `\r` and `\n`. (`$` will similarly not match between a `\r` /// and a `\n`.) /// /// ``` /// use regex::RegexBuilder; /// /// let re = RegexBuilder::new(r"^") /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// let hay = "\r\n\r\n"; /// let ranges: Vec<_> = re.find_iter(hay).map(|m| m.range()).collect(); /// assert_eq!(ranges, vec![0..0, 2..2, 4..4]); /// ``` pub fn crlf(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.crlf(yes); self } /// Configures the line terminator to be used by the regex. /// /// The line terminator is relevant in two ways for a particular regex: /// /// * When dot-matches-new-line mode is *not* enabled (the default), /// then `.` will match any character except for the configured line /// terminator. /// * When multi-line mode is enabled (not the default), then `^` and /// `$` will match immediately after and before, respectively, a line /// terminator. /// /// In both cases, if CRLF mode is enabled in a particular context, /// then it takes precedence over any configured line terminator. /// /// This option cannot be configured from within the pattern. /// /// The default line terminator is `\n`. /// /// # Example /// /// This shows how to treat the NUL byte as a line terminator. This can /// be a useful heuristic when searching binary data. 
///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// let re = RegexBuilder::new(r"^foo$")
        ///     .multi_line(true)
        ///     .line_terminator(b'\x00')
        ///     .build()
        ///     .unwrap();
        /// let hay = "\x00foo\x00";
        /// assert_eq!(Some(1..4), re.find(hay).map(|m| m.range()));
        /// ```
        ///
        /// This example shows that the behavior of `.` is impacted by this
        /// setting as well:
        ///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// let re = RegexBuilder::new(r".")
        ///     .line_terminator(b'\x00')
        ///     .build()
        ///     .unwrap();
        /// assert!(re.is_match("\n"));
        /// assert!(!re.is_match("\x00"));
        /// ```
        ///
        /// This shows that building a regex will fail if the byte given
        /// is not ASCII and the pattern could result in matching invalid
        /// UTF-8. This is because any singular non-ASCII byte is not valid
        /// UTF-8, and it is not permitted for a [`Regex`] to match invalid
        /// UTF-8. (It is permissible to use a non-ASCII byte when building a
        /// [`bytes::Regex`](crate::bytes::Regex).)
        ///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// assert!(RegexBuilder::new(r".").line_terminator(0x80).build().is_err());
        /// // Note that using a non-ASCII byte isn't enough on its own to
        /// // cause regex compilation to fail. You actually have to make use
        /// // of it in the regex in a way that leads to matching invalid
        /// // UTF-8. If you don't, then regex compilation will succeed!
        /// assert!(RegexBuilder::new(r"a").line_terminator(0x80).build().is_ok());
        /// ```
        pub fn line_terminator(&mut self, byte: u8) -> &mut RegexBuilder {
            self.builder.line_terminator(byte);
            self
        }

        /// This configures swap-greed mode for the entire pattern.
        ///
        /// When swap-greed mode is enabled, patterns like `a+` will become
        /// non-greedy and patterns like `a+?` will become greedy. In other
        /// words, the meanings of `a+` and `a+?` are switched.
        ///
        /// This setting can also be configured using the inline flag `U` in
        /// the pattern.
        ///
        /// The default for this is `false`.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// let re = RegexBuilder::new(r"a+")
        ///     .swap_greed(true)
        ///     .build()
        ///     .unwrap();
        /// assert_eq!(Some("a"), re.find("aaa").map(|m| m.as_str()));
        /// ```
        pub fn swap_greed(&mut self, yes: bool) -> &mut RegexBuilder {
            self.builder.swap_greed(yes);
            self
        }

        /// This configures verbose mode for the entire pattern.
        ///
        /// When enabled, whitespace will be treated as insignificant in the
        /// pattern and `#` can be used to start a comment until the next new
        /// line.
        ///
        /// Normally, in most places in a pattern, whitespace is treated
        /// literally. For example ` +` will match one or more ASCII whitespace
        /// characters.
        ///
        /// When verbose mode is enabled, `\#` can be used to match a literal
        /// `#` and `\ ` can be used to match a literal ASCII whitespace
        /// character.
        ///
        /// Verbose mode is useful for permitting regexes to be formatted and
        /// broken up more nicely. This may make them more easily readable.
        ///
        /// This setting can also be configured using the inline flag `x` in
        /// the pattern.
        ///
        /// The default for this is `false`.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// let pat = r"
        ///     \b
        ///     (?<first>\p{Uppercase}\w*)  # always start with uppercase letter
        ///     [\s--\n]+                   # whitespace should separate names
        ///     (?: # middle name can be an initial!
        ///         (?:(?<initial>\p{Uppercase})\.|(?<middle>\p{Uppercase}\w*))
        ///         [\s--\n]+
        ///     )?
///     (?<last>\p{Uppercase}\w*)
        ///     \b
        /// ";
        /// let re = RegexBuilder::new(pat)
        ///     .ignore_whitespace(true)
        ///     .build()
        ///     .unwrap();
        ///
        /// let caps = re.captures("Harry Potter").unwrap();
        /// assert_eq!("Harry", &caps["first"]);
        /// assert_eq!("Potter", &caps["last"]);
        ///
        /// let caps = re.captures("Harry J. Potter").unwrap();
        /// assert_eq!("Harry", &caps["first"]);
        /// // Since a middle name/initial isn't required for an overall match,
        /// // we can't assume that 'initial' or 'middle' will be populated!
        /// assert_eq!(Some("J"), caps.name("initial").map(|m| m.as_str()));
        /// assert_eq!(None, caps.name("middle").map(|m| m.as_str()));
        /// assert_eq!("Potter", &caps["last"]);
        ///
        /// let caps = re.captures("Harry James Potter").unwrap();
        /// assert_eq!("Harry", &caps["first"]);
        /// // Since a middle name/initial isn't required for an overall match,
        /// // we can't assume that 'initial' or 'middle' will be populated!
        /// assert_eq!(None, caps.name("initial").map(|m| m.as_str()));
        /// assert_eq!(Some("James"), caps.name("middle").map(|m| m.as_str()));
        /// assert_eq!("Potter", &caps["last"]);
        /// ```
        pub fn ignore_whitespace(&mut self, yes: bool) -> &mut RegexBuilder {
            self.builder.ignore_whitespace(yes);
            self
        }

        /// This configures octal mode for the entire pattern.
        ///
        /// Octal syntax is a little-known way of uttering Unicode codepoints
        /// in a pattern. For example, `a`, `\x61`, `\u0061` and `\141` are all
        /// equivalent patterns, where the last example shows octal syntax.
        ///
        /// While supporting octal syntax isn't in and of itself a problem,
        /// it does make good error messages harder. That is, in PCRE based
        /// regex engines, syntax like `\1` invokes a backreference, which is
        /// explicitly unsupported by this library. However, many users expect
        /// backreferences to be supported. Therefore, when octal support
        /// is disabled, the error message will explicitly mention that
        /// backreferences aren't supported.
        ///
        /// The default for this is `false`.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// // Normally this pattern would not compile, with an error message
        /// // about backreferences not being supported. But with octal mode
        /// // enabled, octal escape sequences work.
        /// let re = RegexBuilder::new(r"\141")
        ///     .octal(true)
        ///     .build()
        ///     .unwrap();
        /// assert!(re.is_match("a"));
        /// ```
        pub fn octal(&mut self, yes: bool) -> &mut RegexBuilder {
            self.builder.octal(yes);
            self
        }

        /// Sets the approximate size limit, in bytes, of the compiled regex.
        ///
        /// This roughly corresponds to the amount of heap memory, in
        /// bytes, occupied by a single regex. If the regex would otherwise
        /// approximately exceed this limit, then compiling that regex will
        /// fail.
        ///
        /// The main utility of a method like this is to avoid compiling
        /// regexes that use an unexpected amount of resources, such as
        /// time and memory. Even if the memory usage of a large regex is
        /// acceptable, its search time may not be. Namely, worst case time
        /// complexity for search is `O(m * n)`, where `m ~ len(pattern)` and
        /// `n ~ len(haystack)`. That is, search time depends, in part, on the
        /// size of the compiled regex. This means that putting a limit on the
        /// size of the regex limits how much a regex can impact search time.
        ///
        /// For more information about regex size limits, see the section on
        /// [untrusted inputs](crate#untrusted-input) in the top-level crate
        /// documentation.
        ///
        /// The default for this is some reasonable number that permits most
        /// patterns to compile successfully.
///
        /// # Example
        ///
        /// ```
        /// # if !cfg!(target_pointer_width = "64") { return; } // see #1041
        /// use regex::RegexBuilder;
        ///
        /// // It may surprise you how big some seemingly small patterns can
        /// // be! Since \w is Unicode aware, this generates a regex that can
        /// // match approximately 140,000 distinct codepoints.
        /// assert!(RegexBuilder::new(r"\w").size_limit(45_000).build().is_err());
        /// ```
        pub fn size_limit(&mut self, bytes: usize) -> &mut RegexBuilder {
            self.builder.size_limit(bytes);
            self
        }

        /// Set the approximate capacity, in bytes, of the cache of transitions
        /// used by the lazy DFA.
        ///
        /// While the lazy DFA isn't always used, it tends to be the most
        /// commonly used regex engine in default configurations. It tends to
        /// adopt the performance profile of a fully built DFA, but without the
        /// downside of taking worst case exponential time to build.
        ///
        /// The downside is that it needs to keep a cache of transitions and
        /// states that are built while running a search, and this cache
        /// can fill up. When it fills up, the cache will reset itself. Any
        /// previously generated states and transitions will then need to be
        /// re-generated. If this happens too many times, then this library
        /// will bail out of using the lazy DFA and switch to a different regex
        /// engine.
        ///
        /// If your regex provokes this particular downside of the lazy DFA,
        /// then it may be beneficial to increase its cache capacity. This will
        /// potentially reduce the frequency of cache resetting (ideally to
        /// `0`). While it won't fix all potential performance problems with
        /// the lazy DFA, increasing the cache capacity does fix some.
        ///
        /// There is no easy way to determine, a priori, whether increasing
        /// this cache capacity will help. In general, the larger your regex,
        /// the more cache it's likely to use. But that isn't an ironclad rule.
        /// For example, a regex like `[01]*1[01]{N}` would normally produce a
        /// fully built DFA that is exponential in size with respect to `N`.
        /// The lazy DFA will prevent exponential space blow-up, but its cache
        /// is likely to fill up, even when it's large and even for smallish
        /// values of `N`.
        ///
        /// If you aren't sure whether this helps or not, it is sensible to
        /// set this to some arbitrarily large number in testing, such as
        /// `usize::MAX`. Namely, this represents the amount of capacity that
        /// *may* be used. It's probably not a good idea to use `usize::MAX` in
        /// production though, since it implies there are no controls on heap
        /// memory used by this library during a search. In effect, set it to
        /// whatever you're willing to allocate for a single regex search.
        pub fn dfa_size_limit(&mut self, bytes: usize) -> &mut RegexBuilder {
            self.builder.dfa_size_limit(bytes);
            self
        }

        /// Set the nesting limit for this parser.
        ///
        /// The nesting limit controls how deep the abstract syntax tree is
        /// allowed to be. If the AST exceeds the given limit (e.g., with too
        /// many nested groups), then an error is returned by the parser.
        ///
        /// The purpose of this limit is to act as a heuristic to prevent stack
        /// overflow for consumers that do structural induction on an AST using
        /// explicit recursion. While this crate never does this (instead using
        /// constant stack space and moving the call stack to the heap), other
        /// crates may.
        ///
        /// This limit is not checked until the entire AST is parsed.
Therefore, if callers want to put a limit on the amount of heap
        /// space used, then they should impose a limit on the length, in
        /// bytes, of the concrete pattern string. In particular, this is
        /// viable since this parser implementation will limit itself to heap
        /// space proportional to the length of the pattern string. See also
        /// the [untrusted inputs](crate#untrusted-input) section in the
        /// top-level crate documentation for more information about this.
        ///
        /// Note that a nest limit of `0` will return a nest limit error for
        /// most patterns but not all. For example, a nest limit of `0` permits
        /// `a` but not `ab`, since `ab` requires an explicit concatenation,
        /// which results in a nest depth of `1`. In general, a nest limit is
        /// not something that manifests in an obvious way in the concrete
        /// syntax, therefore, it should not be used in a granular way.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::RegexBuilder;
        ///
        /// assert!(RegexBuilder::new(r"a").nest_limit(0).build().is_ok());
        /// assert!(RegexBuilder::new(r"ab").nest_limit(0).build().is_err());
        /// ```
        pub fn nest_limit(&mut self, limit: u32) -> &mut RegexBuilder {
            self.builder.nest_limit(limit);
            self
        }
    }

    /// A configurable builder for a [`RegexSet`].
    ///
    /// This builder can be used to programmatically set flags such as
    /// `i` (case insensitive) and `x` (for verbose mode). This builder
    /// can also be used to configure things like the line terminator
    /// and a size limit on the compiled regular expression.
    #[derive(Clone, Debug)]
    pub struct RegexSetBuilder {
        builder: Builder,
    }

    impl RegexSetBuilder {
        /// Create a new builder with a default configuration for the given
        /// patterns.
        ///
        /// If the patterns are invalid or exceed the configured size limits,
        /// then an error will be returned when [`RegexSetBuilder::build`] is
        /// called.
        pub fn new<I, S>(patterns: I) -> RegexSetBuilder
        where
            I: IntoIterator<Item = S>,
            S: AsRef<str>,
        {
            RegexSetBuilder { builder: Builder::new(patterns) }
        }

        /// Compiles the patterns given to `RegexSetBuilder::new` with the
        /// configuration set on this builder.
        ///
        /// If the patterns aren't valid regexes or if a configured size limit
        /// was exceeded, then an error is returned.
        pub fn build(&self) -> Result<RegexSet, Error> {
            self.builder.build_many_string()
        }

        /// This configures Unicode mode for all of the patterns.
        ///
        /// Enabling Unicode mode does a number of things:
        ///
        /// * Most fundamentally, it causes the fundamental atom of matching
        /// to be a single codepoint. When Unicode mode is disabled, it's a
        /// single byte. For example, when Unicode mode is enabled, `.` will
        /// match `๐Ÿ’ฉ` once, whereas it will match 4 times when Unicode mode
        /// is disabled. (Since the UTF-8 encoding of `๐Ÿ’ฉ` is 4 bytes long.)
        /// * Case insensitive matching uses Unicode simple case folding rules.
        /// * Unicode character classes like `\p{Letter}` and `\p{Greek}` are
        /// available.
        /// * Perl character classes are Unicode aware. That is, `\w`, `\s` and
        /// `\d`.
        /// * The word boundary assertions, `\b` and `\B`, use the Unicode
        /// definition of a word character.
        ///
        /// Note that if Unicode mode is disabled, then the regex will fail to
        /// compile if it could match invalid UTF-8. For example, when Unicode
        /// mode is disabled, then since `.` matches any byte (except for
        /// `\n`), then it can match invalid UTF-8 and thus building a regex
        /// from it will fail. Another example is `\w` and `\W`. Since `\w` can
        /// only match ASCII bytes when Unicode mode is disabled, it's allowed.
/// But `\W` can match more than ASCII bytes, including invalid UTF-8, /// and so it is not allowed. This restriction can be lifted only by /// using a [`bytes::RegexSet`](crate::bytes::RegexSet). /// /// For more details on the Unicode support in this crate, see the /// [Unicode section](crate#unicode) in this crate's top-level /// documentation. /// /// The default for this is `true`. /// /// # Example /// /// ``` /// use regex::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"\w"]) /// .unicode(false) /// .build() /// .unwrap(); /// // Normally greek letters would be included in \w, but since /// // Unicode mode is disabled, it only matches ASCII letters. /// assert!(!re.is_match("ฮด")); /// /// let re = RegexSetBuilder::new([r"s"]) /// .case_insensitive(true) /// .unicode(false) /// .build() /// .unwrap(); /// // Normally 'ลฟ' is included when searching for 's' case /// // insensitively due to Unicode's simple case folding rules. But /// // when Unicode mode is disabled, only ASCII case insensitive rules /// // are used. /// assert!(!re.is_match("ลฟ")); /// ``` pub fn unicode(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.unicode(yes); self } /// This configures whether to enable case insensitive matching for all /// of the patterns. /// /// This setting can also be configured using the inline flag `i` /// in the pattern. For example, `(?i:foo)` matches `foo` case /// insensitively while `(?-i:foo)` matches `foo` case sensitively. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"foo(?-i:bar)quux"]) /// .case_insensitive(true) /// .build() /// .unwrap(); /// assert!(re.is_match("FoObarQuUx")); /// // Even though case insensitive matching is enabled in the builder, /// // it can be locally disabled within the pattern. In this case, /// // `bar` is matched case sensitively. /// assert!(!re.is_match("fooBARquux")); /// ``` pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.case_insensitive(yes); self } /// This configures multi-line mode for all of the patterns. /// /// Enabling multi-line mode changes the behavior of the `^` and `$` /// anchor assertions. Instead of only matching at the beginning and /// end of a haystack, respectively, multi-line mode causes them to /// match at the beginning and end of a line *in addition* to the /// beginning and end of a haystack. More precisely, `^` will match at /// the position immediately following a `\n` and `$` will match at the /// position immediately preceding a `\n`. /// /// The behavior of this option can be impacted by other settings too: /// /// * The [`RegexSetBuilder::line_terminator`] option changes `\n` /// above to any ASCII byte. /// * The [`RegexSetBuilder::crlf`] option changes the line terminator /// to be either `\r` or `\n`, but never at the position between a `\r` /// and `\n`. /// /// This setting can also be configured using the inline flag `m` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^foo$"]) /// .multi_line(true) /// .build() /// .unwrap(); /// assert!(re.is_match("\nfoo\n")); /// ``` pub fn multi_line(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.multi_line(yes); self } /// This configures dot-matches-new-line mode for the entire pattern. 
/// /// Perhaps surprisingly, the default behavior for `.` is not to match /// any character, but rather, to match any character except for the /// line terminator (which is `\n` by default). When this mode is /// enabled, the behavior changes such that `.` truly matches any /// character. /// /// This setting can also be configured using the inline flag `s` in /// the pattern. For example, `(?s:.)` and `\p{any}` are equivalent /// regexes. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"foo.bar"]) /// .dot_matches_new_line(true) /// .build() /// .unwrap(); /// let hay = "foo\nbar"; /// assert!(re.is_match(hay)); /// ``` pub fn dot_matches_new_line( &mut self, yes: bool, ) -> &mut RegexSetBuilder { self.builder.dot_matches_new_line(yes); self } /// This configures CRLF mode for all of the patterns. /// /// When CRLF mode is enabled, both `\r` ("carriage return" or CR for /// short) and `\n` ("line feed" or LF for short) are treated as line /// terminators. This results in the following: /// /// * Unless dot-matches-new-line mode is enabled, `.` will now match /// any character except for `\n` and `\r`. /// * When multi-line mode is enabled, `^` will match immediately /// following a `\n` or a `\r`. Similarly, `$` will match immediately /// preceding a `\n` or a `\r`. Neither `^` nor `$` will ever match /// between `\r` and `\n`. /// /// This setting can also be configured using the inline flag `R` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^foo$"]) /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// let hay = "\r\nfoo\r\n"; /// // If CRLF mode weren't enabled here, then '$' wouldn't match /// // immediately after 'foo', and thus no match would be found. /// assert!(re.is_match(hay)); /// ``` /// /// This example demonstrates that `^` will never match at a position /// between `\r` and `\n`. (`$` will similarly not match between a `\r` /// and a `\n`.) /// /// ``` /// use regex::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^\n"]) /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// assert!(!re.is_match("\r\n")); /// ``` pub fn crlf(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.crlf(yes); self } /// Configures the line terminator to be used by the regex. /// /// The line terminator is relevant in two ways for a particular regex: /// /// * When dot-matches-new-line mode is *not* enabled (the default), /// then `.` will match any character except for the configured line /// terminator. /// * When multi-line mode is enabled (not the default), then `^` and /// `$` will match immediately after and before, respectively, a line /// terminator. /// /// In both cases, if CRLF mode is enabled in a particular context, /// then it takes precedence over any configured line terminator. /// /// This option cannot be configured from within the pattern. /// /// The default line terminator is `\n`. /// /// # Example /// /// This shows how to treat the NUL byte as a line terminator. This can /// be a useful heuristic when searching binary data. 
///
        /// ```
        /// use regex::RegexSetBuilder;
        ///
        /// let re = RegexSetBuilder::new([r"^foo$"])
        ///     .multi_line(true)
        ///     .line_terminator(b'\x00')
        ///     .build()
        ///     .unwrap();
        /// let hay = "\x00foo\x00";
        /// assert!(re.is_match(hay));
        /// ```
        ///
        /// This example shows that the behavior of `.` is impacted by this
        /// setting as well:
        ///
        /// ```
        /// use regex::RegexSetBuilder;
        ///
        /// let re = RegexSetBuilder::new([r"."])
        ///     .line_terminator(b'\x00')
        ///     .build()
        ///     .unwrap();
        /// assert!(re.is_match("\n"));
        /// assert!(!re.is_match("\x00"));
        /// ```
        ///
        /// This shows that building a regex will fail if the byte given
        /// is not ASCII and the pattern could result in matching invalid
        /// UTF-8. This is because any singular non-ASCII byte is not valid
        /// UTF-8, and it is not permitted for a [`RegexSet`] to match invalid
        /// UTF-8. (It is permissible to use a non-ASCII byte when building a
        /// [`bytes::RegexSet`](crate::bytes::RegexSet).)
        ///
        /// ```
        /// use regex::RegexSetBuilder;
        ///
        /// assert!(
        ///     RegexSetBuilder::new([r"."])
        ///         .line_terminator(0x80)
        ///         .build()
        ///         .is_err()
        /// );
        /// // Note that using a non-ASCII byte isn't enough on its own to
        /// // cause regex compilation to fail. You actually have to make use
        /// // of it in the regex in a way that leads to matching invalid
        /// // UTF-8. If you don't, then regex compilation will succeed!
        /// assert!(
        ///     RegexSetBuilder::new([r"a"])
        ///         .line_terminator(0x80)
        ///         .build()
        ///         .is_ok()
        /// );
        /// ```
        pub fn line_terminator(&mut self, byte: u8) -> &mut RegexSetBuilder {
            self.builder.line_terminator(byte);
            self
        }

        /// This configures swap-greed mode for all of the patterns.
        ///
        /// When swap-greed mode is enabled, patterns like `a+` will become
        /// non-greedy and patterns like `a+?` will become greedy. In other
        /// words, the meanings of `a+` and `a+?` are switched.
        ///
        /// This setting can also be configured using the inline flag `U` in
        /// the pattern.
        ///
        /// Note that this is generally not useful for a `RegexSet` since a
        /// `RegexSet` can only report whether a pattern matches or not. Since
        /// greediness never impacts whether a match is found or not (only the
        /// offsets of the match), it follows that whether parts of a pattern
        /// are greedy or not doesn't matter for a `RegexSet`.
        ///
        /// The default for this is `false`.
        pub fn swap_greed(&mut self, yes: bool) -> &mut RegexSetBuilder {
            self.builder.swap_greed(yes);
            self
        }

        /// This configures verbose mode for all of the patterns.
        ///
        /// When enabled, whitespace will be treated as insignificant in the
        /// pattern and `#` can be used to start a comment until the next new
        /// line.
        ///
        /// Normally, in most places in a pattern, whitespace is treated
        /// literally. For example ` +` will match one or more ASCII whitespace
        /// characters.
        ///
        /// When verbose mode is enabled, `\#` can be used to match a literal
        /// `#` and `\ ` can be used to match a literal ASCII whitespace
        /// character.
        ///
        /// Verbose mode is useful for permitting regexes to be formatted and
        /// broken up more nicely. This may make them more easily readable.
        ///
        /// This setting can also be configured using the inline flag `x` in
        /// the pattern.
        ///
        /// The default for this is `false`.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::RegexSetBuilder;
        ///
        /// let pat = r"
        ///     \b
        ///     (?<first>\p{Uppercase}\w*)  # always start with uppercase letter
        ///     [\s--\n]+                   # whitespace should separate names
        ///     (?: # middle name can be an initial!
        ///         (?:(?<initial>\p{Uppercase})\.|(?<middle>\p{Uppercase}\w*))
        ///         [\s--\n]+
        ///     )?
///     (?<last>\p{Uppercase}\w*)
        ///     \b
        /// ";
        /// let re = RegexSetBuilder::new([pat])
        ///     .ignore_whitespace(true)
        ///     .build()
        ///     .unwrap();
        /// assert!(re.is_match("Harry Potter"));
        /// assert!(re.is_match("Harry J. Potter"));
        /// assert!(re.is_match("Harry James Potter"));
        /// assert!(!re.is_match("harry J. Potter"));
        /// ```
        pub fn ignore_whitespace(
            &mut self,
            yes: bool,
        ) -> &mut RegexSetBuilder {
            self.builder.ignore_whitespace(yes);
            self
        }

        /// This configures octal mode for all of the patterns.
        ///
        /// Octal syntax is a little-known way of uttering Unicode codepoints
        /// in a pattern. For example, `a`, `\x61`, `\u0061` and `\141` are all
        /// equivalent patterns, where the last example shows octal syntax.
        ///
        /// While supporting octal syntax isn't in and of itself a problem,
        /// it does make good error messages harder. That is, in PCRE based
        /// regex engines, syntax like `\1` invokes a backreference, which is
        /// explicitly unsupported by this library. However, many users expect
        /// backreferences to be supported. Therefore, when octal support
        /// is disabled, the error message will explicitly mention that
        /// backreferences aren't supported.
        ///
        /// The default for this is `false`.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::RegexSetBuilder;
        ///
        /// // Normally this pattern would not compile, with an error message
        /// // about backreferences not being supported. But with octal mode
        /// // enabled, octal escape sequences work.
        /// let re = RegexSetBuilder::new([r"\141"])
        ///     .octal(true)
        ///     .build()
        ///     .unwrap();
        /// assert!(re.is_match("a"));
        /// ```
        pub fn octal(&mut self, yes: bool) -> &mut RegexSetBuilder {
            self.builder.octal(yes);
            self
        }

        /// Sets the approximate size limit, in bytes, of the compiled regex.
        ///
        /// This roughly corresponds to the amount of heap memory, in
        /// bytes, occupied by a single regex. If the regex would otherwise
        /// approximately exceed this limit, then compiling that regex will
        /// fail.
        ///
        /// The main utility of a method like this is to avoid compiling
        /// regexes that use an unexpected amount of resources, such as
        /// time and memory. Even if the memory usage of a large regex is
        /// acceptable, its search time may not be. Namely, worst case time
        /// complexity for search is `O(m * n)`, where `m ~ len(pattern)` and
        /// `n ~ len(haystack)`. That is, search time depends, in part, on the
        /// size of the compiled regex. This means that putting a limit on the
        /// size of the regex limits how much a regex can impact search time.
        ///
        /// For more information about regex size limits, see the section on
        /// [untrusted inputs](crate#untrusted-input) in the top-level crate
        /// documentation.
        ///
        /// The default for this is some reasonable number that permits most
        /// patterns to compile successfully.
        ///
        /// # Example
        ///
        /// ```
        /// # if !cfg!(target_pointer_width = "64") { return; } // see #1041
        /// use regex::RegexSetBuilder;
        ///
        /// // It may surprise you how big some seemingly small patterns can
        /// // be! Since \w is Unicode aware, this generates a regex that can
        /// // match approximately 140,000 distinct codepoints.
        /// assert!(
        ///     RegexSetBuilder::new([r"\w"])
        ///         .size_limit(45_000)
        ///         .build()
        ///         .is_err()
        /// );
        /// ```
        pub fn size_limit(&mut self, bytes: usize) -> &mut RegexSetBuilder {
            self.builder.size_limit(bytes);
            self
        }

        /// Set the approximate capacity, in bytes, of the cache of transitions
        /// used by the lazy DFA.
        ///
        /// While the lazy DFA isn't always used, it tends to be the most
        /// commonly used regex engine in default configurations.
It tends to
        /// adopt the performance profile of a fully built DFA, but without the
        /// downside of taking worst case exponential time to build.
        ///
        /// The downside is that it needs to keep a cache of transitions and
        /// states that are built while running a search, and this cache
        /// can fill up. When it fills up, the cache will reset itself. Any
        /// previously generated states and transitions will then need to be
        /// re-generated. If this happens too many times, then this library
        /// will bail out of using the lazy DFA and switch to a different regex
        /// engine.
        ///
        /// If your regex provokes this particular downside of the lazy DFA,
        /// then it may be beneficial to increase its cache capacity. This will
        /// potentially reduce the frequency of cache resetting (ideally to
        /// `0`). While it won't fix all potential performance problems with
        /// the lazy DFA, increasing the cache capacity does fix some.
        ///
        /// There is no easy way to determine, a priori, whether increasing
        /// this cache capacity will help. In general, the larger your regex,
        /// the more cache it's likely to use. But that isn't an ironclad rule.
        /// For example, a regex like `[01]*1[01]{N}` would normally produce a
        /// fully built DFA that is exponential in size with respect to `N`.
        /// The lazy DFA will prevent exponential space blow-up, but its cache
        /// is likely to fill up, even when it's large and even for smallish
        /// values of `N`.
        ///
        /// If you aren't sure whether this helps or not, it is sensible to
        /// set this to some arbitrarily large number in testing, such as
        /// `usize::MAX`. Namely, this represents the amount of capacity that
        /// *may* be used. It's probably not a good idea to use `usize::MAX` in
        /// production though, since it implies there are no controls on heap
        /// memory used by this library during a search. In effect, set it to
        /// whatever you're willing to allocate for a single regex search.
        pub fn dfa_size_limit(
            &mut self,
            bytes: usize,
        ) -> &mut RegexSetBuilder {
            self.builder.dfa_size_limit(bytes);
            self
        }

        /// Set the nesting limit for this parser.
        ///
        /// The nesting limit controls how deep the abstract syntax tree is
        /// allowed to be. If the AST exceeds the given limit (e.g., with too
        /// many nested groups), then an error is returned by the parser.
        ///
        /// The purpose of this limit is to act as a heuristic to prevent stack
        /// overflow for consumers that do structural induction on an AST using
        /// explicit recursion. While this crate never does this (instead using
        /// constant stack space and moving the call stack to the heap), other
        /// crates may.
        ///
        /// This limit is not checked until the entire AST is parsed.
        /// Therefore, if callers want to put a limit on the amount of heap
        /// space used, then they should impose a limit on the length, in
        /// bytes, of the concrete pattern string. In particular, this is
        /// viable since this parser implementation will limit itself to heap
        /// space proportional to the length of the pattern string. See also
        /// the [untrusted inputs](crate#untrusted-input) section in the
        /// top-level crate documentation for more information about this.
        ///
        /// Note that a nest limit of `0` will return a nest limit error for
        /// most patterns but not all. For example, a nest limit of `0` permits
        /// `a` but not `ab`, since `ab` requires an explicit concatenation,
        /// which results in a nest depth of `1`. In general, a nest limit is
        /// not something that manifests in an obvious way in the concrete
        /// syntax, therefore, it should not be used in a granular way.
///
        /// # Example
        ///
        /// ```
        /// use regex::RegexSetBuilder;
        ///
        /// assert!(RegexSetBuilder::new([r"a"]).nest_limit(0).build().is_ok());
        /// assert!(RegexSetBuilder::new([r"ab"]).nest_limit(0).build().is_err());
        /// ```
        pub fn nest_limit(&mut self, limit: u32) -> &mut RegexSetBuilder {
            self.builder.nest_limit(limit);
            self
        }
    }
}

pub(crate) mod bytes {
    use crate::{
        bytes::{Regex, RegexSet},
        error::Error,
    };

    use super::Builder;

    /// A configurable builder for a [`Regex`].
    ///
    /// This builder can be used to programmatically set flags such as `i`
    /// (case insensitive) and `x` (for verbose mode). This builder can also be
    /// used to configure things like the line terminator and a size limit on
    /// the compiled regular expression.
    #[derive(Clone, Debug)]
    pub struct RegexBuilder {
        builder: Builder,
    }

    impl RegexBuilder {
        /// Create a new builder with a default configuration for the given
        /// pattern.
        ///
        /// If the pattern is invalid or exceeds the configured size limits,
        /// then an error will be returned when [`RegexBuilder::build`] is
        /// called.
        pub fn new(pattern: &str) -> RegexBuilder {
            RegexBuilder { builder: Builder::new([pattern]) }
        }

        /// Compiles the pattern given to `RegexBuilder::new` with the
        /// configuration set on this builder.
        ///
        /// If the pattern isn't a valid regex or if a configured size limit
        /// was exceeded, then an error is returned.
        pub fn build(&self) -> Result<Regex, Error> {
            self.builder.build_one_bytes()
        }

        /// This configures Unicode mode for the entire pattern.
        ///
        /// Enabling Unicode mode does a number of things:
        ///
        /// * Most fundamentally, it causes the fundamental atom of matching
        /// to be a single codepoint. When Unicode mode is disabled, it's a
        /// single byte. For example, when Unicode mode is enabled, `.` will
        /// match `๐Ÿ’ฉ` once, whereas it will match 4 times when Unicode mode
        /// is disabled. (Since the UTF-8 encoding of `๐Ÿ’ฉ` is 4 bytes long.)
        /// * Case insensitive matching uses Unicode simple case folding rules.
        /// * Unicode character classes like `\p{Letter}` and `\p{Greek}` are
        /// available.
        /// * Perl character classes are Unicode aware. That is, `\w`, `\s` and
        /// `\d`.
        /// * The word boundary assertions, `\b` and `\B`, use the Unicode
        /// definition of a word character.
        ///
        /// Note that unlike the top-level `Regex` for searching `&str`, it
        /// is permitted to disable Unicode mode even if the resulting pattern
        /// could match invalid UTF-8. For example, `(?-u:.)` is not a valid
        /// pattern for a top-level `Regex`, but is valid for a `bytes::Regex`.
        ///
        /// For more details on the Unicode support in this crate, see the
        /// [Unicode section](crate#unicode) in this crate's top-level
        /// documentation.
        ///
        /// The default for this is `true`.
        ///
        /// # Example
        ///
        /// ```
        /// use regex::bytes::RegexBuilder;
        ///
        /// let re = RegexBuilder::new(r"\w")
        ///     .unicode(false)
        ///     .build()
        ///     .unwrap();
        /// // Normally greek letters would be included in \w, but since
        /// // Unicode mode is disabled, it only matches ASCII letters.
        /// assert!(!re.is_match("ฮด".as_bytes()));
        ///
        /// let re = RegexBuilder::new(r"s")
        ///     .case_insensitive(true)
        ///     .unicode(false)
        ///     .build()
        ///     .unwrap();
        /// // Normally 'ลฟ' is included when searching for 's' case
        /// // insensitively due to Unicode's simple case folding rules. But
        /// // when Unicode mode is disabled, only ASCII case insensitive rules
        /// // are used.
/// assert!(!re.is_match("ลฟ".as_bytes())); /// ``` /// /// Since this builder is for constructing a [`bytes::Regex`](Regex), /// one can disable Unicode mode even if it would match invalid UTF-8: /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r".") /// .unicode(false) /// .build() /// .unwrap(); /// // Normally greek letters would be included in \w, but since /// // Unicode mode is disabled, it only matches ASCII letters. /// assert!(re.is_match(b"\xFF")); /// ``` pub fn unicode(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.unicode(yes); self } /// This configures whether to enable case insensitive matching for the /// entire pattern. /// /// This setting can also be configured using the inline flag `i` /// in the pattern. For example, `(?i:foo)` matches `foo` case /// insensitively while `(?-i:foo)` matches `foo` case sensitively. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"foo(?-i:bar)quux") /// .case_insensitive(true) /// .build() /// .unwrap(); /// assert!(re.is_match(b"FoObarQuUx")); /// // Even though case insensitive matching is enabled in the builder, /// // it can be locally disabled within the pattern. In this case, /// // `bar` is matched case sensitively. /// assert!(!re.is_match(b"fooBARquux")); /// ``` pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.case_insensitive(yes); self } /// This configures multi-line mode for the entire pattern. /// /// Enabling multi-line mode changes the behavior of the `^` and `$` /// anchor assertions. Instead of only matching at the beginning and /// end of a haystack, respectively, multi-line mode causes them to /// match at the beginning and end of a line *in addition* to the /// beginning and end of a haystack. More precisely, `^` will match at /// the position immediately following a `\n` and `$` will match at the /// position immediately preceding a `\n`. /// /// The behavior of this option can be impacted by other settings too: /// /// * The [`RegexBuilder::line_terminator`] option changes `\n` above /// to any ASCII byte. /// * The [`RegexBuilder::crlf`] option changes the line terminator to /// be either `\r` or `\n`, but never at the position between a `\r` /// and `\n`. /// /// This setting can also be configured using the inline flag `m` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"^foo$") /// .multi_line(true) /// .build() /// .unwrap(); /// assert_eq!(Some(1..4), re.find(b"\nfoo\n").map(|m| m.range())); /// ``` pub fn multi_line(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.multi_line(yes); self } /// This configures dot-matches-new-line mode for the entire pattern. /// /// Perhaps surprisingly, the default behavior for `.` is not to match /// any character, but rather, to match any character except for the /// line terminator (which is `\n` by default). When this mode is /// enabled, the behavior changes such that `.` truly matches any /// character. /// /// This setting can also be configured using the inline flag `s` in /// the pattern. For example, `(?s:.)` and `\p{any}` are equivalent /// regexes. /// /// The default for this is `false`. 
/// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"foo.bar") /// .dot_matches_new_line(true) /// .build() /// .unwrap(); /// let hay = b"foo\nbar"; /// assert_eq!(Some(&b"foo\nbar"[..]), re.find(hay).map(|m| m.as_bytes())); /// ``` pub fn dot_matches_new_line( &mut self, yes: bool, ) -> &mut RegexBuilder { self.builder.dot_matches_new_line(yes); self } /// This configures CRLF mode for the entire pattern. /// /// When CRLF mode is enabled, both `\r` ("carriage return" or CR for /// short) and `\n` ("line feed" or LF for short) are treated as line /// terminators. This results in the following: /// /// * Unless dot-matches-new-line mode is enabled, `.` will now match /// any character except for `\n` and `\r`. /// * When multi-line mode is enabled, `^` will match immediately /// following a `\n` or a `\r`. Similarly, `$` will match immediately /// preceding a `\n` or a `\r`. Neither `^` nor `$` will ever match /// between `\r` and `\n`. /// /// This setting can also be configured using the inline flag `R` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"^foo$") /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// let hay = b"\r\nfoo\r\n"; /// // If CRLF mode weren't enabled here, then '$' wouldn't match /// // immediately after 'foo', and thus no match would be found. /// assert_eq!(Some(&b"foo"[..]), re.find(hay).map(|m| m.as_bytes())); /// ``` /// /// This example demonstrates that `^` will never match at a position /// between `\r` and `\n`. (`$` will similarly not match between a `\r` /// and a `\n`.) /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"^") /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// let hay = b"\r\n\r\n"; /// let ranges: Vec<_> = re.find_iter(hay).map(|m| m.range()).collect(); /// assert_eq!(ranges, vec![0..0, 2..2, 4..4]); /// ``` pub fn crlf(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.crlf(yes); self } /// Configures the line terminator to be used by the regex. /// /// The line terminator is relevant in two ways for a particular regex: /// /// * When dot-matches-new-line mode is *not* enabled (the default), /// then `.` will match any character except for the configured line /// terminator. /// * When multi-line mode is enabled (not the default), then `^` and /// `$` will match immediately after and before, respectively, a line /// terminator. /// /// In both cases, if CRLF mode is enabled in a particular context, /// then it takes precedence over any configured line terminator. /// /// This option cannot be configured from within the pattern. /// /// The default line terminator is `\n`. /// /// # Example /// /// This shows how to treat the NUL byte as a line terminator. This can /// be a useful heuristic when searching binary data. 
/// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"^foo$") /// .multi_line(true) /// .line_terminator(b'\x00') /// .build() /// .unwrap(); /// let hay = b"\x00foo\x00"; /// assert_eq!(Some(1..4), re.find(hay).map(|m| m.range())); /// ``` /// /// This example shows that the behavior of `.` is impacted by this /// setting as well: /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r".") /// .line_terminator(b'\x00') /// .build() /// .unwrap(); /// assert!(re.is_match(b"\n")); /// assert!(!re.is_match(b"\x00")); /// ``` /// /// This shows that building a regex will work even when the byte /// given is not ASCII. This is unlike the top-level `Regex` API where /// matching invalid UTF-8 is not allowed. /// /// Note though that you must disable Unicode mode. This is required /// because Unicode mode requires matching one codepoint at a time, /// and there is no way to match a non-ASCII byte as if it were a /// codepoint. /// /// ``` /// use regex::bytes::RegexBuilder; /// /// assert!( /// RegexBuilder::new(r".") /// .unicode(false) /// .line_terminator(0x80) /// .build() /// .is_ok(), /// ); /// ``` pub fn line_terminator(&mut self, byte: u8) -> &mut RegexBuilder { self.builder.line_terminator(byte); self } /// This configures swap-greed mode for the entire pattern. /// /// When swap-greed mode is enabled, patterns like `a+` will become /// non-greedy and patterns like `a+?` will become greedy. In other /// words, the meanings of `a+` and `a+?` are switched. /// /// This setting can also be configured using the inline flag `U` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let re = RegexBuilder::new(r"a+") /// .swap_greed(true) /// .build() /// .unwrap(); /// assert_eq!(Some(&b"a"[..]), re.find(b"aaa").map(|m| m.as_bytes())); /// ``` pub fn swap_greed(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.swap_greed(yes); self } /// This configures verbose mode for the entire pattern. /// /// When enabled, whitespace will be treated as insignificant in the /// pattern and `#` can be used to start a comment until the next new /// line. /// /// Normally, in most places in a pattern, whitespace is treated /// literally. For example ` +` will match one or more ASCII whitespace /// characters. /// /// When verbose mode is enabled, `\#` can be used to match a literal /// `#` and `\ ` can be used to match a literal ASCII whitespace /// character. /// /// Verbose mode is useful for permitting regexes to be formatted and /// broken up more nicely. This may make them more easily readable. /// /// This setting can also be configured using the inline flag `x` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// let pat = r" /// \b /// (?<first>\p{Uppercase}\w*) # always start with uppercase letter /// [\s--\n]+ # whitespace should separate names /// (?: # middle name can be an initial! /// (?:(?<initial>\p{Uppercase})\.|(?<middle>\p{Uppercase}\w*)) /// [\s--\n]+ /// )? /// (?<last>\p{Uppercase}\w*) /// \b /// "; /// let re = RegexBuilder::new(pat) /// .ignore_whitespace(true) /// .build() /// .unwrap(); /// /// let caps = re.captures(b"Harry Potter").unwrap(); /// assert_eq!(&b"Harry"[..], &caps["first"]); /// assert_eq!(&b"Potter"[..], &caps["last"]); /// /// let caps = re.captures(b"Harry J.
Potter").unwrap(); /// assert_eq!(&b"Harry"[..], &caps["first"]); /// // Since a middle name/initial isn't required for an overall match, /// // we can't assume that 'initial' or 'middle' will be populated! /// assert_eq!( /// Some(&b"J"[..]), /// caps.name("initial").map(|m| m.as_bytes()), /// ); /// assert_eq!(None, caps.name("middle").map(|m| m.as_bytes())); /// assert_eq!(&b"Potter"[..], &caps["last"]); /// /// let caps = re.captures(b"Harry James Potter").unwrap(); /// assert_eq!(&b"Harry"[..], &caps["first"]); /// // Since a middle name/initial isn't required for an overall match, /// // we can't assume that 'initial' or 'middle' will be populated! /// assert_eq!(None, caps.name("initial").map(|m| m.as_bytes())); /// assert_eq!( /// Some(&b"James"[..]), /// caps.name("middle").map(|m| m.as_bytes()), /// ); /// assert_eq!(&b"Potter"[..], &caps["last"]); /// ``` pub fn ignore_whitespace(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.ignore_whitespace(yes); self } /// This configures octal mode for the entire pattern. /// /// Octal syntax is a little-known way of uttering Unicode codepoints /// in a pattern. For example, `a`, `\x61`, `\u0061` and `\141` are all /// equivalent patterns, where the last example shows octal syntax. /// /// While supporting octal syntax isn't in and of itself a problem, /// it does make good error messages harder. That is, in PCRE based /// regex engines, syntax like `\1` invokes a backreference, which is /// explicitly unsupported this library. However, many users expect /// backreferences to be supported. Therefore, when octal support /// is disabled, the error message will explicitly mention that /// backreferences aren't supported. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// // Normally this pattern would not compile, with an error message /// // about backreferences not being supported. But with octal mode /// // enabled, octal escape sequences work. /// let re = RegexBuilder::new(r"\141") /// .octal(true) /// .build() /// .unwrap(); /// assert!(re.is_match(b"a")); /// ``` pub fn octal(&mut self, yes: bool) -> &mut RegexBuilder { self.builder.octal(yes); self } /// Sets the approximate size limit, in bytes, of the compiled regex. /// /// This roughly corresponds to the number of heap memory, in /// bytes, occupied by a single regex. If the regex would otherwise /// approximately exceed this limit, then compiling that regex will /// fail. /// /// The main utility of a method like this is to avoid compiling /// regexes that use an unexpected amount of resources, such as /// time and memory. Even if the memory usage of a large regex is /// acceptable, its search time may not be. Namely, worst case time /// complexity for search is `O(m * n)`, where `m ~ len(pattern)` and /// `n ~ len(haystack)`. That is, search time depends, in part, on the /// size of the compiled regex. This means that putting a limit on the /// size of the regex limits how much a regex can impact search time. /// /// For more information about regex size limits, see the section on /// [untrusted inputs](crate#untrusted-input) in the top-level crate /// documentation. /// /// The default for this is some reasonable number that permits most /// patterns to compile successfully. /// /// # Example /// /// ``` /// # if !cfg!(target_pointer_width = "64") { return; } // see #1041 /// use regex::bytes::RegexBuilder; /// /// // It may surprise you how big some seemingly small patterns can /// // be! 
Since \w is Unicode aware, this generates a regex that can /// // match approximately 140,000 distinct codepoints. /// assert!(RegexBuilder::new(r"\w").size_limit(45_000).build().is_err()); /// ``` pub fn size_limit(&mut self, bytes: usize) -> &mut RegexBuilder { self.builder.size_limit(bytes); self } /// Set the approximate capacity, in bytes, of the cache of transitions /// used by the lazy DFA. /// /// While the lazy DFA isn't always used, it tends to be the most /// commonly used regex engine in default configurations. It tends to /// adopt the performance profile of a fully built DFA, but without the /// downside of taking worst case exponential time to build. /// /// The downside is that it needs to keep a cache of transitions and /// states that are built while running a search, and this cache /// can fill up. When it fills up, the cache will reset itself. Any /// previously generated states and transitions will then need to be /// re-generated. If this happens too many times, then this library /// will bail out of using the lazy DFA and switch to a different regex /// engine. /// /// If your regex provokes this particular downside of the lazy DFA, /// then it may be beneficial to increase its cache capacity. This will /// potentially reduce the frequency of cache resetting (ideally to /// `0`). While it won't fix all potential performance problems with /// the lazy DFA, increasing the cache capacity does fix some. /// /// There is no easy way to determine, a priori, whether increasing /// this cache capacity will help. In general, the larger your regex, /// the more cache it's likely to use. But that isn't an ironclad rule. /// For example, a regex like `[01]*1[01]{N}` would normally produce a /// fully built DFA that is exponential in size with respect to `N`. /// The lazy DFA will prevent exponential space blow-up, but its cache /// is likely to fill up, even when it's large and even for smallish /// values of `N`. /// /// If you aren't sure whether this helps or not, it is sensible to /// set this to some arbitrarily large number in testing, such as /// `usize::MAX`. Namely, this represents the amount of capacity that /// *may* be used. It's probably not a good idea to use `usize::MAX` in /// production though, since it implies there are no controls on heap /// memory used by this library during a search. In effect, set it to /// whatever you're willing to allocate for a single regex search. pub fn dfa_size_limit(&mut self, bytes: usize) -> &mut RegexBuilder { self.builder.dfa_size_limit(bytes); self } /// Set the nesting limit for this parser. /// /// The nesting limit controls how deep the abstract syntax tree is /// allowed to be. If the AST exceeds the given limit (e.g., with too /// many nested groups), then an error is returned by the parser. /// /// The purpose of this limit is to act as a heuristic to prevent stack /// overflow for consumers that do structural induction on an AST using /// explicit recursion. While this crate never does this (instead using /// constant stack space and moving the call stack to the heap), other /// crates may. /// /// This limit is not checked until the entire AST is parsed. /// Therefore, if callers want to put a limit on the amount of heap /// space used, then they should impose a limit on the length, in /// bytes, of the concrete pattern string. In particular, this is /// viable since this parser implementation will limit itself to heap /// space proportional to the length of the pattern string.
See also /// the [untrusted inputs](crate#untrusted-input) section in the /// top-level crate documentation for more information about this. /// /// Note that a nest limit of `0` will return a nest limit error for /// most patterns but not all. For example, a nest limit of `0` permits /// `a` but not `ab`, since `ab` requires an explicit concatenation, /// which results in a nest depth of `1`. In general, a nest limit is /// not something that manifests in an obvious way in the concrete /// syntax, therefore, it should not be used in a granular way. /// /// # Example /// /// ``` /// use regex::bytes::RegexBuilder; /// /// assert!(RegexBuilder::new(r"a").nest_limit(0).build().is_ok()); /// assert!(RegexBuilder::new(r"ab").nest_limit(0).build().is_err()); /// ``` pub fn nest_limit(&mut self, limit: u32) -> &mut RegexBuilder { self.builder.nest_limit(limit); self } } /// A configurable builder for a [`RegexSet`]. /// /// This builder can be used to programmatically set flags such as `i` /// (case insensitive) and `x` (for verbose mode). This builder can also be /// used to configure things like the line terminator and a size limit on /// the compiled regular expression. #[derive(Clone, Debug)] pub struct RegexSetBuilder { builder: Builder, } impl RegexSetBuilder { /// Create a new builder with a default configuration for the given /// patterns. /// /// If the patterns are invalid or exceed the configured size limits, /// then an error will be returned when [`RegexSetBuilder::build`] is /// called. pub fn new<I, S>(patterns: I) -> RegexSetBuilder where I: IntoIterator<Item = S>, S: AsRef<str>, { RegexSetBuilder { builder: Builder::new(patterns) } } /// Compiles the patterns given to `RegexSetBuilder::new` with the /// configuration set on this builder. /// /// If the patterns aren't valid regexes or if a configured size limit /// was exceeded, then an error is returned. pub fn build(&self) -> Result<RegexSet, Error> { self.builder.build_many_bytes() } /// This configures Unicode mode for all of the patterns. /// /// Enabling Unicode mode does a number of things: /// /// * Most fundamentally, it causes the fundamental atom of matching /// to be a single codepoint. When Unicode mode is disabled, it's a /// single byte. For example, when Unicode mode is enabled, `.` will /// match `๐Ÿ’ฉ` once, whereas it will match 4 times when Unicode mode /// is disabled. (Since the UTF-8 encoding of `๐Ÿ’ฉ` is 4 bytes long.) /// * Case insensitive matching uses Unicode simple case folding rules. /// * Unicode character classes like `\p{Letter}` and `\p{Greek}` are /// available. /// * Perl character classes are Unicode aware. That is, `\w`, `\s` and /// `\d`. /// * The word boundary assertions, `\b` and `\B`, use the Unicode /// definition of a word character. /// /// Note that unlike the top-level `RegexSet` for searching `&str`, /// it is permitted to disable Unicode mode even if the resulting /// pattern could match invalid UTF-8. For example, `(?-u:.)` is not /// a valid pattern for a top-level `RegexSet`, but is valid for a /// `bytes::RegexSet`. /// /// For more details on the Unicode support in this crate, see the /// [Unicode section](crate#unicode) in this crate's top-level /// documentation. /// /// The default for this is `true`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"\w"]) /// .unicode(false) /// .build() /// .unwrap(); /// // Normally greek letters would be included in \w, but since /// // Unicode mode is disabled, it only matches ASCII letters.
/// assert!(!re.is_match("ฮด".as_bytes())); /// /// let re = RegexSetBuilder::new([r"s"]) /// .case_insensitive(true) /// .unicode(false) /// .build() /// .unwrap(); /// // Normally 'ลฟ' is included when searching for 's' case /// // insensitively due to Unicode's simple case folding rules. But /// // when Unicode mode is disabled, only ASCII case insensitive rules /// // are used. /// assert!(!re.is_match("ลฟ".as_bytes())); /// ``` /// /// Since this builder is for constructing a /// [`bytes::RegexSet`](RegexSet), one can disable Unicode mode even if /// it would match invalid UTF-8: /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"."]) /// .unicode(false) /// .build() /// .unwrap(); /// // Normally greek letters would be included in \w, but since /// // Unicode mode is disabled, it only matches ASCII letters. /// assert!(re.is_match(b"\xFF")); /// ``` pub fn unicode(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.unicode(yes); self } /// This configures whether to enable case insensitive matching for all /// of the patterns. /// /// This setting can also be configured using the inline flag `i` /// in the pattern. For example, `(?i:foo)` matches `foo` case /// insensitively while `(?-i:foo)` matches `foo` case sensitively. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"foo(?-i:bar)quux"]) /// .case_insensitive(true) /// .build() /// .unwrap(); /// assert!(re.is_match(b"FoObarQuUx")); /// // Even though case insensitive matching is enabled in the builder, /// // it can be locally disabled within the pattern. In this case, /// // `bar` is matched case sensitively. /// assert!(!re.is_match(b"fooBARquux")); /// ``` pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.case_insensitive(yes); self } /// This configures multi-line mode for all of the patterns. /// /// Enabling multi-line mode changes the behavior of the `^` and `$` /// anchor assertions. Instead of only matching at the beginning and /// end of a haystack, respectively, multi-line mode causes them to /// match at the beginning and end of a line *in addition* to the /// beginning and end of a haystack. More precisely, `^` will match at /// the position immediately following a `\n` and `$` will match at the /// position immediately preceding a `\n`. /// /// The behavior of this option can be impacted by other settings too: /// /// * The [`RegexSetBuilder::line_terminator`] option changes `\n` /// above to any ASCII byte. /// * The [`RegexSetBuilder::crlf`] option changes the line terminator /// to be either `\r` or `\n`, but never at the position between a `\r` /// and `\n`. /// /// This setting can also be configured using the inline flag `m` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^foo$"]) /// .multi_line(true) /// .build() /// .unwrap(); /// assert!(re.is_match(b"\nfoo\n")); /// ``` pub fn multi_line(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.multi_line(yes); self } /// This configures dot-matches-new-line mode for the entire pattern. /// /// Perhaps surprisingly, the default behavior for `.` is not to match /// any character, but rather, to match any character except for the /// line terminator (which is `\n` by default). 
When this mode is /// enabled, the behavior changes such that `.` truly matches any /// character. /// /// This setting can also be configured using the inline flag `s` in /// the pattern. For example, `(?s:.)` and `\p{any}` are equivalent /// regexes. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"foo.bar"]) /// .dot_matches_new_line(true) /// .build() /// .unwrap(); /// let hay = b"foo\nbar"; /// assert!(re.is_match(hay)); /// ``` pub fn dot_matches_new_line( &mut self, yes: bool, ) -> &mut RegexSetBuilder { self.builder.dot_matches_new_line(yes); self } /// This configures CRLF mode for all of the patterns. /// /// When CRLF mode is enabled, both `\r` ("carriage return" or CR for /// short) and `\n` ("line feed" or LF for short) are treated as line /// terminators. This results in the following: /// /// * Unless dot-matches-new-line mode is enabled, `.` will now match /// any character except for `\n` and `\r`. /// * When multi-line mode is enabled, `^` will match immediately /// following a `\n` or a `\r`. Similarly, `$` will match immediately /// preceding a `\n` or a `\r`. Neither `^` nor `$` will ever match /// between `\r` and `\n`. /// /// This setting can also be configured using the inline flag `R` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^foo$"]) /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// let hay = b"\r\nfoo\r\n"; /// // If CRLF mode weren't enabled here, then '$' wouldn't match /// // immediately after 'foo', and thus no match would be found. /// assert!(re.is_match(hay)); /// ``` /// /// This example demonstrates that `^` will never match at a position /// between `\r` and `\n`. (`$` will similarly not match between a `\r` /// and a `\n`.) /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^\n"]) /// .multi_line(true) /// .crlf(true) /// .build() /// .unwrap(); /// assert!(!re.is_match(b"\r\n")); /// ``` pub fn crlf(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.crlf(yes); self } /// Configures the line terminator to be used by the regex. /// /// The line terminator is relevant in two ways for a particular regex: /// /// * When dot-matches-new-line mode is *not* enabled (the default), /// then `.` will match any character except for the configured line /// terminator. /// * When multi-line mode is enabled (not the default), then `^` and /// `$` will match immediately after and before, respectively, a line /// terminator. /// /// In both cases, if CRLF mode is enabled in a particular context, /// then it takes precedence over any configured line terminator. /// /// This option cannot be configured from within the pattern. /// /// The default line terminator is `\n`. /// /// # Example /// /// This shows how to treat the NUL byte as a line terminator. This can /// be a useful heuristic when searching binary data. 
/// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"^foo$"]) /// .multi_line(true) /// .line_terminator(b'\x00') /// .build() /// .unwrap(); /// let hay = b"\x00foo\x00"; /// assert!(re.is_match(hay)); /// ``` /// /// This example shows that the behavior of `.` is impacted by this /// setting as well: /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let re = RegexSetBuilder::new([r"."]) /// .line_terminator(b'\x00') /// .build() /// .unwrap(); /// assert!(re.is_match(b"\n")); /// assert!(!re.is_match(b"\x00")); /// ``` /// /// This shows that building a regex will work even when the byte given /// is not ASCII. This is unlike the top-level `RegexSet` API where /// matching invalid UTF-8 is not allowed. /// /// Note though that you must disable Unicode mode. This is required /// because Unicode mode requires matching one codepoint at a time, /// and there is no way to match a non-ASCII byte as if it were a /// codepoint. /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// assert!( /// RegexSetBuilder::new([r"."]) /// .unicode(false) /// .line_terminator(0x80) /// .build() /// .is_ok(), /// ); /// ``` pub fn line_terminator(&mut self, byte: u8) -> &mut RegexSetBuilder { self.builder.line_terminator(byte); self } /// This configures swap-greed mode for all of the patterns. /// /// When swap-greed mode is enabled, patterns like `a+` will become /// non-greedy and patterns like `a+?` will become greedy. In other /// words, the meanings of `a+` and `a+?` are switched. /// /// This setting can also be configured using the inline flag `U` in /// the pattern. /// /// Note that this is generally not useful for a `RegexSet` since a /// `RegexSet` can only report whether a pattern matches or not. Since /// greediness never impacts whether a match is found or not (only the /// offsets of the match), it follows that whether parts of a pattern /// are greedy or not doesn't matter for a `RegexSet`. /// /// The default for this is `false`. pub fn swap_greed(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.swap_greed(yes); self } /// This configures verbose mode for all of the patterns. /// /// When enabled, whitespace will be treated as insignificant in the /// pattern and `#` can be used to start a comment until the next new /// line. /// /// Normally, in most places in a pattern, whitespace is treated /// literally. For example ` +` will match one or more ASCII whitespace /// characters. /// /// When verbose mode is enabled, `\#` can be used to match a literal /// `#` and `\ ` can be used to match a literal ASCII whitespace /// character. /// /// Verbose mode is useful for permitting regexes to be formatted and /// broken up more nicely. This may make them more easily readable. /// /// This setting can also be configured using the inline flag `x` in /// the pattern. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// let pat = r" /// \b /// (?<first>\p{Uppercase}\w*) # always start with uppercase letter /// [\s--\n]+ # whitespace should separate names /// (?: # middle name can be an initial! /// (?:(?<initial>\p{Uppercase})\.|(?<middle>\p{Uppercase}\w*)) /// [\s--\n]+ /// )? /// (?<last>\p{Uppercase}\w*) /// \b /// "; /// let re = RegexSetBuilder::new([pat]) /// .ignore_whitespace(true) /// .build() /// .unwrap(); /// assert!(re.is_match(b"Harry Potter")); /// assert!(re.is_match(b"Harry J. Potter")); /// assert!(re.is_match(b"Harry James Potter")); /// assert!(!re.is_match(b"harry J.
Potter")); /// ``` pub fn ignore_whitespace( &mut self, yes: bool, ) -> &mut RegexSetBuilder { self.builder.ignore_whitespace(yes); self } /// This configures octal mode for all of the patterns. /// /// Octal syntax is a little-known way of uttering Unicode codepoints /// in a pattern. For example, `a`, `\x61`, `\u0061` and `\141` are all /// equivalent patterns, where the last example shows octal syntax. /// /// While supporting octal syntax isn't in and of itself a problem, /// it does make good error messages harder. That is, in PCRE based /// regex engines, syntax like `\1` invokes a backreference, which is /// explicitly unsupported this library. However, many users expect /// backreferences to be supported. Therefore, when octal support /// is disabled, the error message will explicitly mention that /// backreferences aren't supported. /// /// The default for this is `false`. /// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// // Normally this pattern would not compile, with an error message /// // about backreferences not being supported. But with octal mode /// // enabled, octal escape sequences work. /// let re = RegexSetBuilder::new([r"\141"]) /// .octal(true) /// .build() /// .unwrap(); /// assert!(re.is_match(b"a")); /// ``` pub fn octal(&mut self, yes: bool) -> &mut RegexSetBuilder { self.builder.octal(yes); self } /// Sets the approximate size limit, in bytes, of the compiled regex. /// /// This roughly corresponds to the number of heap memory, in /// bytes, occupied by a single regex. If the regex would otherwise /// approximately exceed this limit, then compiling that regex will /// fail. /// /// The main utility of a method like this is to avoid compiling /// regexes that use an unexpected amount of resources, such as /// time and memory. Even if the memory usage of a large regex is /// acceptable, its search time may not be. Namely, worst case time /// complexity for search is `O(m * n)`, where `m ~ len(pattern)` and /// `n ~ len(haystack)`. That is, search time depends, in part, on the /// size of the compiled regex. This means that putting a limit on the /// size of the regex limits how much a regex can impact search time. /// /// For more information about regex size limits, see the section on /// [untrusted inputs](crate#untrusted-input) in the top-level crate /// documentation. /// /// The default for this is some reasonable number that permits most /// patterns to compile successfully. /// /// # Example /// /// ``` /// # if !cfg!(target_pointer_width = "64") { return; } // see #1041 /// use regex::bytes::RegexSetBuilder; /// /// // It may surprise you how big some seemingly small patterns can /// // be! Since \w is Unicode aware, this generates a regex that can /// // match approximately 140,000 distinct codepoints. /// assert!( /// RegexSetBuilder::new([r"\w"]) /// .size_limit(45_000) /// .build() /// .is_err() /// ); /// ``` pub fn size_limit(&mut self, bytes: usize) -> &mut RegexSetBuilder { self.builder.size_limit(bytes); self } /// Set the approximate capacity, in bytes, of the cache of transitions /// used by the lazy DFA. /// /// While the lazy DFA isn't always used, in tends to be the most /// commonly use regex engine in default configurations. It tends to /// adopt the performance profile of a fully build DFA, but without the /// downside of taking worst case exponential time to build. 
/// /// The downside is that it needs to keep a cache of transitions and /// states that are built while running a search, and this cache /// can fill up. When it fills up, the cache will reset itself. Any /// previously generated states and transitions will then need to be /// re-generated. If this happens too many times, then this library /// will bail out of using the lazy DFA and switch to a different regex /// engine. /// /// If your regex provokes this particular downside of the lazy DFA, /// then it may be beneficial to increase its cache capacity. This will /// potentially reduce the frequency of cache resetting (ideally to /// `0`). While it won't fix all potential performance problems with /// the lazy DFA, increasing the cache capacity does fix some. /// /// There is no easy way to determine, a priori, whether increasing /// this cache capacity will help. In general, the larger your regex, /// the more cache it's likely to use. But that isn't an ironclad rule. /// For example, a regex like `[01]*1[01]{N}` would normally produce a /// fully built DFA that is exponential in size with respect to `N`. /// The lazy DFA will prevent exponential space blow-up, but its cache /// is likely to fill up, even when it's large and even for smallish /// values of `N`. /// /// If you aren't sure whether this helps or not, it is sensible to /// set this to some arbitrarily large number in testing, such as /// `usize::MAX`. Namely, this represents the amount of capacity that /// *may* be used. It's probably not a good idea to use `usize::MAX` in /// production though, since it implies there are no controls on heap /// memory used by this library during a search. In effect, set it to /// whatever you're willing to allocate for a single regex search. pub fn dfa_size_limit( &mut self, bytes: usize, ) -> &mut RegexSetBuilder { self.builder.dfa_size_limit(bytes); self } /// Set the nesting limit for this parser. /// /// The nesting limit controls how deep the abstract syntax tree is /// allowed to be. If the AST exceeds the given limit (e.g., with too /// many nested groups), then an error is returned by the parser. /// /// The purpose of this limit is to act as a heuristic to prevent stack /// overflow for consumers that do structural induction on an AST using /// explicit recursion. While this crate never does this (instead using /// constant stack space and moving the call stack to the heap), other /// crates may. /// /// This limit is not checked until the entire AST is parsed. /// Therefore, if callers want to put a limit on the amount of heap /// space used, then they should impose a limit on the length, in /// bytes, of the concrete pattern string. In particular, this is /// viable since this parser implementation will limit itself to heap /// space proportional to the length of the pattern string. See also /// the [untrusted inputs](crate#untrusted-input) section in the /// top-level crate documentation for more information about this. /// /// Note that a nest limit of `0` will return a nest limit error for /// most patterns but not all. For example, a nest limit of `0` permits /// `a` but not `ab`, since `ab` requires an explicit concatenation, /// which results in a nest depth of `1`. In general, a nest limit is /// not something that manifests in an obvious way in the concrete /// syntax, therefore, it should not be used in a granular way.
/// /// # Example /// /// ``` /// use regex::bytes::RegexSetBuilder; /// /// assert!(RegexSetBuilder::new([r"a"]).nest_limit(0).build().is_ok()); /// assert!(RegexSetBuilder::new([r"ab"]).nest_limit(0).build().is_err()); /// ``` pub fn nest_limit(&mut self, limit: u32) -> &mut RegexSetBuilder { self.builder.nest_limit(limit); self } } } regex-1.12.2/src/bytes.rs000064400000000000000000000071441046102023000133120ustar 00000000000000/*! Search for regex matches in `&[u8]` haystacks. This module provides a nearly identical API via [`Regex`] to the one found in the top-level of this crate. There are two important differences: 1. Matching is done on `&[u8]` instead of `&str`. Additionally, `Vec<u8>` is used where `String` would have been used in the top-level API. 2. Unicode support can be disabled even when disabling it would result in matching invalid UTF-8 bytes. # Example: match null-terminated strings This shows how to find all null-terminated strings in a slice of bytes. This works even if a C string contains invalid UTF-8. ```rust use regex::bytes::Regex; let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap(); let hay = b"foo\x00qu\xFFux\x00baz\x00"; // Extract all of the strings without the NUL terminator from each match. // The unwrap is OK here since a match requires the `cstr` capture to match. let cstrs: Vec<&[u8]> = re.captures_iter(hay) .map(|c| c.name("cstr").unwrap().as_bytes()) .collect(); assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]); ``` # Example: selectively enable Unicode support This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file): ```rust use regex::bytes::Regex; let re = Regex::new( r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))" ).unwrap(); let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65"; // Notice that despite the `.*` at the end, it will only match valid UTF-8 // because Unicode mode was enabled with the `u` flag. Without the `u` flag, // the `.*` would match the rest of the bytes regardless of whether they were // valid UTF-8. let (_, [title]) = re.captures(hay).unwrap().extract(); assert_eq!(title, b"\xE2\x98\x83"); // We can UTF-8 decode the title now. And the unwrap here // is correct because the existence of a match guarantees // that `title` is valid UTF-8. let title = std::str::from_utf8(title).unwrap(); assert_eq!(title, "โ˜ƒ"); ``` In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is *guaranteed* to be valid UTF-8. # Syntax The supported syntax is pretty much the same as the syntax for Unicode regular expressions with a few changes that make sense for matching arbitrary bytes: 1. The `u` flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the `u` flag is disabled, the regex is said to be in "ASCII compatible" mode. 2. In ASCII compatible mode, Unicode character classes are not allowed. Literal Unicode scalar values outside of character classes are allowed. 3. In ASCII compatible mode, Perl character classes (`\w`, `\d` and `\s`) revert to their typical ASCII definition. `\w` maps to `[[:word:]]`, `\d` maps to `[[:digit:]]` and `\s` maps to `[[:space:]]`. 4. In ASCII compatible mode, word boundaries use the ASCII compatible `\w` to determine whether a byte is a word byte or not. 5. Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints.
For example, in ASCII compatible mode, `\xFF` matches the literal byte `\xFF`, while in Unicode mode, `\xFF` is the Unicode codepoint `U+00FF` that matches its UTF-8 encoding of `\xC3\xBF`. Similarly for octal notation when enabled. 6. In ASCII compatible mode, `.` matches any *byte* except for `\n`. When the `s` flag is additionally enabled, `.` matches any byte. # Performance In general, one should expect performance on `&[u8]` to be roughly similar to performance on `&str`. */ pub use crate::{builders::bytes::*, regex::bytes::*, regexset::bytes::*}; regex-1.12.2/src/error.rs000064400000000000000000000101561046102023000133120ustar 00000000000000use alloc::string::{String, ToString}; use regex_automata::meta; /// An error that occurred during parsing or compiling a regular expression. #[non_exhaustive] #[derive(Clone, PartialEq)] pub enum Error { /// A syntax error. Syntax(String), /// The compiled program exceeded the set size /// limit. The argument is the size limit imposed by /// [`RegexBuilder::size_limit`](crate::RegexBuilder::size_limit). Even /// when not configured explicitly, it defaults to a reasonable limit. /// /// If you're getting this error, it occurred because your regex has been /// compiled to an intermediate state that is too big. It is important to /// note that exceeding this limit does _not_ mean the regex is too big to /// _work_, but rather, the regex is big enough that it may wind up being /// surprisingly slow when used in a search. In other words, this error is /// meant to be a practical heuristic for avoiding a performance footgun, /// and especially so for the case where the regex pattern is coming from /// an untrusted source. /// /// There are generally two ways to move forward if you hit this error. /// The first is to find some way to use a smaller regex. The second is to /// increase the size limit via `RegexBuilder::size_limit`. However, if /// your regex pattern is not from a trusted source, then neither of these /// approaches may be appropriate. Instead, you'll have to determine just /// how big of a regex you want to allow. CompiledTooBig(usize), } impl Error { pub(crate) fn from_meta_build_error(err: meta::BuildError) -> Error { if let Some(size_limit) = err.size_limit() { Error::CompiledTooBig(size_limit) } else if let Some(ref err) = err.syntax_error() { Error::Syntax(err.to_string()) } else { // This is a little suspect. Technically there are more ways for // a meta regex to fail to build other than "exceeded size limit" // and "syntax error." For example, if there are too many states // or even too many patterns. But in practice this is probably // good enough. The worst thing that happens is that Error::Syntax // represents an error that isn't technically a syntax error, but // the actual message will still be shown. So... it's not too bad. // // We really should have made the Error type in the regex crate // completely opaque. Rookie mistake. Error::Syntax(err.to_string()) } } } #[cfg(feature = "std")] impl std::error::Error for Error { // TODO: Remove this method entirely on the next breaking semver release. 
#[allow(deprecated)] fn description(&self) -> &str { match *self { Error::Syntax(ref err) => err, Error::CompiledTooBig(_) => "compiled program too big", } } } impl core::fmt::Display for Error { fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { match *self { Error::Syntax(ref err) => err.fmt(f), Error::CompiledTooBig(limit) => write!( f, "Compiled regex exceeds size limit of {limit} bytes.", ), } } } // We implement our own Debug implementation so that we show nicer syntax // errors when people use `Regex::new(...).unwrap()`. It's a little weird, // but the `Syntax` variant is already storing a `String` anyway, so we might // as well format it nicely. impl core::fmt::Debug for Error { fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { match *self { Error::Syntax(ref err) => { let hr: String = core::iter::repeat('~').take(79).collect(); writeln!(f, "Syntax(")?; writeln!(f, "{hr}")?; writeln!(f, "{err}")?; writeln!(f, "{hr}")?; write!(f, ")")?; Ok(()) } Error::CompiledTooBig(limit) => { f.debug_tuple("CompiledTooBig").field(&limit).finish() } } } } regex-1.12.2/src/find_byte.rs000064400000000000000000000011551046102023000141230ustar 00000000000000/// Searches for the given needle in the given haystack. /// /// If the perf-literal feature is enabled, then this uses the super optimized /// memchr crate. Otherwise, it uses the naive byte-at-a-time implementation. pub(crate) fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> { #[cfg(not(feature = "perf-literal"))] fn imp(needle: u8, haystack: &[u8]) -> Option<usize> { haystack.iter().position(|&b| b == needle) } #[cfg(feature = "perf-literal")] fn imp(needle: u8, haystack: &[u8]) -> Option<usize> { memchr::memchr(needle, haystack) } imp(needle, haystack) } regex-1.12.2/src/lib.rs000064400000000000000000001630141046102023000127310ustar 00000000000000/*! This crate provides routines for searching strings for matches of a [regular expression] (aka "regex"). The regex syntax supported by this crate is similar to other regex engines, but it lacks several features for which no efficient implementation is known. This includes, but is not limited to, look-around and backreferences. In exchange, all regex searches in this crate have worst case `O(m * n)` time complexity, where `m` is proportional to the size of the regex and `n` is proportional to the size of the string being searched. [regular expression]: https://en.wikipedia.org/wiki/Regular_expression If you just want API documentation, then skip to the [`Regex`] type. Otherwise, here's a quick example showing one way of parsing the output of a grep-like program: ```rust use regex::Regex; let re = Regex::new(r"(?m)^([^:]+):([0-9]+):(.+)$").unwrap(); let hay = "\ path/to/foo:54:Blue Harvest path/to/bar:90:Something, Something, Something, Dark Side path/to/baz:3:It's a Trap! "; let mut results = vec![]; for (_, [path, lineno, line]) in re.captures_iter(hay).map(|c| c.extract()) { results.push((path, lineno.parse::<u64>()?, line)); } assert_eq!(results, vec![ ("path/to/foo", 54, "Blue Harvest"), ("path/to/bar", 90, "Something, Something, Something, Dark Side"), ("path/to/baz", 3, "It's a Trap!"), ]); # Ok::<(), Box<dyn std::error::Error>>(()) ``` # Overview The primary type in this crate is a [`Regex`]. Its most important methods are as follows: * [`Regex::new`] compiles a regex using the default configuration. A [`RegexBuilder`] permits setting a non-default configuration. (For example, case insensitive matching, verbose mode and others.)
* [`Regex::is_match`] reports whether a match exists in a particular haystack. * [`Regex::find`] reports the byte offsets of a match in a haystack, if one exists. [`Regex::find_iter`] returns an iterator over all such matches. * [`Regex::captures`] returns a [`Captures`], which reports both the byte offsets of a match in a haystack and the byte offsets of each matching capture group from the regex in the haystack. [`Regex::captures_iter`] returns an iterator over all such matches. There is also a [`RegexSet`], which permits searching for multiple regex patterns simultaneously in a single search. However, it currently only reports which patterns match and *not* the byte offsets of a match. Otherwise, this top-level crate documentation is organized as follows: * [Usage](#usage) shows how to add the `regex` crate to your Rust project. * [Examples](#examples) provides a limited selection of regex search examples. * [Performance](#performance) provides a brief summary of how to optimize regex searching speed. * [Unicode](#unicode) discusses support for non-ASCII patterns. * [Syntax](#syntax) enumerates the specific regex syntax supported by this crate. * [Untrusted input](#untrusted-input) discusses how this crate deals with regex patterns or haystacks that are untrusted. * [Crate features](#crate-features) documents the Cargo features that can be enabled or disabled for this crate. * [Other crates](#other-crates) links to other crates in the `regex` family. # Usage The `regex` crate is [on crates.io](https://crates.io/crates/regex) and can be used by adding `regex` to your dependencies in your project's `Cargo.toml`. Or more simply, just run `cargo add regex`. Here is a complete example that creates a new Rust project, adds a dependency on `regex`, creates the source code for a regex search and then runs the program. First, create the project in a new directory: ```text $ mkdir regex-example $ cd regex-example $ cargo init ``` Second, add a dependency on `regex`: ```text $ cargo add regex ``` Third, edit `src/main.rs`. Delete what's there and replace it with this: ``` use regex::Regex; fn main() { let re = Regex::new(r"Hello (?<name>\w+)!").unwrap(); let Some(caps) = re.captures("Hello Murphy!") else { println!("no match!"); return; }; println!("The name is: {}", &caps["name"]); } ``` Fourth, run it with `cargo run`: ```text $ cargo run Compiling memchr v2.5.0 Compiling regex-syntax v0.7.1 Compiling aho-corasick v1.0.1 Compiling regex v1.8.1 Compiling regex-example v0.1.0 (/tmp/regex-example) Finished dev [unoptimized + debuginfo] target(s) in 4.22s Running `target/debug/regex-example` The name is: Murphy ``` The first run of the program will show more output, like the above, because it compiles the dependencies. Subsequent runs shouldn't have to re-compile them. # Examples This section provides a few examples, in tutorial style, showing how to search a haystack with a regex. There are more examples throughout the API documentation. Before starting though, it's worth defining a few terms: * A **regex** is a Rust value whose type is `Regex`. We use `re` as a variable name for a regex. * A **pattern** is the string that is used to build a regex. We use `pat` as a variable name for a pattern. * A **haystack** is the string that is searched by a regex. We use `hay` as a variable name for a haystack. Sometimes the words "regex" and "pattern" are used interchangeably.
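To tie these three terms together before the worked examples below, here is a minimal sketch (the `\d+` pattern and the variable names are illustrative, not one of the crate's own examples):

```rust
use regex::Regex;

// `pat` is the pattern: the string a regex is built from.
let pat = r"\d+";
// `re` is the regex: the compiled value that performs searches.
let re = Regex::new(pat).unwrap();
// `hay` is the haystack: the string being searched.
let hay = "agent 007";
assert_eq!(re.find(hay).map(|m| m.as_str()), Some("007"));
```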
General use of regular expressions in this crate proceeds by compiling a **pattern** into a **regex**, and then using that regex to search, split or replace parts of a **haystack**. ### Example: find a middle initial We'll start off with a very simple example: a regex that looks for a specific name but uses a wildcard to match a middle initial. Our pattern serves as something like a template that will match a particular name with *any* middle initial. ```rust use regex::Regex; // We use 'unwrap()' here because it would be a bug in our program if the // pattern failed to compile to a regex. Panicking in the presence of a bug // is okay. let re = Regex::new(r"Homer (.)\. Simpson").unwrap(); let hay = "Homer J. Simpson"; let Some(caps) = re.captures(hay) else { return }; assert_eq!("J", &caps[1]); ``` There are a few things worth noticing here in our first example: * The `.` is a special pattern meta character that means "match any single character except for new lines." (More precisely, in this crate, it means "match any UTF-8 encoding of any Unicode scalar value other than `\n`.") * We can match an actual `.` literally by escaping it, i.e., `\.`. * We use Rust's [raw strings] to avoid needing to deal with escape sequences in both the regex pattern syntax and in Rust's string literal syntax. If we didn't use raw strings here, we would have had to use `\\.` to match a literal `.` character. That is, `r"\."` and `"\\."` are equivalent patterns. * We put our wildcard `.` instruction in parentheses. These parentheses have a special meaning that says, "make whatever part of the haystack matches within these parentheses available as a capturing group." After finding a match, we access this capture group with `&caps[1]`. [raw strings]: https://doc.rust-lang.org/stable/reference/tokens.html#raw-string-literals Otherwise, we execute a search using `re.captures(hay)` and return from our function if no match occurred. We then reference the middle initial by asking for the part of the haystack that matched the capture group indexed at `1`. (The capture group at index 0 is implicit and always corresponds to the entire match. In this case, that's `Homer J. Simpson`.) ### Example: named capture groups Continuing from our middle initial example above, we can tweak the pattern slightly to give a name to the group that matches the middle initial: ```rust use regex::Regex; // Note that (?P<middle>.) is a different way to spell the same thing. let re = Regex::new(r"Homer (?<middle>.)\. Simpson").unwrap(); let hay = "Homer J. Simpson"; let Some(caps) = re.captures(hay) else { return }; assert_eq!("J", &caps["middle"]); ``` Giving a name to a group can be useful when there are multiple groups in a pattern. It makes the code referring to those groups a bit easier to understand. ### Example: validating a particular date format This example shows how to confirm whether a haystack, in its entirety, matches a particular date format: ```rust use regex::Regex; let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); assert!(re.is_match("2010-03-14")); ``` Notice the use of the `^` and `$` anchors. In this crate, every regex search is run with an implicit `(?s:.)*?` at the beginning of its pattern, which allows the regex to match anywhere in a haystack. Anchors, as above, can be used to ensure that the full haystack matches a pattern. This crate is also Unicode aware by default, which means that `\d` might match more than you might expect it to.
For example: ```rust use regex::Regex; let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap(); assert!(re.is_match("๐Ÿš๐Ÿ˜๐Ÿ™๐Ÿ˜-๐Ÿ˜๐Ÿ›-๐Ÿ™๐Ÿœ")); ``` To only match an ASCII decimal digit, all of the following are equivalent: * `[0-9]` * `(?-u:\d)` * `[[:digit:]]` * `[\d&&\p{ascii}]` ### Example: finding dates in a haystack In the previous example, we showed how one might validate that a haystack, in its entirety, corresponded to a particular date format. But what if we wanted to extract all things that look like dates in a specific format from a haystack? To do this, we can use an iterator API to find all matches (notice that we've removed the anchors and switched to looking for ASCII-only digits): ```rust use regex::Regex; let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap(); let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?"; // 'm' is a 'Match', and 'as_str()' returns the matching part of the haystack. let dates: Vec<&str> = re.find_iter(hay).map(|m| m.as_str()).collect(); assert_eq!(dates, vec![ "1865-04-14", "1881-07-02", "1901-09-06", "1963-11-22", ]); ``` We can also iterate over [`Captures`] values instead of [`Match`] values, and that in turn permits accessing each component of the date via capturing groups: ```rust use regex::Regex; let re = Regex::new(r"(?<y>[0-9]{4})-(?<m>[0-9]{2})-(?<d>[0-9]{2})").unwrap(); let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?"; let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { // The unwraps are okay because every capture group must match if the whole // regex matches, and in this context, we know we have a match. // // Note that we use `caps.name("y").unwrap().as_str()` instead of // `&caps["y"]` because the lifetime of the former is the same as the // lifetime of `hay` above, but the lifetime of the latter is tied to the // lifetime of `caps` due to how the `Index` trait is defined. let year = caps.name("y").unwrap().as_str(); let month = caps.name("m").unwrap().as_str(); let day = caps.name("d").unwrap().as_str(); (year, month, day) }).collect(); assert_eq!(dates, vec![ ("1865", "04", "14"), ("1881", "07", "02"), ("1901", "09", "06"), ("1963", "11", "22"), ]); ``` ### Example: simpler capture group extraction One can use [`Captures::extract`] to make the code from the previous example a bit simpler in this case: ```rust use regex::Regex; let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap(); let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?"; let dates: Vec<(&str, &str, &str)> = re.captures_iter(hay).map(|caps| { let (_, [year, month, day]) = caps.extract(); (year, month, day) }).collect(); assert_eq!(dates, vec![ ("1865", "04", "14"), ("1881", "07", "02"), ("1901", "09", "06"), ("1963", "11", "22"), ]); ``` `Captures::extract` works by ensuring that the number of matching groups match the number of groups requested via the `[year, month, day]` syntax. If they do, then the substrings for each corresponding capture group are automatically returned in an appropriately sized array. Rust's syntax for pattern matching arrays does the rest. ### Example: replacement with named capture groups Building on the previous example, perhaps we'd like to rearrange the date formats. This can be done by finding each match and replacing it with something different.
The [`Regex::replace_all`] routine provides a convenient way to do this, including by supporting references to named groups in the replacement string:

```rust
use regex::Regex;

let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})").unwrap();
let before = "1973-01-05, 1975-08-25 and 1980-10-18";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980");
```

The replace methods are actually polymorphic in the replacement, which provides more flexibility than is seen here. (See the documentation for [`Regex::replace`] for more details.)

### Example: verbose mode

When your regex gets complicated, you might consider using something other than regex. But if you stick with regex, you can use the `x` flag to enable insignificant whitespace mode or "verbose mode." In this mode, whitespace is treated as insignificant and one may write comments. This may make your patterns easier to comprehend.

```rust
use regex::Regex;

let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year, including all Unicode digits
  -
  (?P<m>\d{2}) # the month, including all Unicode digits
  -
  (?P<d>\d{2}) # the day, including all Unicode digits
").unwrap();

let before = "1973-01-05, 1975-08-25 and 1980-10-18";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980");
```

If you wish to match against whitespace in this mode, you can still use `\s`, `\n`, `\t`, etc. For escaping a single space character, you can escape it directly with `\ `, use its hex character code `\x20` or temporarily disable the `x` flag, e.g., `(?-x: )`.

### Example: match multiple regular expressions simultaneously

This demonstrates how to use a [`RegexSet`] to match multiple (possibly overlapping) regexes in a single scan of a haystack:

```rust
use regex::RegexSet;

let set = RegexSet::new(&[
    r"\w+",
    r"\d+",
    r"\pL+",
    r"foo",
    r"bar",
    r"barfoo",
    r"foobar",
]).unwrap();

// Iterate over and collect all of the matches. Each match corresponds to the
// ID of the matching pattern.
let matches: Vec<_> = set.matches("foobar").into_iter().collect();
assert_eq!(matches, vec![0, 2, 3, 4, 6]);

// You can also test whether a particular regex matched:
let matches = set.matches("foobar");
assert!(!matches.matched(5));
assert!(matches.matched(6));
```

# Performance

This section briefly discusses a few concerns regarding the speed and resource usage of regexes.

### Only ask for what you need

When running a search with a regex, there are generally three different types of information one can ask for:

1. Does a regex match in a haystack?
2. Where does a regex match in a haystack?
3. Where do each of the capturing groups match in a haystack?

Generally speaking, this crate could provide a function to answer only #3, which would subsume #1 and #2 automatically. However, it can be significantly more expensive to compute the location of capturing group matches, so it's best not to do it if you don't need to.

Therefore, only ask for what you need. For example, don't use [`Regex::find`] if you only need to test if a regex matches a haystack. Use [`Regex::is_match`] instead.

### Unicode can impact memory usage and search speed

This crate has first class support for Unicode and it is **enabled by default**. In many cases, the extra memory required to support it will be negligible and it typically won't impact search speed. But it can in some cases. With respect to memory usage, the impact of Unicode principally manifests through the use of Unicode character classes.
Unicode character classes tend to be quite large. For example, `\w` by default matches around 140,000 distinct codepoints. This requires additional memory, and tends to slow down regex compilation. While a `\w` here and there is unlikely to be noticed, writing `\w{100}`, for example, will result in quite a large regex by default. Indeed, `\w` is considerably larger than its ASCII-only version, so if your requirements are satisfied by ASCII, it's probably a good idea to stick to ASCII classes. The ASCII-only version of `\w` can be spelled in a number of ways. All of the following are equivalent:

* `[0-9A-Za-z_]`
* `(?-u:\w)`
* `[[:word:]]`
* `[\w&&\p{ascii}]`

With respect to search speed, Unicode tends to be handled pretty well, even when using large Unicode character classes. However, some of the faster internal regex engines cannot handle a Unicode aware word boundary assertion. So if you don't need Unicode-aware word boundary assertions, you might consider using `(?-u:\b)` instead of `\b`, where the former uses an ASCII-only definition of a word character.

### Literals might accelerate searches

This crate tends to be quite good at recognizing literals in a regex pattern and using them to accelerate a search. If it is at all possible to include some kind of literal in your pattern, then it might make search substantially faster. For example, in the regex `\w+@\w+`, the engine will look for occurrences of `@` and then try a reverse match for `\w+` to find the start position.

### Avoid re-compiling regexes, especially in a loop

It is an anti-pattern to compile the same pattern in a loop since regex compilation is typically expensive. (It takes anywhere from a few microseconds to a few **milliseconds** depending on the size of the pattern.) Not only is compilation itself expensive, but this also prevents optimizations that reuse allocations internally to the regex engine.

In Rust, it can sometimes be a pain to pass regular expressions around if they're used from inside a helper function. Instead, we recommend using [`std::sync::LazyLock`], or the [`once_cell`] crate, if you can't use the standard library. This example shows how to use `std::sync::LazyLock`:

```rust
use std::sync::LazyLock;

use regex::Regex;

fn some_helper_function(haystack: &str) -> bool {
    static RE: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"...").unwrap());
    RE.is_match(haystack)
}

fn main() {
    assert!(some_helper_function("abc"));
    assert!(!some_helper_function("ac"));
}
```

Specifically, in this example, the regex will be compiled when it is used for the first time. On subsequent uses, it will reuse the previously built `Regex`. Notice how one can define the `Regex` locally to a specific function.

[`std::sync::LazyLock`]: https://doc.rust-lang.org/std/sync/struct.LazyLock.html
[`once_cell`]: https://crates.io/crates/once_cell

### Sharing a regex across threads can result in contention

While a single `Regex` can be freely used from multiple threads simultaneously, there is a small synchronization cost that must be paid. Generally speaking, one shouldn't expect to observe this unless the principal task in each thread is searching with the regex *and* most searches are on short haystacks. In this case, internal contention on shared resources can spike and increase latency, which in turn may slow down each individual search. One can work around this by cloning each `Regex` before sending it to another thread.
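For example, here is a minimal sketch of that workaround; the particular pattern, haystack and thread count are arbitrary choices for illustration:

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"\w+").unwrap();
    let handles: Vec<_> = (0..4)
        .map(|_| {
            // Clone before moving into the thread. This is cheap: the
            // read-only compiled program is shared by reference counting,
            // while each clone gets its own fast path to the mutable
            // scratch space used during a search.
            let re = re.clone();
            std::thread::spawn(move || re.is_match("some haystack"))
        })
        .collect();
    for handle in handles {
        assert!(handle.join().unwrap());
    }
}
```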
The cloned regexes will still share the same internal read-only portion of their compiled state (it's reference counted), but each thread will get optimized access to the mutable space that is used to run a search. In general, there is no additional cost in memory to doing this. The only cost is the added code complexity required to explicitly clone the regex. (If you share the same `Regex` across multiple threads, each thread still gets its own mutable space, but accessing that space is slower.)

# Unicode

This section discusses what kind of Unicode support this regex library has. Before showing some examples, we'll summarize the relevant points:

* This crate almost fully implements "Basic Unicode Support" (Level 1) as specified by the [Unicode Technical Standard #18][UTS18]. The full details of what is supported are documented in [UNICODE.md] in the root of the regex crate repository. There is virtually no support for "Extended Unicode Support" (Level 2) from UTS#18.
* The top-level [`Regex`] runs searches *as if* iterating over each of the codepoints in the haystack. That is, the fundamental atom of matching is a single codepoint.
* [`bytes::Regex`], in contrast, permits disabling Unicode mode for part or all of your pattern in all cases. When Unicode mode is disabled, then a search is run *as if* iterating over each byte in the haystack. That is, the fundamental atom of matching is a single byte. (A top-level `Regex` also permits disabling Unicode and thus matching *as if* it were one byte at a time, but only when doing so wouldn't permit matching invalid UTF-8.)
* When Unicode mode is enabled (the default), `.` will match an entire Unicode scalar value, even when it is encoded using multiple bytes. When Unicode mode is disabled (e.g., `(?-u:.)`), then `.` will match a single byte in all cases.
* The character classes `\w`, `\d` and `\s` are all Unicode-aware by default. Use `(?-u:\w)`, `(?-u:\d)` and `(?-u:\s)` to get their ASCII-only definitions.
* Similarly, `\b` and `\B` use a Unicode definition of a "word" character. To get ASCII-only word boundaries, use `(?-u:\b)` and `(?-u:\B)`. This also applies to the special word boundary assertions. (That is, `\b{start}`, `\b{end}`, `\b{start-half}`, `\b{end-half}`.)
* `^` and `$` are **not** Unicode-aware in multi-line mode. Namely, they only recognize `\n` (assuming CRLF mode is not enabled) and not any of the other forms of line terminators defined by Unicode.
* Case insensitive searching is Unicode-aware and uses simple case folding.
* Unicode general categories, scripts and many boolean properties are available by default via the `\p{property name}` syntax.
* In all cases, matches are reported using byte offsets. Or more precisely, UTF-8 code unit offsets. This permits constant time indexing and slicing of the haystack.

[UTS18]: https://unicode.org/reports/tr18/
[UNICODE.md]: https://github.com/rust-lang/regex/blob/master/UNICODE.md

Patterns themselves are **only** interpreted as a sequence of Unicode scalar values. This means you can use Unicode characters directly in your pattern:

```rust
use regex::Regex;

let re = Regex::new(r"(?i)ฮ”+").unwrap();
let m = re.find("ฮ”ฮดฮ”").unwrap();
assert_eq!((0, 6), (m.start(), m.end()));
// alternatively:
assert_eq!(0..6, m.range());
```

As noted above, Unicode general categories, scripts, script extensions, ages and a smattering of boolean properties are available as character classes.
For example, you can match a sequence of numerals, Greek or Cherokee letters:

```rust
use regex::Regex;

let re = Regex::new(r"[\pN\p{Greek}\p{Cherokee}]+").unwrap();
let m = re.find("abcฮ”แŽ ฮฒโ… แดฮณฮดโ…กxyz").unwrap();
assert_eq!(3..23, m.range());
```

While not specific to Unicode, this library also supports character class set operations. Namely, one can nest character classes arbitrarily and perform set operations on them. Those set operations are union (the default), intersection, difference and symmetric difference. These set operations tend to be most useful with Unicode character classes. For example, to match any codepoint that is both in the `Greek` script and in the `Letter` general category:

```rust
use regex::Regex;

let re = Regex::new(r"[\p{Greek}&&\pL]+").unwrap();
let subs: Vec<&str> = re.find_iter("ฮ”ฮดฮ”๐…Œฮ”ฮดฮ”").map(|m| m.as_str()).collect();
assert_eq!(subs, vec!["ฮ”ฮดฮ”", "ฮ”ฮดฮ”"]);

// If we just matched on Greek, then all codepoints would match!
let re = Regex::new(r"\p{Greek}+").unwrap();
let subs: Vec<&str> = re.find_iter("ฮ”ฮดฮ”๐…Œฮ”ฮดฮ”").map(|m| m.as_str()).collect();
assert_eq!(subs, vec!["ฮ”ฮดฮ”๐…Œฮ”ฮดฮ”"]);
```

### Opt out of Unicode support

The [`bytes::Regex`] type can be used to search `&[u8]` haystacks. By default, haystacks are conventionally treated as UTF-8 just like it is with the main `Regex` type. However, this behavior can be disabled by turning off the `u` flag, even if doing so could result in matching invalid UTF-8. For example, when the `u` flag is disabled, `.` will match any byte instead of any Unicode scalar value.

Disabling the `u` flag is also possible with the standard `&str`-based `Regex` type, but it is only allowed where the UTF-8 invariant is maintained. For example, `(?-u:\w)` is an ASCII-only `\w` character class and is legal in an `&str`-based `Regex`, but `(?-u:\W)` will attempt to match *any byte* that isn't in `(?-u:\w)`, which in turn includes bytes that are invalid UTF-8. Similarly, `(?-u:\xFF)` will attempt to match the raw byte `\xFF` (instead of `U+00FF`), which is invalid UTF-8 and therefore is illegal in `&str`-based regexes.

Finally, since Unicode support requires bundling large Unicode data tables, this crate exposes knobs to disable the compilation of those data tables, which can be useful for shrinking binary size and reducing compilation times. For details on how to do that, see the section on [crate features](#crate-features).

# Syntax

The syntax supported in this crate is documented below.

Note that the regular expression parser and abstract syntax are exposed in a separate crate, [`regex-syntax`](https://docs.rs/regex-syntax).

### Matching one character
```text
.             any character except new line (includes new line with s flag)
[0-9]         any ASCII digit
\d            digit (\p{Nd})
\D            not digit
\pX           Unicode character class identified by a one-letter name
\p{Greek}     Unicode character class (general category or script)
\PX           Negated Unicode character class identified by a one-letter name
\P{Greek}     negated Unicode character class (general category or script)
```
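For example, a brief illustrative sketch of the one-letter class name syntax (the patterns and haystacks here are our own):

```rust
use regex::Regex;

// \pL is shorthand for \p{L}, the Letter general category.
let re = Regex::new(r"^\pL+$").unwrap();
assert!(re.is_match("ฮดฮตฮปฯ„ฮฑ"));
// Digits are in the Number category, not Letter.
assert!(!re.is_match("123"));
```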
### Character classes
```text
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[x[^xyz]]     Nested/grouping character class (matching any character except y and z)
[a-y&&xyz]    Intersection (matching x or y)
[0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
[0-9--4]      Direct subtraction (matching 0-9 except 4)
[a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
[\[\]]        Escaping in character classes (matching [ or ])
[a&&b]        An empty character class matching nothing
```
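For example, here is a quick check of the subtraction and symmetric difference forms from the table above (an illustrative sketch with arbitrary haystacks):

```rust
use regex::Regex;

// [0-9--4] matches any ASCII digit except '4'.
let re = Regex::new(r"^[0-9--4]+$").unwrap();
assert!(re.is_match("0123"));
assert!(!re.is_match("34"));

// [a-g~~b-h] is a symmetric difference: only 'a' and 'h' are members.
let re = Regex::new(r"^[a-g~~b-h]+$").unwrap();
assert!(re.is_match("ah"));
assert!(!re.is_match("b"));
```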
Any named character class may appear inside a bracketed `[...]` character class. For example, `[\p{Greek}[:digit:]]` matches any ASCII digit or any codepoint in the `Greek` script. `[\p{Greek}&&\pL]` matches Greek letters. Precedence in character classes, from most binding to least: 1. Ranges: `[a-cd]` == `[[a-c]d]` 2. Union: `[ab&&bc]` == `[[ab]&&[bc]]` 3. Intersection, difference, symmetric difference. All three have equivalent precedence, and are evaluated in left-to-right order. For example, `[\pL--\p{Greek}&&\p{Uppercase}]` == `[[\pL--\p{Greek}]&&\p{Uppercase}]`. 4. Negation: `[^a-z&&b]` == `[^[a-z&&b]]`. ### Composites
```text
xy    concatenation (x followed by y)
x|y   alternation (x or y, prefer x)
```
This example shows how an alternation works, and what it means to prefer a branch in the alternation over subsequent branches. ``` use regex::Regex; let haystack = "samwise"; // If 'samwise' comes first in our alternation, then it is // preferred as a match, even if the regex engine could // technically detect that 'sam' led to a match earlier. let re = Regex::new(r"samwise|sam").unwrap(); assert_eq!("samwise", re.find(haystack).unwrap().as_str()); // But if 'sam' comes first, then it will match instead. // In this case, it is impossible for 'samwise' to match // because 'sam' is a prefix of it. let re = Regex::new(r"sam|samwise").unwrap(); assert_eq!("sam", re.find(haystack).unwrap().as_str()); ``` ### Repetitions
```text
x*        zero or more of x (greedy)
x+        one or more of x (greedy)
x?        zero or one of x (greedy)
x*?       zero or more of x (ungreedy/lazy)
x+?       one or more of x (ungreedy/lazy)
x??       zero or one of x (ungreedy/lazy)
x{n,m}    at least n x and at most m x (greedy)
x{n,}     at least n x (greedy)
x{n}      exactly n x
x{n,m}?   at least n x and at most m x (ungreedy/lazy)
x{n,}?    at least n x (ungreedy/lazy)
x{n}?     exactly n x
```
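The difference between greedy and lazy repetition is easiest to see with a short example (an illustrative sketch):

```rust
use regex::Regex;

// Greedy: match as many 'a's as possible.
let re = Regex::new(r"a+").unwrap();
assert_eq!("aaa", re.find("aaa").unwrap().as_str());

// Lazy: match as few 'a's as possible while still matching.
let re = Regex::new(r"a+?").unwrap();
assert_eq!("a", re.find("aaa").unwrap().as_str());
```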
### Empty matches
```text
^               the beginning of a haystack (or start-of-line with multi-line mode)
$               the end of a haystack (or end-of-line with multi-line mode)
\A              only the beginning of a haystack (even with multi-line mode enabled)
\z              only the end of a haystack (even with multi-line mode enabled)
\b              a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B              not a Unicode word boundary
\b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
\b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right)
\b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
\b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)
```
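For example, the start-of-word assertion distinguishes a word at a word border from one embedded in another word (an illustrative sketch):

```rust
use regex::Regex;

// \< (equivalently, \b{start}) only matches at the start of a word.
let re = Regex::new(r"\<cat").unwrap();
assert!(re.is_match("a cat meows"));
assert!(!re.is_match("tomcat"));
```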
The empty regex is valid and matches the empty string. For example, the empty regex matches `abc` at positions `0`, `1`, `2` and `3`. When using the top-level [`Regex`] on `&str` haystacks, an empty match that splits a codepoint is guaranteed to never be returned. However, such matches are permitted when using a [`bytes::Regex`]. For example: ```rust let re = regex::Regex::new(r"").unwrap(); let ranges: Vec<_> = re.find_iter("๐Ÿ’ฉ").map(|m| m.range()).collect(); assert_eq!(ranges, vec![0..0, 4..4]); let re = regex::bytes::Regex::new(r"").unwrap(); let ranges: Vec<_> = re.find_iter("๐Ÿ’ฉ".as_bytes()).map(|m| m.range()).collect(); assert_eq!(ranges, vec![0..0, 1..1, 2..2, 3..3, 4..4]); ``` Note that an empty regex is distinct from a regex that can never match. For example, the regex `[a&&b]` is a character class that represents the intersection of `a` and `b`. That intersection is empty, which means the character class is empty. Since nothing is in the empty set, `[a&&b]` matches nothing, not even the empty string. ### Grouping and flags
```text
(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (names must be alpha-numeric)
(?<name>exp)   named (also numbered) capture group (names must be alpha-numeric)
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)
```
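For example, `(?flags:exp)` scopes flags to just the sub-expression inside the group (an illustrative sketch):

```rust
use regex::Regex;

// The 'i' flag applies only inside the non-capturing group.
let re = Regex::new(r"(?i:foo)bar").unwrap();
assert!(re.is_match("FOObar"));
assert!(!re.is_match("FOOBAR"));
```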
Capture group names must be any sequence of alpha-numeric Unicode codepoints, in addition to `.`, `_`, `[` and `]`. Names must start with either an `_` or an alphabetic codepoint. Alphabetic codepoints correspond to the `Alphabetic` Unicode property, while numeric codepoints correspond to the union of the `Decimal_Number`, `Letter_Number` and `Other_Number` general categories. Flags are each a single character. For example, `(?x)` sets the flag `x` and `(?-x)` clears the flag `x`. Multiple flags can be set or cleared at the same time: `(?xy)` sets both the `x` and `y` flags and `(?x-y)` sets the `x` flag and clears the `y` flag. All flags are by default disabled unless stated otherwise. They are:
```text
i     case-insensitive: letters match both upper and lower case
m     multi-line mode: ^ and $ match begin/end of line
s     allow . to match \n
R     enables CRLF mode: when multi-line mode is enabled, \r\n is used
U     swap the meaning of x* and x*?
u     Unicode support (enabled by default)
x     verbose mode, ignores whitespace and allows line comments (starting with `#`)
```
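For example, the `U` flag swaps greedy and lazy repetition (an illustrative sketch):

```rust
use regex::Regex;

// With the 'U' flag, 'a+' becomes lazy (and 'a+?' becomes greedy).
let re = Regex::new(r"(?U)a+").unwrap();
assert_eq!("a", re.find("aaa").unwrap().as_str());
```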
Note that in verbose mode, whitespace is ignored everywhere, including within character classes. To insert whitespace, use its escaped form or a hex literal. For example, `\ ` or `\x20` for an ASCII space. Flags can be toggled within a pattern. Here's an example that matches case-insensitively for the first part but case-sensitively for the second part: ```rust use regex::Regex; let re = Regex::new(r"(?i)a+(?-i)b+").unwrap(); let m = re.find("AaAaAbbBBBb").unwrap(); assert_eq!(m.as_str(), "AaAaAbb"); ``` Notice that the `a+` matches either `a` or `A`, but the `b+` only matches `b`. Multi-line mode means `^` and `$` no longer match just at the beginning/end of the input, but also at the beginning/end of lines: ``` use regex::Regex; let re = Regex::new(r"(?m)^line \d+").unwrap(); let m = re.find("line one\nline 2\n").unwrap(); assert_eq!(m.as_str(), "line 2"); ``` Note that `^` matches after new lines, even at the end of input: ``` use regex::Regex; let re = Regex::new(r"(?m)^").unwrap(); let m = re.find_iter("test\n").last().unwrap(); assert_eq!((m.start(), m.end()), (5, 5)); ``` When both CRLF mode and multi-line mode are enabled, then `^` and `$` will match either `\r` or `\n`, but never in the middle of a `\r\n`: ``` use regex::Regex; let re = Regex::new(r"(?mR)^foo$").unwrap(); let m = re.find("\r\nfoo\r\n").unwrap(); assert_eq!(m.as_str(), "foo"); ``` Unicode mode can also be selectively disabled, although only when the result *would not* match invalid UTF-8. One good example of this is using an ASCII word boundary instead of a Unicode word boundary, which might make some regex searches run faster: ```rust use regex::Regex; let re = Regex::new(r"(?-u:\b).+(?-u:\b)").unwrap(); let m = re.find("$$abc$$").unwrap(); assert_eq!(m.as_str(), "abc"); ``` ### Escape sequences Note that this includes all possible escape sequences, even ones that are documented elsewhere.
```text
\*              literal *, applies to all ASCII except [0-9A-Za-z<>]
\a              bell (\x07)
\f              form feed (\x0C)
\t              horizontal tab
\n              new line
\r              carriage return
\v              vertical tab (\x0B)
\A              matches at the beginning of a haystack
\z              matches at the end of a haystack
\b              word boundary assertion
\B              negated word boundary assertion
\b{start}, \<   start-of-word boundary assertion
\b{end}, \>     end-of-word boundary assertion
\b{start-half}  half of a start-of-word boundary assertion
\b{end-half}    half of an end-of-word boundary assertion
\123            octal character code, up to three digits (when enabled)
\x7F            hex character code (exactly two digits)
\x{10FFFF}      any hex character code corresponding to a Unicode code point
\u007F          hex character code (exactly four digits)
\u{7F}          any hex character code corresponding to a Unicode code point
\U0000007F      hex character code (exactly eight digits)
\U{7F}          any hex character code corresponding to a Unicode code point
\p{Letter}      Unicode character class
\P{Letter}      negated Unicode character class
\d, \s, \w      Perl character class
\D, \S, \W      negated Perl character class
```
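For example, escapes can both strip a meta character of its special meaning and spell out characters by their code point (an illustrative sketch):

```rust
use regex::Regex;

// Escaping a meta character matches it literally.
let re = Regex::new(r"\*").unwrap();
assert!(re.is_match("2 * 3"));

// Braced hex escapes can spell any Unicode scalar value.
let re = Regex::new(r"\x{1F4A9}").unwrap();
assert!(re.is_match("๐Ÿ’ฉ"));
```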
### Perl character classes (Unicode friendly) These classes are based on the definitions provided in [UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):
```text
\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character
```
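For example, the Unicode-aware `\s` matches more than ASCII whitespace (an illustrative sketch using EM SPACE, U+2003):

```rust
use regex::Regex;

// \s is Unicode-aware by default...
let re = Regex::new(r"^\s$").unwrap();
assert!(re.is_match("\u{2003}"));

// ...while its ASCII-only form is not.
let re = Regex::new(r"^(?-u:\s)$").unwrap();
assert!(!re.is_match("\u{2003}"));
```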
### ASCII character classes These classes are based on the definitions provided in [UTS#18](https://www.unicode.org/reports/tr18/#Compatibility_Properties):
```text
[[:alnum:]]    alphanumeric ([0-9A-Za-z])
[[:alpha:]]    alphabetic ([A-Za-z])
[[:ascii:]]    ASCII ([\x00-\x7F])
[[:blank:]]    blank ([\t ])
[[:cntrl:]]    control ([\x00-\x1F\x7F])
[[:digit:]]    digits ([0-9])
[[:graph:]]    graphical ([!-~])
[[:lower:]]    lower case ([a-z])
[[:print:]]    printable ([ -~])
[[:punct:]]    punctuation ([!-/:-@\[-`{-~])
[[:space:]]    whitespace ([\t\n\v\f\r ])
[[:upper:]]    upper case ([A-Z])
[[:word:]]     word characters ([0-9A-Za-z_])
[[:xdigit:]]   hex digit ([0-9A-Fa-f])
```
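For example, a quick check of one of the classes from the table above (an illustrative sketch):

```rust
use regex::Regex;

// [[:xdigit:]] is equivalent to [0-9A-Fa-f].
let re = Regex::new(r"^[[:xdigit:]]+$").unwrap();
assert!(re.is_match("DEADbeef123"));
// 'x' is not itself a hex digit.
assert!(!re.is_match("0xFF"));
```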
# Untrusted input This crate is meant to be able to run regex searches on untrusted haystacks without fear of [ReDoS]. This crate also, to a certain extent, supports untrusted patterns. [ReDoS]: https://en.wikipedia.org/wiki/ReDoS This crate differs from most (but not all) other regex engines in that it doesn't use unbounded backtracking to run a regex search. In those cases, one generally cannot use untrusted patterns *or* untrusted haystacks because it can be very difficult to know whether a particular pattern will result in catastrophic backtracking or not. We'll first discuss how this crate deals with untrusted inputs and then wrap it up with a realistic discussion about what practice really looks like. ### Panics Outside of clearly documented cases, most APIs in this crate are intended to never panic regardless of the inputs given to them. For example, `Regex::new`, `Regex::is_match`, `Regex::find` and `Regex::captures` should never panic. That is, it is an API promise that those APIs will never panic no matter what inputs are given to them. With that said, regex engines are complicated beasts, and providing a rock solid guarantee that these APIs literally never panic is essentially equivalent to saying, "there are no bugs in this library." That is a bold claim, and not really one that can be feasibly made with a straight face. Don't get the wrong impression here. This crate is extensively tested, not just with unit and integration tests, but also via fuzz testing. For example, this crate is part of the [OSS-fuzz project]. Panics should be incredibly rare, but it is possible for bugs to exist, and thus possible for a panic to occur. If you need a rock solid guarantee against panics, then you should wrap calls into this library with [`std::panic::catch_unwind`]. It's also worth pointing out that this library will *generally* panic when other regex engines would commit undefined behavior. When undefined behavior occurs, your program might continue as if nothing bad has happened, but it also might mean your program is open to the worst kinds of exploits. In contrast, the worst thing a panic can do is a denial of service. [OSS-fuzz project]: https://android.googlesource.com/platform/external/oss-fuzz/+/refs/tags/android-t-preview-1/projects/rust-regex/ [`std::panic::catch_unwind`]: https://doc.rust-lang.org/std/panic/fn.catch_unwind.html ### Untrusted patterns The principal way this crate deals with them is by limiting their size by default. The size limit can be configured via [`RegexBuilder::size_limit`]. The idea of a size limit is that compiling a pattern into a `Regex` will fail if it becomes "too big." Namely, while *most* resources consumed by compiling a regex are approximately proportional (albeit with some high constant factors in some cases, such as with Unicode character classes) to the length of the pattern itself, there is one particular exception to this: counted repetitions. Namely, this pattern: ```text a{5}{5}{5}{5}{5}{5} ``` Is equivalent to this pattern: ```text a{15625} ``` In both of these cases, the actual pattern string is quite small, but the resulting `Regex` value is quite large. Indeed, as the first pattern shows, it isn't enough to locally limit the size of each repetition because they can be stacked in a way that results in exponential growth. To provide a bit more context, a simplified view of regex compilation looks like this: * The pattern string is parsed into a structured representation called an AST. 
Counted repetitions are not expanded and Unicode character classes are not looked up in this stage. That is, the size of the AST is proportional to the size of the pattern with "reasonable" constant factors. In other words, one can reasonably limit the memory used by an AST by limiting the length of the pattern string.
* The AST is translated into an HIR. Counted repetitions are still *not* expanded at this stage, but Unicode character classes are embedded into the HIR. The memory usage of an HIR is still proportional to the length of the original pattern string, but the constant factors---mostly as a result of Unicode character classes---can be quite high. Still though, the memory used by an HIR can be reasonably limited by limiting the length of the pattern string.
* The HIR is compiled into a [Thompson NFA]. This is the stage at which something like `\w{5}` is rewritten to `\w\w\w\w\w`. Thus, this is the stage at which [`RegexBuilder::size_limit`] is enforced. If the NFA exceeds the configured size, then this stage will fail.

[Thompson NFA]: https://en.wikipedia.org/wiki/Thompson%27s_construction

The size limit helps avoid two different kinds of exorbitant resource usage:

* It avoids permitting exponential memory usage based on the size of the pattern string.
* It avoids long search times. This will be discussed in more detail in the next section, but worst case search time *is* dependent on the size of the regex. So keeping regexes limited to a reasonable size is also a way of keeping search times reasonable.

Finally, it's worth pointing out that regex compilation is guaranteed to take worst case `O(m)` time, where `m` is proportional to the size of the regex. The size of the regex here is *after* the counted repetitions have been expanded.

**Advice for those using untrusted regexes**: limit the pattern length to something small and configure [`RegexBuilder::size_limit`] to something small as well, expanding both only as needed.

### Untrusted haystacks

The main way this crate guards against searches taking a long time is by using algorithms that guarantee a `O(m * n)` worst case time and space bound. Namely:

* `m` is proportional to the size of the regex, where the size of the regex includes the expansion of all counted repetitions. (See the previous section on untrusted patterns.)
* `n` is proportional to the length, in bytes, of the haystack.

In other words, if you consider `m` to be a constant (for example, the regex pattern is a literal in the source code), then the search can be said to run in "linear time." Or equivalently, "linear time with respect to the size of the haystack."

But the `m` factor here is important not to ignore. If a regex is particularly big, the search times can get quite slow. This is why, in part, [`RegexBuilder::size_limit`] exists.

**Advice for those searching untrusted haystacks**: As long as your regexes are not enormous, you should expect to be able to search untrusted haystacks without fear. If you aren't sure, you should benchmark it. Unlike backtracking engines, if your regex is so big that it's likely to result in slow searches, this is probably something you'll be able to observe regardless of what the haystack is made up of.

### Iterating over matches

One thing that is perhaps easy to miss is that the worst case time complexity bound of `O(m * n)` applies to methods like [`Regex::is_match`], [`Regex::find`] and [`Regex::captures`]. It does **not** apply to [`Regex::find_iter`] or [`Regex::captures_iter`].
Namely, since iterating over all matches can execute many searches, and each search can scan the entire haystack, the worst case time complexity for iterators is `O(m * n^2)`. One example of where this occurs is when a pattern consists of an alternation, where an earlier branch of the alternation requires scanning the entire haystack only to discover that there is no match. It also requires a later branch of the alternation to have matched at the beginning of the search. For example, consider the pattern `.*[^A-Z]|[A-Z]` and the haystack `AAAAA`. The first search will scan to the end looking for matches of `.*[^A-Z]` even though a finite automata engine (as in this crate) knows that `[A-Z]` has already matched the first character of the haystack. This is due to the greedy nature of regex searching. That first search will report a match at the first `A` only after scanning to the end to discover that no other match exists. The next search then begins at the second `A` and the behavior repeats. There is no way to avoid this. This means that if both patterns and haystacks are untrusted and you're iterating over all matches, you're susceptible to worst case quadratic time complexity. One possible way to mitigate this is to drop down to the lower level `regex-automata` crate and use its `meta::Regex` iterator APIs. There, you can configure the search to operate in "earliest" mode by passing a `Input::new(haystack).earliest(true)` to `meta::Regex::find_iter` (for example). By enabling this mode, you give up the normal greedy match semantics of regex searches and instead ask the regex engine to immediately stop as soon as a match has been found. Enabling this mode will thus restore the worst case `O(m * n)` time complexity bound, but at the cost of different semantics. ### Untrusted inputs in practice While providing a `O(m * n)` worst case time bound on all searches goes a long way toward preventing [ReDoS], that doesn't mean every search you can possibly run will complete without burning CPU time. In general, there are a few ways for the `m * n` time bound to still bite you: * You are searching an exceptionally long haystack. No matter how you slice it, a longer haystack will take more time to search. This crate may often make very quick work of even long haystacks because of its literal optimizations, but those aren't available for all regexes. * Unicode character classes can cause searches to be quite slow in some cases. This is especially true when they are combined with counted repetitions. While the regex size limit above will protect you from the most egregious cases, the default size limit still permits pretty big regexes that can execute more slowly than one might expect. * While routines like [`Regex::find`] and [`Regex::captures`] guarantee worst case `O(m * n)` search time, routines like [`Regex::find_iter`] and [`Regex::captures_iter`] actually have worst case `O(m * n^2)` search time. This is because `find_iter` runs many searches, and each search takes worst case `O(m * n)` time. Thus, iteration of all matches in a haystack has worst case `O(m * n^2)`. A good example of a pattern that exhibits this is `(?:A+){1000}|` or even `.*[^A-Z]|[A-Z]`. In general, untrusted haystacks are easier to stomach than untrusted patterns. Untrusted patterns give a lot more control to the caller to impact the performance of a search. In many cases, a regex search will actually execute in average case `O(n)` time (i.e., not dependent on the size of the regex), but this can't be guaranteed in general. 
Therefore, permitting untrusted patterns means that your only line of defense is to put a limit on how big `m` (and perhaps also `n`) can be in `O(m * n)`. `n` is limited by simply inspecting the length of the haystack, while `m` is limited by *both* applying a limit to the length of the pattern *and* applying a limit to the compiled size of the regex via [`RegexBuilder::size_limit`].

It bears repeating: if you're accepting untrusted patterns, it would be a good idea to start with conservative limits on `m` and `n`, and then carefully increase them as needed.

# Crate features

By default, this crate tries pretty hard to make regex matching both as fast as possible and as correct as it can be. This means that there is a lot of code dedicated to performance, the handling of Unicode data and the Unicode data itself. Overall, this leads to more dependencies, larger binaries and longer compile times. This trade off may not be appropriate in all cases, and indeed, even when all Unicode and performance features are disabled, one is still left with a perfectly serviceable regex engine that will work well in many cases. (Note that code is not arbitrarily reducible, and for this reason, the [`regex-lite`](https://docs.rs/regex-lite) crate exists to provide an even more minimal experience by cutting out Unicode and performance, but still maintaining the linear search time bound.)

This crate exposes a number of features for controlling that trade off. Some of these features are strictly performance oriented, such that disabling them won't result in a loss of functionality, but may result in worse performance. Other features, such as the ones controlling the presence or absence of Unicode data, can result in a loss of functionality. For example, if one disables the `unicode-case` feature (described below), then compiling the regex `(?i)a` will fail since Unicode case insensitivity is enabled by default. Instead, callers must use `(?i-u)a` to disable Unicode case folding. Stated differently, enabling or disabling any of the features below can only add or subtract from the total set of valid regular expressions. Enabling or disabling a feature will never modify the match semantics of a regular expression.

Most features below are enabled by default. Features that aren't enabled by default are noted.

### Ecosystem features

* **std** - When enabled, this will cause `regex` to use the standard library. In terms of APIs, `std` causes error types to implement the `std::error::Error` trait. Enabling `std` will also result in performance optimizations, including SIMD and faster synchronization primitives. Notably, **disabling the `std` feature will result in the use of spin locks**. To use a regex engine without `std` and without spin locks, you'll need to drop down to the [`regex-automata`](https://docs.rs/regex-automata) crate.
* **logging** - When enabled, the `log` crate is used to emit messages about regex compilation and search strategies. This is **disabled by default**. This is typically only useful to someone working on this crate's internals, but might be useful if you're doing some rabbit hole performance hacking. Or if you're just interested in the kinds of decisions being made by the regex engine.

### Performance features

**Note**: To get the performance benefits offered by SIMD, `std` must be enabled. None of the `perf-*` features will enable `std` implicitly.

* **perf** - Enables all performance related features except for `perf-dfa-full`.
This feature is enabled by default and is intended to cover all reasonable features that improve performance, even if more are added in the future.
* **perf-dfa** - Enables the use of a lazy DFA for matching. The lazy DFA is used to compile portions of a regex to a very fast DFA on an as-needed basis. This can result in substantial speedups, usually by an order of magnitude on large haystacks. The lazy DFA does not bring in any new dependencies, but it can make compile times longer.
* **perf-dfa-full** - Enables the use of a full DFA for matching. Full DFAs are problematic because they have worst case `O(2^n)` construction time. For this reason, when this feature is enabled, full DFAs are only used for very small regexes and a very small space bound is used during determinization to prevent the DFA from blowing up. This feature is not enabled by default, even as part of `perf`, because it results in fairly sizeable increases in binary size and compilation time. It can result in faster search times, but they tend to be more modest and limited to non-Unicode regexes.
* **perf-onepass** - Enables the use of a one-pass DFA for extracting the positions of capture groups. This optimization applies to a subset of certain types of NFAs and represents the fastest engine in this crate for dealing with capture groups.
* **perf-backtrack** - Enables the use of a bounded backtracking algorithm for extracting the positions of capture groups. This usually sits between the slowest engine (the PikeVM) and the fastest engine (one-pass DFA) for extracting capture groups. It's used whenever the regex is not one-pass and is small enough.
* **perf-inline** - Enables the use of aggressive inlining inside match routines. This reduces the overhead of each match. The aggressive inlining, however, increases compile times and binary size.
* **perf-literal** - Enables the use of literal optimizations for speeding up matches. In some cases, literal optimizations can result in speedups of _several_ orders of magnitude. Disabling this drops the `aho-corasick` and `memchr` dependencies.
* **perf-cache** - This feature used to enable a faster internal cache at the cost of using additional dependencies, but this is no longer an option. A fast internal cache is now used unconditionally with no additional dependencies. This may change in the future.

### Unicode features

* **unicode** - Enables all Unicode features. This feature is enabled by default, and will always cover all Unicode features, even if more are added in the future.
* **unicode-age** - Provide the data for the [Unicode `Age` property](https://www.unicode.org/reports/tr44/tr44-24.html#Character_Age). This makes it possible to use classes like `\p{Age:6.0}` to refer to all codepoints first introduced in Unicode 6.0.
* **unicode-bool** - Provide the data for numerous Unicode boolean properties. The full list is not included here, but contains properties like `Alphabetic`, `Emoji`, `Lowercase`, `Math`, `Uppercase` and `White_Space`.
* **unicode-case** - Provide the data for case insensitive matching using [Unicode's "simple loose matches" specification](https://www.unicode.org/reports/tr18/#Simple_Loose_Matches).
* **unicode-gencat** - Provide the data for [Unicode general categories](https://www.unicode.org/reports/tr44/tr44-24.html#General_Category_Values). This includes, but is not limited to, `Decimal_Number`, `Letter`, `Math_Symbol`, `Number` and `Punctuation`.
* **unicode-perl** - Provide the data for supporting the Unicode-aware Perl character classes, corresponding to `\w`, `\s` and `\d`. This is also necessary for using Unicode-aware word boundary assertions. Note that if this feature is disabled, the `\s` and `\d` character classes are still available if the `unicode-bool` and `unicode-gencat` features are enabled, respectively.
* **unicode-script** - Provide the data for [Unicode scripts and script extensions](https://www.unicode.org/reports/tr24/). This includes, but is not limited to, `Arabic`, `Cyrillic`, `Hebrew`, `Latin` and `Thai`.
* **unicode-segment** - Provide the data necessary to provide the properties used to implement the [Unicode text segmentation algorithms](https://www.unicode.org/reports/tr29/). This enables using classes like `\p{gcb=Extend}`, `\p{wb=Katakana}` and `\p{sb=ATerm}`.

# Other crates

This crate has two required dependencies and several optional dependencies. This section briefly describes them with the goal of raising awareness of how different components of this crate may be used independently.

It is somewhat unusual for a regex engine to have dependencies, as most regex libraries are self contained units with no dependencies other than a particular environment's standard library. Indeed, for other similarly optimized regex engines, most or all of the code in the dependencies of this crate would normally just be inseparable or coupled parts of the crate itself. But since Rust and its tooling ecosystem make the use of dependencies so easy, it made sense to spend some effort de-coupling parts of this crate and making them independently useful.

We only briefly describe each crate here.

* [`regex-lite`](https://docs.rs/regex-lite) is not a dependency of `regex`, but rather, a standalone zero-dependency simpler version of `regex` that prioritizes compile times and binary size. In exchange, it eschews Unicode support and performance. Its match semantics are as identical as possible to the `regex` crate, and for the things it supports, its APIs are identical to the APIs in this crate. In other words, for a lot of use cases, it is a drop-in replacement.
* [`regex-syntax`](https://docs.rs/regex-syntax) provides a regular expression parser via `Ast` and `Hir` types. It also provides routines for extracting literals from a pattern. Folks can use this crate to do analysis, or even to build their own regex engine without having to worry about writing a parser.
* [`regex-automata`](https://docs.rs/regex-automata) provides the regex engines themselves. One of the downsides of finite automata based regex engines is that they often need multiple internal engines in order to have similar or better performance than an unbounded backtracking engine in practice. `regex-automata` in particular provides public APIs for a PikeVM, a bounded backtracker, a one-pass DFA, a lazy DFA, a fully compiled DFA and a meta regex engine that combines them all together. It also has native multi-pattern support and provides a way to compile and serialize full DFAs such that they can be loaded and searched in a no-std no-alloc environment. `regex-automata` itself doesn't even have a required dependency on `regex-syntax`!
* [`memchr`](https://docs.rs/memchr) provides low level SIMD vectorized routines for quickly finding the location of single bytes or even substrings in a haystack. In other words, it provides fast `memchr` and `memmem` routines. These are used by this crate in literal optimizations.
* [`aho-corasick`](https://docs.rs/aho-corasick) provides multi-substring search. It also provides SIMD vectorized routines in the case where the number of substrings to search for is relatively small. The `regex` crate also uses this for literal optimizations.
*/

#![no_std]
#![deny(missing_docs)]
#![cfg_attr(feature = "pattern", feature(pattern))]
// This adds Cargo feature annotations to items in the rustdoc output. Which is
// sadly hugely beneficial for this crate due to the number of features.
#![cfg_attr(docsrs_regex, feature(doc_cfg))]
#![warn(missing_debug_implementations)]

#[cfg(doctest)]
doc_comment::doctest!("../README.md");

extern crate alloc;
#[cfg(any(test, feature = "std"))]
extern crate std;

pub use crate::error::Error;
pub use crate::{builders::string::*, regex::string::*, regexset::string::*};

mod builders;
pub mod bytes;
mod error;
mod find_byte;
#[cfg(feature = "pattern")]
mod pattern;
mod regex;
mod regexset;

/// Escapes all regular expression meta characters in `pattern`.
///
/// The string returned may be safely used as a literal in a regular
/// expression.
pub fn escape(pattern: &str) -> alloc::string::String {
    regex_syntax::escape(pattern)
}

regex-1.12.2/src/pattern.rs

use core::str::pattern::{Pattern, SearchStep, Searcher, Utf8Pattern};

use crate::{Matches, Regex};

#[derive(Debug)]
pub struct RegexSearcher<'r, 't> {
    haystack: &'t str,
    it: Matches<'r, 't>,
    last_step_end: usize,
    next_match: Option<(usize, usize)>,
}

impl<'r> Pattern for &'r Regex {
    type Searcher<'t> = RegexSearcher<'r, 't>;

    fn into_searcher<'t>(self, haystack: &'t str) -> RegexSearcher<'r, 't> {
        RegexSearcher {
            haystack,
            it: self.find_iter(haystack),
            last_step_end: 0,
            next_match: None,
        }
    }

    fn as_utf8_pattern<'p>(&'p self) -> Option<Utf8Pattern<'p>> {
        None
    }
}

unsafe impl<'r, 't> Searcher<'t> for RegexSearcher<'r, 't> {
    #[inline]
    fn haystack(&self) -> &'t str {
        self.haystack
    }

    #[inline]
    fn next(&mut self) -> SearchStep {
        if let Some((s, e)) = self.next_match {
            self.next_match = None;
            self.last_step_end = e;
            return SearchStep::Match(s, e);
        }
        match self.it.next() {
            None => {
                if self.last_step_end < self.haystack().len() {
                    let last = self.last_step_end;
                    self.last_step_end = self.haystack().len();
                    SearchStep::Reject(last, self.haystack().len())
                } else {
                    SearchStep::Done
                }
            }
            Some(m) => {
                let (s, e) = (m.start(), m.end());
                if s == self.last_step_end {
                    self.last_step_end = e;
                    SearchStep::Match(s, e)
                } else {
                    self.next_match = Some((s, e));
                    let last = self.last_step_end;
                    self.last_step_end = s;
                    SearchStep::Reject(last, s)
                }
            }
        }
    }
}

regex-1.12.2/src/regex/bytes.rs

use alloc::{borrow::Cow, string::String, sync::Arc, vec::Vec};

use regex_automata::{meta, util::captures, Input, PatternID};

use crate::{bytes::RegexBuilder, error::Error};

/// A compiled regular expression for searching Unicode haystacks.
///
/// A `Regex` can be used to search haystacks, split haystacks into substrings
/// or replace substrings in a haystack with a different substring. All
/// searching is done with an implicit `(?s:.)*?` at the beginning and end of
/// a pattern. To force an expression to match the whole string (or a prefix
/// or a suffix), you must use an anchor like `^` or `$` (or `\A` and `\z`).
///
/// Like the `Regex` type in the parent module, matches with this regex return
/// byte offsets into the haystack.
/// **Unlike** the parent `Regex` type, these
/// byte offsets may not correspond to UTF-8 sequence boundaries since the
/// regexes in this module can match arbitrary bytes.
///
/// The only methods that allocate new byte strings are the string replacement
/// methods. All other methods (searching and splitting) return borrowed
/// references into the haystack given.
///
/// # Example
///
/// Find the offsets of a US phone number:
///
/// ```
/// use regex::bytes::Regex;
///
/// let re = Regex::new("[0-9]{3}-[0-9]{3}-[0-9]{4}").unwrap();
/// let m = re.find(b"phone: 111-222-3333").unwrap();
/// assert_eq!(7..19, m.range());
/// ```
///
/// # Example: extracting capture groups
///
/// A common way to use regexes is with capture groups. That is, instead of
/// just looking for matches of an entire regex, parentheses are used to create
/// groups that represent part of the match.
///
/// For example, consider a haystack with multiple lines, and each line has
/// three whitespace delimited fields where the second field is expected to be
/// a number and the third field a boolean. To make this convenient, we use
/// the [`Captures::extract`] API to put the strings that match each group
/// into a fixed size array:
///
/// ```
/// use regex::bytes::Regex;
///
/// let hay = b"
/// rabbit 54 true
/// groundhog 2 true
/// does not match
/// fox 109 false
/// ";
/// let re = Regex::new(r"(?m)^\s*(\S+)\s+([0-9]+)\s+(true|false)\s*$").unwrap();
/// let mut fields: Vec<(&[u8], i64, bool)> = vec![];
/// for (_, [f1, f2, f3]) in re.captures_iter(hay).map(|caps| caps.extract()) {
///     // These unwraps are OK because our pattern is written in a way where
///     // all matches for f2 and f3 will be valid UTF-8.
///     let f2 = std::str::from_utf8(f2).unwrap();
///     let f3 = std::str::from_utf8(f3).unwrap();
///     fields.push((f1, f2.parse()?, f3.parse()?));
/// }
/// assert_eq!(fields, vec![
///     (&b"rabbit"[..], 54, true),
///     (&b"groundhog"[..], 2, true),
///     (&b"fox"[..], 109, false),
/// ]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: matching invalid UTF-8
///
/// One of the reasons for searching `&[u8]` haystacks is that the `&[u8]`
/// might not be valid UTF-8. Indeed, with a `bytes::Regex`, patterns that
/// match invalid UTF-8 are explicitly allowed. Here's one example that looks
/// for valid UTF-8 fields that might be separated by invalid UTF-8. In this
/// case, we use `(?s-u:.)`, which matches any byte. Attempting to use it in a
/// top-level `Regex` will result in the regex failing to compile. Notice also
/// that we use `.` with Unicode mode enabled, in which case, only valid UTF-8
/// is matched. In this way, we can build one pattern where some parts only
/// match valid UTF-8 while other parts are more permissive.
///
/// ```
/// use regex::bytes::Regex;
///
/// // F0 9F 92 A9 is the UTF-8 encoding for a Pile of Poo.
/// let hay = b"\xFF\xFFfoo\xFF\xFF\xFF\xF0\x9F\x92\xA9\xFF";
/// // An equivalent to '(?s-u:.)' is '(?-u:[\x00-\xFF])'.
/// let re = Regex::new(r"(?s)(?-u:.)*?(?<f1>.+)(?-u:.)*?(?<f2>.+)").unwrap();
/// let caps = re.captures(hay).unwrap();
/// assert_eq!(&caps["f1"], &b"foo"[..]);
/// assert_eq!(&caps["f2"], "๐Ÿ’ฉ".as_bytes());
/// ```
#[derive(Clone)]
pub struct Regex {
    pub(crate) meta: meta::Regex,
    pub(crate) pattern: Arc<str>,
}

impl core::fmt::Display for Regex {
    /// Shows the original regular expression.
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        write!(f, "{}", self.as_str())
    }
}

impl core::fmt::Debug for Regex {
    /// Shows the original regular expression.
    fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
        f.debug_tuple("Regex").field(&self.as_str()).finish()
    }
}

impl core::str::FromStr for Regex {
    type Err = Error;

    /// Attempts to parse a string into a regular expression
    fn from_str(s: &str) -> Result<Regex, Error> {
        Regex::new(s)
    }
}

impl TryFrom<&str> for Regex {
    type Error = Error;

    /// Attempts to parse a string into a regular expression
    fn try_from(s: &str) -> Result<Regex, Error> {
        Regex::new(s)
    }
}

impl TryFrom<String> for Regex {
    type Error = Error;

    /// Attempts to parse a string into a regular expression
    fn try_from(s: String) -> Result<Regex, Error> {
        Regex::new(&s)
    }
}

/// Core regular expression methods.
impl Regex {
    /// Compiles a regular expression. Once compiled, it can be used repeatedly
    /// to search, split or replace substrings in a haystack.
    ///
    /// Note that regex compilation tends to be a somewhat expensive process,
    /// and unlike higher level environments, compilation is not automatically
    /// cached for you. One should endeavor to compile a regex once and then
    /// reuse it. For example, it's a bad idea to compile the same regex
    /// repeatedly in a loop.
    ///
    /// # Errors
    ///
    /// If an invalid pattern is given, then an error is returned.
    /// An error is also returned if the pattern is valid, but would
    /// produce a regex that is bigger than the configured size limit via
    /// [`RegexBuilder::size_limit`]. (A reasonable size limit is enabled by
    /// default.)
    ///
    /// # Example
    ///
    /// ```
    /// use regex::bytes::Regex;
    ///
    /// // An invalid pattern because of an unclosed parenthesis
    /// assert!(Regex::new(r"foo(bar").is_err());
    /// // An invalid pattern because the regex would be too big
    /// // because Unicode tends to inflate things.
    /// assert!(Regex::new(r"\w{1000}").is_err());
    /// // Disabling Unicode can make the regex much smaller,
    /// // potentially by up to or more than an order of magnitude.
    /// assert!(Regex::new(r"(?-u:\w){1000}").is_ok());
    /// ```
    pub fn new(re: &str) -> Result<Regex, Error> {
        RegexBuilder::new(re).build()
    }

    /// Returns true if and only if there is a match for the regex anywhere
    /// in the haystack given.
    ///
    /// It is recommended to use this method if all you need to do is test
    /// whether a match exists, since the underlying matching engine may be
    /// able to do less work.
    ///
    /// # Example
    ///
    /// Test if some haystack contains at least one word with exactly 13
    /// Unicode word characters:
    ///
    /// ```
    /// use regex::bytes::Regex;
    ///
    /// let re = Regex::new(r"\b\w{13}\b").unwrap();
    /// let hay = b"I categorically deny having triskaidekaphobia.";
    /// assert!(re.is_match(hay));
    /// ```
    #[inline]
    pub fn is_match(&self, haystack: &[u8]) -> bool {
        self.is_match_at(haystack, 0)
    }

    /// This routine searches for the first match of this regex in the
    /// haystack given, and if found, returns a [`Match`]. The `Match`
    /// provides access to both the byte offsets of the match and the actual
    /// substring that matched.
    ///
    /// Note that this should only be used if you want to find the entire
    /// match. If instead you just want to test the existence of a match,
    /// it's potentially faster to use `Regex::is_match(hay)` instead of
    /// `Regex::find(hay).is_some()`.
    ///
    /// # Example
    ///
    /// Find the first word with exactly 13 Unicode word characters:
    ///
    /// ```
    /// use regex::bytes::Regex;
    ///
    /// let re = Regex::new(r"\b\w{13}\b").unwrap();
    /// let hay = b"I categorically deny having triskaidekaphobia.";
    /// let mat = re.find(hay).unwrap();
    /// assert_eq!(2..15, mat.range());
    /// assert_eq!(b"categorically", mat.as_bytes());
    /// ```
    #[inline]
    pub fn find<'h>(&self, haystack: &'h [u8]) -> Option<Match<'h>> {
        self.find_at(haystack, 0)
    }

    /// Returns an iterator that yields successive non-overlapping matches in
    /// the given haystack. The iterator yields values of type [`Match`].
    ///
    /// # Time complexity
    ///
    /// Note that since `find_iter` runs potentially many searches on the
    /// haystack and since each search has worst case `O(m * n)` time
    /// complexity, the overall worst case time complexity for iteration is
    /// `O(m * n^2)`.
    ///
    /// # Example
    ///
    /// Find every word with exactly 13 Unicode word characters:
    ///
    /// ```
    /// use regex::bytes::Regex;
    ///
    /// let re = Regex::new(r"\b\w{13}\b").unwrap();
    /// let hay = b"Retroactively relinquishing remunerations is reprehensible.";
    /// let matches: Vec<_> = re.find_iter(hay).map(|m| m.as_bytes()).collect();
    /// assert_eq!(matches, vec![
    ///     &b"Retroactively"[..],
    ///     &b"relinquishing"[..],
    ///     &b"remunerations"[..],
    ///     &b"reprehensible"[..],
    /// ]);
    /// ```
    #[inline]
    pub fn find_iter<'r, 'h>(&'r self, haystack: &'h [u8]) -> Matches<'r, 'h> {
        Matches { haystack, it: self.meta.find_iter(haystack) }
    }

    /// This routine searches for the first match of this regex in the haystack
    /// given, and if found, returns not only the overall match but also the
    /// matches of each capture group in the regex. If no match is found, then
    /// `None` is returned.
    ///
    /// Capture group `0` always corresponds to an implicit unnamed group that
    /// includes the entire match. If a match is found, this group is always
    /// present. Subsequent groups may be named and are numbered, starting
    /// at 1, by the order in which the opening parenthesis appears in the
    /// pattern. For example, in the pattern `(?<a>.(?<b>.))(?<c>.)`, `a`,
    /// `b` and `c` correspond to capture group indices `1`, `2` and `3`,
    /// respectively.
    ///
    /// You should only use `captures` if you need access to the capture group
    /// matches. Otherwise, [`Regex::find`] is generally faster for discovering
    /// just the overall match.
    ///
    /// # Example
    ///
    /// Say you have some haystack with movie names and their release years,
    /// like "'Citizen Kane' (1941)". It'd be nice if we could search for
    /// strings looking like that, while also extracting the movie name and its
    /// release year separately. The example below shows how to do that.
    ///
    /// ```
    /// use regex::bytes::Regex;
    ///
    /// let re = Regex::new(r"'([^']+)'\s+\((\d{4})\)").unwrap();
    /// let hay = b"Not my favorite movie: 'Citizen Kane' (1941).";
    /// let caps = re.captures(hay).unwrap();
    /// assert_eq!(caps.get(0).unwrap().as_bytes(), b"'Citizen Kane' (1941)");
    /// assert_eq!(caps.get(1).unwrap().as_bytes(), b"Citizen Kane");
    /// assert_eq!(caps.get(2).unwrap().as_bytes(), b"1941");
    /// // You can also access the groups by index using the Index notation.
    /// // Note that this will panic on an invalid index. In this case, these
    /// // accesses are always correct because the overall regex will only
    /// // match when these capture groups match.
/// assert_eq!(&caps[0], b"'Citizen Kane' (1941)"); /// assert_eq!(&caps[1], b"Citizen Kane"); /// assert_eq!(&caps[2], b"1941"); /// ``` /// /// Note that the full match is at capture group `0`. Each subsequent /// capture group is indexed by the order of its opening `(`. /// /// We can make this example a bit clearer by using *named* capture groups: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>\d{4})\)").unwrap(); /// let hay = b"Not my favorite movie: 'Citizen Kane' (1941)."; /// let caps = re.captures(hay).unwrap(); /// assert_eq!(caps.get(0).unwrap().as_bytes(), b"'Citizen Kane' (1941)"); /// assert_eq!(caps.name("title").unwrap().as_bytes(), b"Citizen Kane"); /// assert_eq!(caps.name("year").unwrap().as_bytes(), b"1941"); /// // You can also access the groups by name using the Index notation. /// // Note that this will panic on an invalid group name. In this case, /// // these accesses are always correct because the overall regex will /// // only match when these capture groups match. /// assert_eq!(&caps[0], b"'Citizen Kane' (1941)"); /// assert_eq!(&caps["title"], b"Citizen Kane"); /// assert_eq!(&caps["year"], b"1941"); /// ``` /// /// Here we name the capture groups, which we can access with the `name` /// method or the `Index` notation with a `&str`. Note that the named /// capture groups are still accessible with `get` or the `Index` notation /// with a `usize`. /// /// The `0`th capture group is always unnamed, so it must always be /// accessed with `get(0)` or `[0]`. /// /// Finally, one other way to get the matched substrings is with the /// [`Captures::extract`] API: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"'([^']+)'\s+\((\d{4})\)").unwrap(); /// let hay = b"Not my favorite movie: 'Citizen Kane' (1941)."; /// let (full, [title, year]) = re.captures(hay).unwrap().extract(); /// assert_eq!(full, b"'Citizen Kane' (1941)"); /// assert_eq!(title, b"Citizen Kane"); /// assert_eq!(year, b"1941"); /// ``` #[inline] pub fn captures<'h>(&self, haystack: &'h [u8]) -> Option<Captures<'h>> { self.captures_at(haystack, 0) } /// Returns an iterator that yields successive non-overlapping matches in /// the given haystack. The iterator yields values of type [`Captures`]. /// /// This is the same as [`Regex::find_iter`], but instead of only providing /// access to the overall match, each value yielded includes access to the /// matches of all capture groups in the regex. Reporting this extra match /// data is potentially costly, so callers should only use `captures_iter` /// over `find_iter` when they actually need access to the capture group /// matches. /// /// # Time complexity /// /// Note that since `captures_iter` runs potentially many searches on the /// haystack and since each search has worst case `O(m * n)` time /// complexity, the overall worst case time complexity for iteration is /// `O(m * n^2)`. /// /// # Example /// /// We can use this to find all movie titles and their release years in /// some haystack, where the movie is formatted like "'Title' (xxxx)": /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"'([^']+)'\s+\(([0-9]{4})\)").unwrap(); /// let hay = b"'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931)."; /// let mut movies = vec![]; /// for (_, [title, year]) in re.captures_iter(hay).map(|c| c.extract()) { /// // OK because [0-9]{4} can only match valid UTF-8.
/// let year = std::str::from_utf8(year).unwrap(); /// movies.push((title, year.parse::<i64>()?)); /// } /// assert_eq!(movies, vec![ /// (&b"Citizen Kane"[..], 1941), /// (&b"The Wizard of Oz"[..], 1939), /// (&b"M"[..], 1931), /// ]); /// # Ok::<(), Box<dyn std::error::Error>>(()) /// ``` /// /// Or with named groups: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>[0-9]{4})\)").unwrap(); /// let hay = b"'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931)."; /// let mut it = re.captures_iter(hay); /// /// let caps = it.next().unwrap(); /// assert_eq!(&caps["title"], b"Citizen Kane"); /// assert_eq!(&caps["year"], b"1941"); /// /// let caps = it.next().unwrap(); /// assert_eq!(&caps["title"], b"The Wizard of Oz"); /// assert_eq!(&caps["year"], b"1939"); /// /// let caps = it.next().unwrap(); /// assert_eq!(&caps["title"], b"M"); /// assert_eq!(&caps["year"], b"1931"); /// ``` #[inline] pub fn captures_iter<'r, 'h>( &'r self, haystack: &'h [u8], ) -> CaptureMatches<'r, 'h> { CaptureMatches { haystack, it: self.meta.captures_iter(haystack) } } /// Returns an iterator of substrings of the haystack given, delimited by a /// match of the regex. Namely, each element of the iterator corresponds to /// a part of the haystack that *isn't* matched by the regular expression. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// # Example /// /// To split a string delimited by arbitrary amounts of spaces or tabs: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"[ \t]+").unwrap(); /// let hay = b"a b \t c\td e"; /// let fields: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(fields, vec![ /// &b"a"[..], &b"b"[..], &b"c"[..], &b"d"[..], &b"e"[..], /// ]); /// ``` /// /// # Example: more cases /// /// Basic usage: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r" ").unwrap(); /// let hay = b"Mary had a little lamb"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![ /// &b"Mary"[..], &b"had"[..], &b"a"[..], &b"little"[..], &b"lamb"[..], /// ]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b""; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![&b""[..]]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b"lionXXtigerXleopard"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![ /// &b"lion"[..], &b""[..], &b"tiger"[..], &b"leopard"[..], /// ]); /// /// let re = Regex::new(r"::").unwrap(); /// let hay = b"lion::tiger::leopard"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![&b"lion"[..], &b"tiger"[..], &b"leopard"[..]]); /// ``` /// /// If a haystack contains multiple contiguous matches, you will end up /// with empty spans yielded by the iterator: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b"XXXXaXXbXc"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![ /// &b""[..], &b""[..], &b""[..], &b""[..], /// &b"a"[..], &b""[..], &b"b"[..], &b"c"[..], /// ]); /// /// let re = Regex::new(r"/").unwrap(); /// let hay = b"(///)"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![&b"("[..], &b""[..], &b""[..], &b")"[..]]); /// ``` /// /// Separators
at the start or end of a haystack are neighbored by empty /// substrings. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"0").unwrap(); /// let hay = b"010"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![&b""[..], &b"1"[..], &b""[..]]); /// ``` /// /// When the regex can match the empty string, it splits at every byte /// position in the haystack. This includes between all UTF-8 code units. /// (The top-level [`Regex::split`](crate::Regex::split) will only split /// at valid UTF-8 boundaries.) /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"").unwrap(); /// let hay = "โ˜ƒ".as_bytes(); /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![ /// &[][..], &[b'\xE2'][..], &[b'\x98'][..], &[b'\x83'][..], &[][..], /// ]); /// ``` /// /// Contiguous separators (which commonly show up with whitespace) can lead /// to possibly surprising behavior. For example, this code is correct: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r" ").unwrap(); /// let hay = b" a b c"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// assert_eq!(got, vec![ /// &b""[..], &b""[..], &b""[..], &b""[..], /// &b"a"[..], &b""[..], &b"b"[..], &b"c"[..], /// ]); /// ``` /// /// It does *not* give you `["a", "b", "c"]`. For that behavior, you'd want /// to match contiguous space characters: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r" +").unwrap(); /// let hay = b" a b c"; /// let got: Vec<&[u8]> = re.split(hay).collect(); /// // N.B. This does still include a leading empty span because ' +' /// // matches at the beginning of the haystack. /// assert_eq!(got, vec![&b""[..], &b"a"[..], &b"b"[..], &b"c"[..]]); /// ``` #[inline] pub fn split<'r, 'h>(&'r self, haystack: &'h [u8]) -> Split<'r, 'h> { Split { haystack, it: self.meta.split(haystack) } } /// Returns an iterator of at most `limit` substrings of the haystack /// given, delimited by a match of the regex. (A `limit` of `0` will return /// no substrings.) Namely, each element of the iterator corresponds to a /// part of the haystack that *isn't* matched by the regular expression. /// The remainder of the haystack that is not split will be the last /// element in the iterator. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// Although note that the worst case time here has an upper bound given /// by the `limit` parameter. /// /// # Example /// /// Get the first two words in some haystack: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\W+").unwrap(); /// let hay = b"Hey!
How are you?"; /// let fields: Vec<&[u8]> = re.splitn(hay, 3).collect(); /// assert_eq!(fields, vec![&b"Hey"[..], &b"How"[..], &b"are you?"[..]]); /// ``` /// /// # Examples: more cases /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r" ").unwrap(); /// let hay = b"Mary had a little lamb"; /// let got: Vec<&[u8]> = re.splitn(hay, 3).collect(); /// assert_eq!(got, vec![&b"Mary"[..], &b"had"[..], &b"a little lamb"[..]]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b""; /// let got: Vec<&[u8]> = re.splitn(hay, 3).collect(); /// assert_eq!(got, vec![&b""[..]]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b"lionXXtigerXleopard"; /// let got: Vec<&[u8]> = re.splitn(hay, 3).collect(); /// assert_eq!(got, vec![&b"lion"[..], &b""[..], &b"tigerXleopard"[..]]); /// /// let re = Regex::new(r"::").unwrap(); /// let hay = b"lion::tiger::leopard"; /// let got: Vec<&[u8]> = re.splitn(hay, 2).collect(); /// assert_eq!(got, vec![&b"lion"[..], &b"tiger::leopard"[..]]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b"abcXdef"; /// let got: Vec<&[u8]> = re.splitn(hay, 1).collect(); /// assert_eq!(got, vec![&b"abcXdef"[..]]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b"abcdef"; /// let got: Vec<&[u8]> = re.splitn(hay, 2).collect(); /// assert_eq!(got, vec![&b"abcdef"[..]]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = b"abcXdef"; /// let got: Vec<&[u8]> = re.splitn(hay, 0).collect(); /// assert!(got.is_empty()); /// ``` #[inline] pub fn splitn<'r, 'h>( &'r self, haystack: &'h [u8], limit: usize, ) -> SplitN<'r, 'h> { SplitN { haystack, it: self.meta.splitn(haystack, limit) } } /// Replaces the leftmost-first match in the given haystack with the /// replacement provided. The replacement can be a regular string (where /// `$N` and `$name` are expanded to match capture groups) or a function /// that takes a [`Captures`] and returns the replaced string. /// /// If no match is found, then the haystack is returned unchanged. In that /// case, this implementation will likely return a `Cow::Borrowed` value /// such that no allocation is performed. /// /// When a `Cow::Borrowed` is returned, the value returned is guaranteed /// to be equivalent to the `haystack` given. /// /// # Replacement string syntax /// /// All instances of `$ref` in the replacement string are replaced with /// the substring corresponding to the capture group identified by `ref`. /// /// `ref` may be an integer corresponding to the index of the capture group /// (counted by order of opening parenthesis where `0` is the entire match) /// or it can be a name (consisting of letters, digits or underscores) /// corresponding to a named capture group. /// /// If `ref` isn't a valid capture group (whether the name doesn't exist or /// isn't a valid index), then it is replaced with the empty string. /// /// The longest possible name is used. For example, `$1a` looks up the /// capture group named `1a` and not the capture group at index `1`. To /// exert more precise control over the name, use braces, e.g., `${1}a`. /// /// To write a literal `$` use `$$`. /// /// # Example /// /// Note that this function is polymorphic with respect to the replacement. /// In typical usage, this can just be a normal string: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"[^01]+").unwrap(); /// assert_eq!(re.replace(b"1078910", b""), &b"1010"[..]); /// ``` /// /// But anything satisfying the [`Replacer`] trait will work. 
For example, /// a closure of type `|&Captures| -> Vec<u8>` provides direct access to the /// captures corresponding to a match. This allows one to access capturing /// group matches easily: /// /// ``` /// use regex::bytes::{Captures, Regex}; /// /// let re = Regex::new(r"([^,\s]+),\s+(\S+)").unwrap(); /// let result = re.replace(b"Springsteen, Bruce", |caps: &Captures| { /// let mut buf = vec![]; /// buf.extend_from_slice(&caps[2]); /// buf.push(b' '); /// buf.extend_from_slice(&caps[1]); /// buf /// }); /// assert_eq!(result, &b"Bruce Springsteen"[..]); /// ``` /// /// But this is a bit cumbersome to use all the time. Instead, a simple /// syntax is supported (as described above) that expands `$name` into the /// corresponding capture group. Here's the last example, but using this /// expansion technique with named capture groups: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(?<first>\S+)").unwrap(); /// let result = re.replace(b"Springsteen, Bruce", b"$first $last"); /// assert_eq!(result, &b"Bruce Springsteen"[..]); /// ``` /// /// Note that using `$2` instead of `$first` or `$1` instead of `$last` /// would produce the same result. To write a literal `$` use `$$`. /// /// Sometimes the replacement string requires use of curly braces to /// delineate a capture group replacement when it is adjacent to some other /// literal text. For example, if we wanted to join two words together with /// an underscore: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<second>\w+)").unwrap(); /// let result = re.replace(b"deep fried", b"${first}_$second"); /// assert_eq!(result, &b"deep_fried"[..]); /// ``` /// /// Without the curly braces, the capture group name `first_` would be /// used, and since it doesn't exist, it would be replaced with the empty /// string. /// /// Finally, sometimes you just want to replace a literal string with no /// regard for capturing group expansion. This can be done by wrapping a /// string with [`NoExpand`]: /// /// ``` /// use regex::bytes::{NoExpand, Regex}; /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(\S+)").unwrap(); /// let result = re.replace(b"Springsteen, Bruce", NoExpand(b"$2 $last")); /// assert_eq!(result, &b"$2 $last"[..]); /// ``` /// /// Using `NoExpand` may also be faster, since the replacement string won't /// need to be parsed for the `$` syntax. #[inline] pub fn replace<'h, R: Replacer>( &self, haystack: &'h [u8], rep: R, ) -> Cow<'h, [u8]> { self.replacen(haystack, 1, rep) } /// Replaces all non-overlapping matches in the haystack with the /// replacement provided. This is the same as calling `replacen` with /// `limit` set to `0`. /// /// If no match is found, then the haystack is returned unchanged. In that /// case, this implementation will likely return a `Cow::Borrowed` value /// such that no allocation is performed. /// /// When a `Cow::Borrowed` is returned, the value returned is guaranteed /// to be equivalent to the `haystack` given. /// /// The documentation for [`Regex::replace`] goes into more detail about /// what kinds of replacement strings are supported. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`.
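///
/// For instance, a minimal sketch showing that every non-overlapping
/// match is rewritten in a single pass:
///
/// ```
/// use regex::bytes::Regex;
///
/// let re = Regex::new(r"[0-9]+").unwrap();
/// // Each run of digits is one match, so three replacements occur here.
/// assert_eq!(re.replace_all(b"a1b22c333", b"#"), &b"a#b#c#"[..]);
/// ```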
/// /// # Fallibility /// /// If you need to write a replacement routine where any individual /// replacement might "fail," doing so with this API isn't really feasible /// because there's no way to stop the search process if a replacement /// fails. Instead, if you need this functionality, you should consider /// implementing your own replacement routine: /// /// ``` /// use regex::bytes::{Captures, Regex}; /// /// fn replace_all<E>( /// re: &Regex, /// haystack: &[u8], /// replacement: impl Fn(&Captures) -> Result<Vec<u8>, E>, /// ) -> Result<Vec<u8>, E> { /// let mut new = Vec::with_capacity(haystack.len()); /// let mut last_match = 0; /// for caps in re.captures_iter(haystack) { /// let m = caps.get(0).unwrap(); /// new.extend_from_slice(&haystack[last_match..m.start()]); /// new.extend_from_slice(&replacement(&caps)?); /// last_match = m.end(); /// } /// new.extend_from_slice(&haystack[last_match..]); /// Ok(new) /// } /// /// // Let's replace each word with the number of bytes in that word. /// // But if we see a word that is "too long," we'll give up. /// let re = Regex::new(r"\w+").unwrap(); /// let replacement = |caps: &Captures| -> Result<Vec<u8>, &'static str> { /// if caps[0].len() >= 5 { /// return Err("word too long"); /// } /// Ok(caps[0].len().to_string().into_bytes()) /// }; /// assert_eq!( /// Ok(b"2 3 3 3?".to_vec()), /// replace_all(&re, b"hi how are you?", &replacement), /// ); /// assert!(replace_all(&re, b"hi there", &replacement).is_err()); /// ``` /// /// # Example /// /// This example shows how to flip the order of whitespace (excluding line /// terminators) delimited fields, and normalize the whitespace that /// delimits the fields: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?m)^(\S+)[\s--\r\n]+(\S+)$").unwrap(); /// let hay = b" /// Greetings 1973 /// Wild\t1973 /// BornToRun\t\t\t\t1975 /// Darkness 1978 /// TheRiver 1980 /// "; /// let new = re.replace_all(hay, b"$2 $1"); /// assert_eq!(new, &b" /// 1973 Greetings /// 1973 Wild /// 1975 BornToRun /// 1978 Darkness /// 1980 TheRiver /// "[..]); /// ``` #[inline] pub fn replace_all<'h, R: Replacer>( &self, haystack: &'h [u8], rep: R, ) -> Cow<'h, [u8]> { self.replacen(haystack, 0, rep) } /// Replaces at most `limit` non-overlapping matches in the haystack with /// the replacement provided. If `limit` is `0`, then all non-overlapping /// matches are replaced. That is, `Regex::replace_all(hay, rep)` is /// equivalent to `Regex::replacen(hay, 0, rep)`. /// /// If no match is found, then the haystack is returned unchanged. In that /// case, this implementation will likely return a `Cow::Borrowed` value /// such that no allocation is performed. /// /// When a `Cow::Borrowed` is returned, the value returned is guaranteed /// to be equivalent to the `haystack` given. /// /// The documentation for [`Regex::replace`] goes into more detail about /// what kinds of replacement strings are supported. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// Although note that the worst case time here has an upper bound given /// by the `limit` parameter. /// /// # Fallibility /// /// See the corresponding section in the docs for [`Regex::replace_all`] /// for tips on how to deal with a replacement routine that can fail.
/// /// # Example /// /// This example shows how to flip the order of whitespace (excluding line /// terminators) delimited fields, and normalize the whitespace that /// delimits the fields. But we only do it for the first two matches. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?m)^(\S+)[\s--\r\n]+(\S+)$").unwrap(); /// let hay = b" /// Greetings 1973 /// Wild\t1973 /// BornToRun\t\t\t\t1975 /// Darkness 1978 /// TheRiver 1980 /// "; /// let new = re.replacen(hay, 2, b"$2 $1"); /// assert_eq!(new, &b" /// 1973 Greetings /// 1973 Wild /// BornToRun\t\t\t\t1975 /// Darkness 1978 /// TheRiver 1980 /// "[..]); /// ``` #[inline] pub fn replacen<'h, R: Replacer>( &self, haystack: &'h [u8], limit: usize, mut rep: R, ) -> Cow<'h, [u8]> { // If we know that the replacement doesn't have any capture expansions, // then we can use the fast path. The fast path can make a tremendous // difference: // // 1) We use `find_iter` instead of `captures_iter`. Not asking for // captures generally makes the regex engines faster. // 2) We don't need to look up all of the capture groups and do // replacements inside the replacement string. We just push it // at each match and be done with it. if let Some(rep) = rep.no_expansion() { let mut it = self.find_iter(haystack).enumerate().peekable(); if it.peek().is_none() { return Cow::Borrowed(haystack); } let mut new = Vec::with_capacity(haystack.len()); let mut last_match = 0; for (i, m) in it { new.extend_from_slice(&haystack[last_match..m.start()]); new.extend_from_slice(&rep); last_match = m.end(); if limit > 0 && i >= limit - 1 { break; } } new.extend_from_slice(&haystack[last_match..]); return Cow::Owned(new); } // The slower path, which we use if the replacement needs access to // capture groups. let mut it = self.captures_iter(haystack).enumerate().peekable(); if it.peek().is_none() { return Cow::Borrowed(haystack); } let mut new = Vec::with_capacity(haystack.len()); let mut last_match = 0; for (i, cap) in it { // unwrap on 0 is OK because captures only reports matches let m = cap.get(0).unwrap(); new.extend_from_slice(&haystack[last_match..m.start()]); rep.replace_append(&cap, &mut new); last_match = m.end(); if limit > 0 && i >= limit - 1 { break; } } new.extend_from_slice(&haystack[last_match..]); Cow::Owned(new) } } /// A group of advanced or "lower level" search methods. Some methods permit /// starting the search at a position greater than `0` in the haystack. Other /// methods permit reusing allocations, for example, when extracting the /// matches for capture groups. impl Regex { /// Returns the end byte offset of the first match in the haystack given. /// /// This method may have the same performance characteristics as /// `is_match`. Behaviorally, it doesn't just report whether a match /// occurs, but also the end offset for a match. In particular, the offset /// returned *may be shorter* than the proper end of the leftmost-first /// match that you would find via [`Regex::find`]. /// /// Note that it is not guaranteed that this routine finds the shortest or /// "earliest" possible match. Instead, the main idea of this API is that /// it returns the offset at the point at which the internal regex engine /// has determined that a match has occurred. This may vary depending on /// which internal regex engine is used, and thus, the offset itself may /// change based on internal heuristics.
/// /// # Example /// /// Typically, `a+` would match the entire first sequence of `a` in some /// haystack, but `shortest_match` *may* give up as soon as it sees the /// first `a`. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"a+").unwrap(); /// let offset = re.shortest_match(b"aaaaa").unwrap(); /// assert_eq!(offset, 1); /// ``` #[inline] pub fn shortest_match(&self, haystack: &[u8]) -> Option<usize> { self.shortest_match_at(haystack, 0) } /// Returns the same as `shortest_match`, but starts the search at the /// given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only match /// when `start == 0`. /// /// If a match is found, the offset returned is relative to the beginning /// of the haystack, not the beginning of the search. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = b"eschew"; /// // We get a match here, but it's probably not intended. /// assert_eq!(re.shortest_match(&hay[2..]), Some(4)); /// // No match because the assertions take the context into account. /// assert_eq!(re.shortest_match_at(hay, 2), None); /// ``` #[inline] pub fn shortest_match_at( &self, haystack: &[u8], start: usize, ) -> Option<usize> { let input = Input::new(haystack).earliest(true).span(start..haystack.len()); self.meta.search_half(&input).map(|hm| hm.offset()) } /// Returns the same as [`Regex::is_match`], but starts the search at the /// given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = b"eschew"; /// // We get a match here, but it's probably not intended. /// assert!(re.is_match(&hay[2..])); /// // No match because the assertions take the context into account. /// assert!(!re.is_match_at(hay, 2)); /// ``` #[inline] pub fn is_match_at(&self, haystack: &[u8], start: usize) -> bool { self.meta.is_match(Input::new(haystack).span(start..haystack.len())) } /// Returns the same as [`Regex::find`], but starts the search at the given /// offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = b"eschew"; /// // We get a match here, but it's probably not intended. 
/// assert_eq!(re.find(&hay[2..]).map(|m| m.range()), Some(0..4)); /// // No match because the assertions take the context into account. /// assert_eq!(re.find_at(hay, 2), None); /// ``` #[inline] pub fn find_at<'h>( &self, haystack: &'h [u8], start: usize, ) -> Option<Match<'h>> { let input = Input::new(haystack).span(start..haystack.len()); self.meta.find(input).map(|m| Match::new(haystack, m.start(), m.end())) } /// Returns the same as [`Regex::captures`], but starts the search at the /// given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = b"eschew"; /// // We get a match here, but it's probably not intended. /// assert_eq!(&re.captures(&hay[2..]).unwrap()[0], b"chew"); /// // No match because the assertions take the context into account. /// assert!(re.captures_at(hay, 2).is_none()); /// ``` #[inline] pub fn captures_at<'h>( &self, haystack: &'h [u8], start: usize, ) -> Option<Captures<'h>> { let input = Input::new(haystack).span(start..haystack.len()); let mut caps = self.meta.create_captures(); self.meta.captures(input, &mut caps); if caps.is_match() { let static_captures_len = self.static_captures_len(); Some(Captures { haystack, caps, static_captures_len }) } else { None } } /// This is like [`Regex::captures`], but writes the byte offsets of each /// capture group match into the locations given. /// /// A [`CaptureLocations`] stores the same byte offsets as a [`Captures`], /// but does *not* store a reference to the haystack. This makes its API /// a bit lower level and less convenient. But in exchange, callers /// may allocate their own `CaptureLocations` and reuse it for multiple /// searches. This may be helpful if allocating a `Captures` shows up in a /// profile as too costly. /// /// To create a `CaptureLocations` value, use the /// [`Regex::capture_locations`] method. /// /// This also returns the overall match if one was found. When a match is /// found, its offsets are also always stored in `locs` at index `0`. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"^([a-z]+)=(\S*)$").unwrap(); /// let mut locs = re.capture_locations(); /// assert!(re.captures_read(&mut locs, b"id=foo123").is_some()); /// assert_eq!(Some((0, 9)), locs.get(0)); /// assert_eq!(Some((0, 2)), locs.get(1)); /// assert_eq!(Some((3, 9)), locs.get(2)); /// ``` #[inline] pub fn captures_read<'h>( &self, locs: &mut CaptureLocations, haystack: &'h [u8], ) -> Option<Match<'h>> { self.captures_read_at(locs, haystack, 0) } /// Returns the same as [`Regex::captures_read`], but starts the search at /// the given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. 
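///
/// For instance, a minimal sketch of the boundary condition: a `start`
/// equal to `haystack.len()` is permitted and searches the empty
/// remainder of the haystack:
///
/// ```
/// use regex::bytes::Regex;
///
/// let re = Regex::new(r"a*").unwrap();
/// let hay = b"aaa";
/// let mut locs = re.capture_locations();
/// // `a*` can match the empty string at the very end of the haystack,
/// // so starting the search at `hay.len()` still yields a match.
/// assert!(re.captures_read_at(&mut locs, hay, 3).is_some());
/// ```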
/// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = b"eschew"; /// let mut locs = re.capture_locations(); /// // We get a match here, but it's probably not intended. /// assert!(re.captures_read(&mut locs, &hay[2..]).is_some()); /// // No match because the assertions take the context into account. /// assert!(re.captures_read_at(&mut locs, hay, 2).is_none()); /// ``` #[inline] pub fn captures_read_at<'h>( &self, locs: &mut CaptureLocations, haystack: &'h [u8], start: usize, ) -> Option<Match<'h>> { let input = Input::new(haystack).span(start..haystack.len()); self.meta.search_captures(&input, &mut locs.0); locs.0.get_match().map(|m| Match::new(haystack, m.start(), m.end())) } /// An undocumented alias for `captures_read_at`. /// /// The `regex-capi` crate previously used this routine, so to avoid /// breaking that crate, we continue to provide the name as an undocumented /// alias. #[doc(hidden)] #[inline] pub fn read_captures_at<'h>( &self, locs: &mut CaptureLocations, haystack: &'h [u8], start: usize, ) -> Option<Match<'h>> { self.captures_read_at(locs, haystack, start) } } /// Auxiliary methods. impl Regex { /// Returns the original string of this regex. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"foo\w+bar").unwrap(); /// assert_eq!(re.as_str(), r"foo\w+bar"); /// ``` #[inline] pub fn as_str(&self) -> &str { &self.pattern } /// Returns an iterator over the capture names in this regex. /// /// The iterator returned yields elements of type `Option<&str>`. That is, /// the iterator yields values for all capture groups, even ones that are /// unnamed. The order of the groups corresponds to the order of each /// group's opening parenthesis. /// /// The first element of the iterator always yields the group corresponding /// to the overall match, and this group is always unnamed. Therefore, the /// iterator always yields at least one group. /// /// # Example /// /// This shows basic usage with a mix of named and unnamed capture groups: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<a>.(?<b>.))(.)(?:.)(?<c>.)").unwrap(); /// let mut names = re.capture_names(); /// assert_eq!(names.next(), Some(None)); /// assert_eq!(names.next(), Some(Some("a"))); /// assert_eq!(names.next(), Some(Some("b"))); /// assert_eq!(names.next(), Some(None)); /// // the '(?:.)' group is non-capturing and so doesn't appear here! /// assert_eq!(names.next(), Some(Some("c"))); /// assert_eq!(names.next(), None); /// ``` /// /// The iterator always yields at least one element, even for regexes with /// no capture groups and even for regexes that can never match: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"").unwrap(); /// let mut names = re.capture_names(); /// assert_eq!(names.next(), Some(None)); /// assert_eq!(names.next(), None); /// /// let re = Regex::new(r"[a&&b]").unwrap(); /// let mut names = re.capture_names(); /// assert_eq!(names.next(), Some(None)); /// assert_eq!(names.next(), None); /// ``` #[inline] pub fn capture_names(&self) -> CaptureNames<'_> { CaptureNames(self.meta.group_info().pattern_names(PatternID::ZERO)) } /// Returns the number of capture groups in this regex.
/// /// This includes all named and unnamed groups, including the implicit /// unnamed group that is always present and corresponds to the entire /// match. /// /// Since the implicit unnamed group is always included in this length, the /// length returned is guaranteed to be greater than zero. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"foo").unwrap(); /// assert_eq!(1, re.captures_len()); /// /// let re = Regex::new(r"(foo)").unwrap(); /// assert_eq!(2, re.captures_len()); /// /// let re = Regex::new(r"(?<a>.(?<b>.))(.)(?:.)(?<c>.)").unwrap(); /// assert_eq!(5, re.captures_len()); /// /// let re = Regex::new(r"[a&&b]").unwrap(); /// assert_eq!(1, re.captures_len()); /// ``` #[inline] pub fn captures_len(&self) -> usize { self.meta.group_info().group_len(PatternID::ZERO) } /// Returns the total number of capturing groups that appear in every /// possible match. /// /// If the number of capture groups can vary depending on the match, then /// this returns `None`. That is, a value is only returned when the number /// of matching groups is invariant or "static." /// /// Note that like [`Regex::captures_len`], this **does** include the /// implicit capturing group corresponding to the entire match. Therefore, /// when a non-`None` value is returned, it is guaranteed to be at least `1`. /// Stated differently, a return value of `Some(0)` is impossible. /// /// # Example /// /// This shows a few cases where a static number of capture groups is /// available and a few cases where it is not. /// /// ``` /// use regex::bytes::Regex; /// /// let len = |pattern| { /// Regex::new(pattern).map(|re| re.static_captures_len()) /// }; /// /// assert_eq!(Some(1), len("a")?); /// assert_eq!(Some(2), len("(a)")?); /// assert_eq!(Some(2), len("(a)|(b)")?); /// assert_eq!(Some(3), len("(a)(b)|(c)(d)")?); /// assert_eq!(None, len("(a)|b")?); /// assert_eq!(None, len("a|(b)")?); /// assert_eq!(None, len("(b)*")?); /// assert_eq!(Some(2), len("(b)+")?); /// /// # Ok::<(), Box<dyn std::error::Error>>(()) /// ``` #[inline] pub fn static_captures_len(&self) -> Option<usize> { self.meta.static_captures_len() } /// Returns a freshly allocated set of capture locations that can /// be reused in multiple calls to [`Regex::captures_read`] or /// [`Regex::captures_read_at`]. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(.)(.)(\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// assert!(re.captures_read(&mut locs, b"Padron").is_some()); /// assert_eq!(locs.get(0), Some((0, 6))); /// assert_eq!(locs.get(1), Some((0, 1))); /// assert_eq!(locs.get(2), Some((1, 2))); /// assert_eq!(locs.get(3), Some((2, 6))); /// ``` #[inline] pub fn capture_locations(&self) -> CaptureLocations { CaptureLocations(self.meta.create_captures()) } /// An alias for `capture_locations` to preserve backward compatibility. /// /// The `regex-capi` crate uses this method, so to avoid breaking that /// crate, we continue to export it as an undocumented API. #[doc(hidden)] #[inline] pub fn locations(&self) -> CaptureLocations { self.capture_locations() } } /// Represents a single match of a regex in a haystack. /// /// A `Match` contains both the start and end byte offsets of the match and the /// actual substring corresponding to the range of those byte offsets. It is /// guaranteed that `start <= end`. When `start == end`, the match is empty. /// /// Unlike the top-level `Match` type, this `Match` type is produced by APIs /// that search `&[u8]` haystacks.
This means that the offsets in a `Match` can /// point to anywhere in the haystack, including in a place that splits the /// UTF-8 encoding of a Unicode scalar value. /// /// The lifetime parameter `'h` refers to the lifetime of the haystack that /// this match was produced from. /// /// # Numbering /// /// The byte offsets in a `Match` form a half-open interval. That is, the /// start of the range is inclusive and the end of the range is exclusive. /// For example, given a haystack `abcFOOxyz` and a match of `FOO`, its byte /// offset range starts at `3` and ends at `6`. `3` corresponds to `F` and /// `6` corresponds to `x`, which is one past the end of the match. This /// corresponds to the same kind of slicing that Rust uses. /// /// For more on why this was chosen over other schemes (aside from being /// consistent with how Rust the language works), see [this discussion] and /// [Dijkstra's note on a related topic][note]. /// /// [this discussion]: https://github.com/rust-lang/regex/discussions/866 /// [note]: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html /// /// # Example /// /// This example shows the value of each of the methods on `Match` for a /// particular search. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"\p{Greek}+").unwrap(); /// let hay = "Greek: ฮฑฮฒฮณฮด".as_bytes(); /// let m = re.find(hay).unwrap(); /// assert_eq!(7, m.start()); /// assert_eq!(15, m.end()); /// assert!(!m.is_empty()); /// assert_eq!(8, m.len()); /// assert_eq!(7..15, m.range()); /// assert_eq!("ฮฑฮฒฮณฮด".as_bytes(), m.as_bytes()); /// ``` #[derive(Copy, Clone, Eq, PartialEq)] pub struct Match<'h> { haystack: &'h [u8], start: usize, end: usize, } impl<'h> Match<'h> { /// Returns the byte offset of the start of the match in the haystack. The /// start of the match corresponds to the position where the match begins /// and includes the first byte in the match. /// /// It is guaranteed that `Match::start() <= Match::end()`. /// /// Unlike the top-level `Match` type, the start offset may appear anywhere /// in the haystack. This includes between the code units of a UTF-8 /// encoded Unicode scalar value. #[inline] pub fn start(&self) -> usize { self.start } /// Returns the byte offset of the end of the match in the haystack. The /// end of the match corresponds to the byte immediately following the last /// byte in the match. This means that `&slice[start..end]` works as one /// would expect. /// /// It is guaranteed that `Match::start() <= Match::end()`. /// /// Unlike the top-level `Match` type, the end offset may appear anywhere /// in the haystack. This includes between the code units of a UTF-8 /// encoded Unicode scalar value. #[inline] pub fn end(&self) -> usize { self.end } /// Returns true if and only if this match has a length of zero. /// /// Note that an empty match can only occur when the regex itself can /// match the empty string. Here are some examples of regexes that can /// all match the empty string: `^`, `^$`, `\b`, `a?`, `a*`, `a{0}`, /// `(foo|\d+|quux)?`. #[inline] pub fn is_empty(&self) -> bool { self.start == self.end } /// Returns the length, in bytes, of this match. #[inline] pub fn len(&self) -> usize { self.end - self.start } /// Returns the range over the starting and ending byte offsets of the /// match in the haystack. #[inline] pub fn range(&self) -> core::ops::Range<usize> { self.start..self.end } /// Returns the substring of the haystack that matched.
#[inline] pub fn as_bytes(&self) -> &'h [u8] { &self.haystack[self.range()] } /// Creates a new match from the given haystack and byte offsets. #[inline] fn new(haystack: &'h [u8], start: usize, end: usize) -> Match<'h> { Match { haystack, start, end } } } impl<'h> core::fmt::Debug for Match<'h> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { use regex_automata::util::escape::DebugHaystack; let mut fmt = f.debug_struct("Match"); fmt.field("start", &self.start) .field("end", &self.end) .field("bytes", &DebugHaystack(&self.as_bytes())); fmt.finish() } } impl<'h> From<Match<'h>> for &'h [u8] { fn from(m: Match<'h>) -> &'h [u8] { m.as_bytes() } } impl<'h> From<Match<'h>> for core::ops::Range<usize> { fn from(m: Match<'h>) -> core::ops::Range<usize> { m.range() } } /// Represents the capture groups for a single match. /// /// Capture groups refer to parts of a regex enclosed in parentheses. They /// can be optionally named. The purpose of capture groups is to be able to /// reference different parts of a match based on the original pattern. In /// essence, a `Captures` is a container of [`Match`] values for each group /// that participated in a regex match. Each `Match` can be looked up by either /// its capture group index or name (if it has one). /// /// For example, say you want to match the individual letters in a 5-letter /// word: /// /// ```text /// (?<first>\w)(\w)(?:\w)\w(?<last>\w) /// ``` /// /// This regex has 4 capture groups: /// /// * The group at index `0` corresponds to the overall match. It is always /// present in every match and never has a name. /// * The group at index `1` with name `first` corresponding to the first /// letter. /// * The group at index `2` with no name corresponding to the second letter. /// * The group at index `3` with name `last` corresponding to the fifth and /// last letter. /// /// Notice that `(?:\w)` was not listed above as a capture group despite it /// being enclosed in parentheses. That's because `(?:pattern)` is a special /// syntax that permits grouping but *without* capturing. The reason for not /// treating it as a capture is that tracking and reporting capture groups /// requires additional state that may lead to slower searches. So using as few /// capture groups as possible can help performance. (Although the difference /// in performance of a couple of capture groups is likely immaterial.) /// /// Values with this type are created by [`Regex::captures`] or /// [`Regex::captures_iter`]. /// /// `'h` is the lifetime of the haystack that these captures were matched from. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<first>\w)(\w)(?:\w)\w(?<last>\w)").unwrap(); /// let caps = re.captures(b"toady").unwrap(); /// assert_eq!(b"toady", &caps[0]); /// assert_eq!(b"t", &caps["first"]); /// assert_eq!(b"o", &caps[2]); /// assert_eq!(b"y", &caps["last"]); /// ``` pub struct Captures<'h> { haystack: &'h [u8], caps: captures::Captures, static_captures_len: Option<usize>, } impl<'h> Captures<'h> { /// Returns the `Match` associated with the capture group at index `i`. If /// `i` does not correspond to a capture group, or if the capture group did /// not participate in the match, then `None` is returned. /// /// When `i == 0`, this is guaranteed to return a non-`None` value. 
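///
/// For instance, a minimal sketch of the group `0` guarantee (a
/// `Captures` value only exists when the overall regex matched):
///
/// ```
/// use regex::bytes::Regex;
///
/// let re = Regex::new(r"(b)?a").unwrap();
/// let caps = re.captures(b"a").unwrap();
/// // Group 0 is always present, even when an optional group is not.
/// assert!(caps.get(0).is_some());
/// assert!(caps.get(1).is_none());
/// ```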
/// /// # Examples /// /// Get the substring that matched with a default of an empty string if the /// group didn't participate in the match: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"[a-z]+(?:([0-9]+)|([A-Z]+))").unwrap(); /// let caps = re.captures(b"abc123").unwrap(); /// /// let substr1 = caps.get(1).map_or(&b""[..], |m| m.as_bytes()); /// let substr2 = caps.get(2).map_or(&b""[..], |m| m.as_bytes()); /// assert_eq!(substr1, b"123"); /// assert_eq!(substr2, b""); /// ``` #[inline] pub fn get(&self, i: usize) -> Option<Match<'h>> { self.caps .get_group(i) .map(|sp| Match::new(self.haystack, sp.start, sp.end)) } /// Returns the overall match for the capture. /// /// This returns the match for index `0`. That is, it is equivalent to /// `m.get(0).unwrap()`. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"[a-z]+([0-9]+)").unwrap(); /// let caps = re.captures(b" abc123-def").unwrap(); /// /// assert_eq!(caps.get_match().as_bytes(), b"abc123"); /// ``` #[inline] pub fn get_match(&self) -> Match<'h> { self.get(0).unwrap() } /// Returns the `Match` associated with the capture group named `name`. If /// `name` isn't a valid capture group or it refers to a group that didn't /// match, then `None` is returned. /// /// Note that unlike `caps["name"]`, this returns a `Match` whose lifetime /// matches the lifetime of the haystack in this `Captures` value. /// Conversely, the substring returned by `caps["name"]` has a lifetime /// of the `Captures` value, which is likely shorter than the lifetime of /// the haystack. In some cases, it may be necessary to use this method to /// access the matching substring instead of the `caps["name"]` notation. /// /// # Examples /// /// Get the substring that matched with a default of an empty string if the /// group didn't participate in the match: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new( /// r"[a-z]+(?:(?<numbers>[0-9]+)|(?<letters>[A-Z]+))", /// ).unwrap(); /// let caps = re.captures(b"abc123").unwrap(); /// /// let numbers = caps.name("numbers").map_or(&b""[..], |m| m.as_bytes()); /// let letters = caps.name("letters").map_or(&b""[..], |m| m.as_bytes()); /// assert_eq!(numbers, b"123"); /// assert_eq!(letters, b""); /// ``` #[inline] pub fn name(&self, name: &str) -> Option<Match<'h>> { self.caps .get_group_by_name(name) .map(|sp| Match::new(self.haystack, sp.start, sp.end)) } /// This is a convenience routine for extracting the substrings /// corresponding to matching capture groups. /// /// This returns a tuple where the first element corresponds to the full /// substring of the haystack that matched the regex. The second element is /// an array of substrings, with each corresponding to the substring that /// matched for a particular capture group. /// /// # Panics /// /// This panics if the number of possible matching groups in this /// `Captures` value is not fixed to `N` in all circumstances. /// More precisely, this routine only works when `N` is equivalent to /// [`Regex::static_captures_len`]. /// /// Stated more plainly, if the number of matching capture groups in a /// regex can vary from match to match, then this function always panics. /// /// For example, `(a)(b)|(c)` could produce two matching capture groups /// or one matching capture group for any given match. Therefore, one /// cannot use `extract` with such a pattern.
/// /// But a pattern like `(a)(b)|(c)(d)` can be used with `extract` because /// the number of capture groups in every match is always equivalent, /// even if the capture _indices_ in each match are not. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap(); /// let hay = b"On 2010-03-14, I became a Tennessee lamb."; /// let Some((full, [year, month, day])) = /// re.captures(hay).map(|caps| caps.extract()) else { return }; /// assert_eq!(b"2010-03-14", full); /// assert_eq!(b"2010", year); /// assert_eq!(b"03", month); /// assert_eq!(b"14", day); /// ``` /// /// # Example: iteration /// /// This example shows how to use this method when iterating over all /// `Captures` matches in a haystack. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap(); /// let hay = b"1973-01-05, 1975-08-25 and 1980-10-18"; /// /// let mut dates: Vec<(&[u8], &[u8], &[u8])> = vec![]; /// for (_, [y, m, d]) in re.captures_iter(hay).map(|c| c.extract()) { /// dates.push((y, m, d)); /// } /// assert_eq!(dates, vec![ /// (&b"1973"[..], &b"01"[..], &b"05"[..]), /// (&b"1975"[..], &b"08"[..], &b"25"[..]), /// (&b"1980"[..], &b"10"[..], &b"18"[..]), /// ]); /// ``` /// /// # Example: parsing different formats /// /// This API is particularly useful when you need to extract a particular /// value that might occur in a different format. Consider, for example, /// an identifier that might be in double quotes or single quotes: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r#"id:(?:"([^"]+)"|'([^']+)')"#).unwrap(); /// let hay = br#"The first is id:"foo" and the second is id:'bar'."#; /// let mut ids = vec![]; /// for (_, [id]) in re.captures_iter(hay).map(|c| c.extract()) { /// ids.push(id); /// } /// assert_eq!(ids, vec![b"foo", b"bar"]); /// ``` pub fn extract<const N: usize>(&self) -> (&'h [u8], [&'h [u8]; N]) { let len = self .static_captures_len .expect("number of capture groups can vary in a match") .checked_sub(1) .expect("number of groups is always greater than zero"); assert_eq!(N, len, "asked for {N} groups, but must ask for {len}"); // The regex-automata variant of extract is a bit more permissive. // It doesn't require the number of matching capturing groups to be // static, and you can even request fewer groups than what's there. So // this is guaranteed to never panic because we've asserted above that // the user has requested precisely the number of groups that must be // present in any match for this regex. self.caps.extract_bytes(self.haystack) } /// Expands all instances of `$ref` in `replacement` to the corresponding /// capture group, and writes them to the `dst` buffer given. A `ref` can /// be a capture group index or a name. If `ref` doesn't refer to a capture /// group that participated in the match, then it is replaced with the /// empty string. /// /// # Format /// /// The format of the replacement string supports two different kinds of /// capture references: unbraced and braced. /// /// For the unbraced format, the format supported is `$ref` where `ref` /// can be any sequence of characters in the class `[0-9A-Za-z_]`. `ref` /// is always the longest possible parse. So for example, `$1a` corresponds /// to the capture group named `1a` and not the capture group at index `1`. /// If `ref` matches `^[0-9]+$`, then it is treated as a capture group /// index itself and not a name.
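///
/// For instance, a minimal sketch of the longest-parse rule:
///
/// ```
/// use regex::bytes::Regex;
///
/// let re = Regex::new(r"(\w+)").unwrap();
/// let caps = re.captures(b"hello").unwrap();
/// let mut dst = vec![];
/// // `$1a` parses as a reference to a group named `1a`, which doesn't
/// // exist here, so it expands to the empty string.
/// caps.expand(b"[$1a]", &mut dst);
/// assert_eq!(dst, b"[]");
/// ```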
/// /// For the braced format, the format supported is `${ref}` where `ref` can /// be any sequence of bytes except for `}`. If no closing brace occurs, /// then it is not considered a capture reference. As with the unbraced /// format, if `ref` matches `^[0-9]+$`, then it is treated as a capture /// group index and not a name. /// /// The braced format is useful for exerting precise control over the name /// of the capture reference. For example, `${1}a` corresponds to the /// capture group reference `1` followed by the letter `a`, whereas `$1a` /// (as mentioned above) corresponds to the capture group reference `1a`. /// The braced format is also useful for expressing capture group names /// that use characters not supported by the unbraced format. For example, /// `${foo[bar].baz}` refers to the capture group named `foo[bar].baz`. /// /// If a capture group reference is found and it does not refer to a valid /// capture group, then it will be replaced with the empty string. /// /// To write a literal `$`, use `$$`. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new( /// r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})", /// ).unwrap(); /// let hay = b"On 14-03-2010, I became a Tennessee lamb."; /// let caps = re.captures(hay).unwrap(); /// /// let mut dst = vec![]; /// caps.expand(b"year=$year, month=$month, day=$day", &mut dst); /// assert_eq!(dst, b"year=2010, month=03, day=14"); /// ``` #[inline] pub fn expand(&self, replacement: &[u8], dst: &mut Vec<u8>) { self.caps.interpolate_bytes_into(self.haystack, replacement, dst); } /// Returns an iterator over all capture groups. This includes both /// matching and non-matching groups. /// /// The iterator always yields at least one matching group: the first group /// (at index `0`) with no name. Subsequent groups are returned in the order /// of their opening parenthesis in the regex. /// /// The elements yielded have type `Option<Match<'h>>`, where a non-`None` /// value is present if the capture group matches. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(\w)(\d)?(\w)").unwrap(); /// let caps = re.captures(b"AZ").unwrap(); /// /// let mut it = caps.iter(); /// assert_eq!(it.next().unwrap().map(|m| m.as_bytes()), Some(&b"AZ"[..])); /// assert_eq!(it.next().unwrap().map(|m| m.as_bytes()), Some(&b"A"[..])); /// assert_eq!(it.next().unwrap().map(|m| m.as_bytes()), None); /// assert_eq!(it.next().unwrap().map(|m| m.as_bytes()), Some(&b"Z"[..])); /// assert_eq!(it.next(), None); /// ``` #[inline] pub fn iter<'c>(&'c self) -> SubCaptureMatches<'c, 'h> { SubCaptureMatches { haystack: self.haystack, it: self.caps.iter() } } /// Returns the total number of capture groups. This includes both /// matching and non-matching groups. /// /// The length returned is always equivalent to the number of elements /// yielded by [`Captures::iter`]. Consequently, the length is always /// greater than zero since every `Captures` value always includes the /// match for the entire regex. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(\w)(\d)?(\w)").unwrap(); /// let caps = re.captures(b"AZ").unwrap(); /// assert_eq!(caps.len(), 4); /// ``` #[inline] pub fn len(&self) -> usize { self.caps.group_len() } } impl<'h> core::fmt::Debug for Captures<'h> { fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { /// A little helper type to provide a nice map-like debug /// representation for our capturing group spans.
/// /// regex-automata has something similar, but it includes the pattern /// ID in its debug output, which is confusing. It also doesn't include /// the strings that match because a regex-automata `Captures` doesn't /// borrow the haystack. struct CapturesDebugMap<'a> { caps: &'a Captures<'a>, } impl<'a> core::fmt::Debug for CapturesDebugMap<'a> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { let mut map = f.debug_map(); let names = self.caps.caps.group_info().pattern_names(PatternID::ZERO); for (group_index, maybe_name) in names.enumerate() { let key = Key(group_index, maybe_name); match self.caps.get(group_index) { None => map.entry(&key, &None::<()>), Some(mat) => map.entry(&key, &Value(mat)), }; } map.finish() } } struct Key<'a>(usize, Option<&'a str>); impl<'a> core::fmt::Debug for Key<'a> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { write!(f, "{}", self.0)?; if let Some(name) = self.1 { write!(f, "/{name:?}")?; } Ok(()) } } struct Value<'a>(Match<'a>); impl<'a> core::fmt::Debug for Value<'a> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { use regex_automata::util::escape::DebugHaystack; write!( f, "{}..{}/{:?}", self.0.start(), self.0.end(), DebugHaystack(self.0.as_bytes()) ) } } f.debug_tuple("Captures") .field(&CapturesDebugMap { caps: self }) .finish() } } /// Get a matching capture group's haystack substring by index. /// /// The haystack substring returned can't outlive the `Captures` object if this /// method is used, because of how `Index` is defined (normally `a[i]` is part /// of `a` and can't outlive it). To work around this limitation, use /// [`Captures::get`] instead. /// /// `'h` is the lifetime of the matched haystack, but the lifetime of the /// `&[u8]` returned by this implementation is the lifetime of the `Captures` /// value itself. /// /// # Panics /// /// If there is no matching group at the given index. impl<'h> core::ops::Index<usize> for Captures<'h> { type Output = [u8]; // The lifetime is written out to make it clear that the &[u8] returned // does NOT have a lifetime equivalent to 'h. fn index<'a>(&'a self, i: usize) -> &'a [u8] { self.get(i) .map(|m| m.as_bytes()) .unwrap_or_else(|| panic!("no group at index '{i}'")) } } /// Get a matching capture group's haystack substring by name. /// /// The haystack substring returned can't outlive the `Captures` object if this /// method is used, because of how `Index` is defined (normally `a[i]` is part /// of `a` and can't outlive it). To work around this limitation, use /// [`Captures::name`] instead. /// /// `'h` is the lifetime of the matched haystack, but the lifetime of the /// `&[u8]` returned by this implementation is the lifetime of the `Captures` /// value itself. /// /// `'n` is the lifetime of the group name used to index the `Captures` value. /// /// # Panics /// /// If there is no matching group with the given name. impl<'h, 'n> core::ops::Index<&'n str> for Captures<'h> { type Output = [u8]; fn index<'a>(&'a self, name: &'n str) -> &'a [u8] { self.name(name) .map(|m| m.as_bytes()) .unwrap_or_else(|| panic!("no group named '{name}'")) } } /// A low level representation of the byte offsets of each capture group. /// /// You can think of this as a lower level [`Captures`], where this type does /// not support named capturing groups directly and it does not borrow the /// haystack that these offsets were matched on.
/// /// Primarily, this type is useful when using the lower level `Regex` APIs such /// as [`Regex::captures_read`], which permits amortizing the allocation in /// which capture match offsets are stored. /// /// In order to build a value of this type, you'll need to call the /// [`Regex::capture_locations`] method. The value returned can then be reused /// in subsequent searches for that regex. Using it for other regexes may /// result in a panic or otherwise incorrect results. /// /// # Example /// /// This example shows how to create and use `CaptureLocations` in a search. /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<last>\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// let m = re.captures_read(&mut locs, b"Bruce Springsteen").unwrap(); /// assert_eq!(0..17, m.range()); /// assert_eq!(Some((0, 17)), locs.get(0)); /// assert_eq!(Some((0, 5)), locs.get(1)); /// assert_eq!(Some((6, 17)), locs.get(2)); /// /// // Asking for an invalid capture group always returns None. /// assert_eq!(None, locs.get(3)); /// # // literals are too big for 32-bit usize: #1041 /// # #[cfg(target_pointer_width = "64")] /// assert_eq!(None, locs.get(34973498648)); /// # #[cfg(target_pointer_width = "64")] /// assert_eq!(None, locs.get(9944060567225171988)); /// ``` #[derive(Clone, Debug)] pub struct CaptureLocations(captures::Captures); /// A type alias for `CaptureLocations` for backwards compatibility. /// /// Previously, we exported `CaptureLocations` as `Locations` in an /// undocumented API. To prevent breaking that code (e.g., in `regex-capi`), /// we continue re-exporting the same undocumented API. #[doc(hidden)] pub type Locations = CaptureLocations; impl CaptureLocations { /// Returns the start and end byte offsets of the capture group at index /// `i`. This returns `None` if `i` is not a valid capture group or if the /// capture group did not match. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<last>\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// re.captures_read(&mut locs, b"Bruce Springsteen").unwrap(); /// assert_eq!(Some((0, 17)), locs.get(0)); /// assert_eq!(Some((0, 5)), locs.get(1)); /// assert_eq!(Some((6, 17)), locs.get(2)); /// ``` #[inline] pub fn get(&self, i: usize) -> Option<(usize, usize)> { self.0.get_group(i).map(|sp| (sp.start, sp.end)) } /// Returns the total number of capture groups (even if they didn't match). /// That is, the length returned is unaffected by the result of a search. /// /// This is always at least `1` since every regex has at least `1` /// capturing group that corresponds to the entire match. /// /// # Example /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<last>\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// assert_eq!(3, locs.len()); /// re.captures_read(&mut locs, b"Bruce Springsteen").unwrap(); /// assert_eq!(3, locs.len()); /// ``` /// /// Notice that the length is always at least `1`, regardless of the regex: /// /// ``` /// use regex::bytes::Regex; /// /// let re = Regex::new(r"").unwrap(); /// let locs = re.capture_locations(); /// assert_eq!(1, locs.len()); /// /// // [a&&b] is a regex that never matches anything. 
/// let re = Regex::new(r"[a&&b]").unwrap(); /// let locs = re.capture_locations(); /// assert_eq!(1, locs.len()); /// ``` #[inline] pub fn len(&self) -> usize { // self.0.group_len() returns 0 if the underlying captures doesn't // represent a match, but the behavior guaranteed for this method is // that the length doesn't change based on a match or not. self.0.group_info().group_len(PatternID::ZERO) } /// An alias for the `get` method for backwards compatibility. /// /// Previously, we exported `get` as `pos` in an undocumented API. To /// prevent breaking that code (e.g., in `regex-capi`), we continue /// re-exporting the same undocumented API. #[doc(hidden)] #[inline] pub fn pos(&self, i: usize) -> Option<(usize, usize)> { self.get(i) } } /// An iterator over all non-overlapping matches in a haystack. /// /// This iterator yields [`Match`] values. The iterator stops when no more /// matches can be found. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the haystack. /// /// This iterator is created by [`Regex::find_iter`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. #[derive(Debug)] pub struct Matches<'r, 'h> { haystack: &'h [u8], it: meta::FindMatches<'r, 'h>, } impl<'r, 'h> Iterator for Matches<'r, 'h> { type Item = Match<'h>; #[inline] fn next(&mut self) -> Option<Match<'h>> { self.it .next() .map(|sp| Match::new(self.haystack, sp.start(), sp.end())) } #[inline] fn count(self) -> usize { // This can actually be up to 2x faster than calling `next()` until // completion, because counting matches when using a DFA only requires // finding the end of each match. But returning a `Match` via `next()` // requires the start of each match which, with a DFA, requires a // separate reverse scan to find it. self.it.count() } } impl<'r, 'h> core::iter::FusedIterator for Matches<'r, 'h> {} /// An iterator over all non-overlapping capture matches in a haystack. /// /// This iterator yields [`Captures`] values. The iterator stops when no more /// matches can be found. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the matched string. /// /// This iterator is created by [`Regex::captures_iter`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. #[derive(Debug)] pub struct CaptureMatches<'r, 'h> { haystack: &'h [u8], it: meta::CapturesMatches<'r, 'h>, } impl<'r, 'h> Iterator for CaptureMatches<'r, 'h> { type Item = Captures<'h>; #[inline] fn next(&mut self) -> Option<Captures<'h>> { let static_captures_len = self.it.regex().static_captures_len(); self.it.next().map(|caps| Captures { haystack: self.haystack, caps, static_captures_len, }) } #[inline] fn count(self) -> usize { // This can actually be up to 2x faster than calling `next()` until // completion, because counting matches when using a DFA only requires // finding the end of each match. But returning a `Match` via `next()` // requires the start of each match which, with a DFA, requires a // separate reverse scan to find it.
self.it.count() } } impl<'r, 'h> core::iter::FusedIterator for CaptureMatches<'r, 'h> {} /// An iterator over all substrings delimited by a regex match. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the byte string being split. /// /// This iterator is created by [`Regex::split`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. #[derive(Debug)] pub struct Split<'r, 'h> { haystack: &'h [u8], it: meta::Split<'r, 'h>, } impl<'r, 'h> Iterator for Split<'r, 'h> { type Item = &'h [u8]; #[inline] fn next(&mut self) -> Option<&'h [u8]> { self.it.next().map(|span| &self.haystack[span]) } } impl<'r, 'h> core::iter::FusedIterator for Split<'r, 'h> {} /// An iterator over at most `N` substrings delimited by a regex match. /// /// The last substring yielded by this iterator will be whatever remains after /// `N-1` splits. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the byte string being split. /// /// This iterator is created by [`Regex::splitn`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. /// /// Although note that the worst case time here has an upper bound given /// by the `limit` parameter to [`Regex::splitn`]. #[derive(Debug)] pub struct SplitN<'r, 'h> { haystack: &'h [u8], it: meta::SplitN<'r, 'h>, } impl<'r, 'h> Iterator for SplitN<'r, 'h> { type Item = &'h [u8]; #[inline] fn next(&mut self) -> Option<&'h [u8]> { self.it.next().map(|span| &self.haystack[span]) } #[inline] fn size_hint(&self) -> (usize, Option<usize>) { self.it.size_hint() } } impl<'r, 'h> core::iter::FusedIterator for SplitN<'r, 'h> {} /// An iterator over the names of all capture groups in a regex. /// /// This iterator yields values of type `Option<&str>` in order of the opening /// capture group parenthesis in the regex pattern. `None` is yielded for /// groups with no name. The first element always corresponds to the implicit /// and unnamed group for the overall match. /// /// `'r` is the lifetime of the compiled regular expression. /// /// This iterator is created by [`Regex::capture_names`]. #[derive(Clone, Debug)] pub struct CaptureNames<'r>(captures::GroupInfoPatternNames<'r>); impl<'r> Iterator for CaptureNames<'r> { type Item = Option<&'r str>; #[inline] fn next(&mut self) -> Option<Option<&'r str>> { self.0.next() } #[inline] fn size_hint(&self) -> (usize, Option<usize>) { self.0.size_hint() } #[inline] fn count(self) -> usize { self.0.count() } } impl<'r> ExactSizeIterator for CaptureNames<'r> {} impl<'r> core::iter::FusedIterator for CaptureNames<'r> {} /// An iterator over all group matches in a [`Captures`] value. /// /// This iterator yields values of type `Option<Match<'h>>`, where `'h` is the /// lifetime of the haystack that the matches are for. The order of elements /// yielded corresponds to the order of the opening parenthesis for the group /// in the regex pattern. `None` is yielded for groups that did not participate /// in the match. /// /// The first element always corresponds to the implicit group for the overall /// match. 
Since this iterator is created by a [`Captures`] value, and a /// `Captures` value is only created when a match occurs, it follows that the /// first element yielded by this iterator is guaranteed to be non-`None`. /// /// The lifetime `'c` corresponds to the lifetime of the `Captures` value that /// created this iterator, and the lifetime `'h` corresponds to the originally /// matched haystack. #[derive(Clone, Debug)] pub struct SubCaptureMatches<'c, 'h> { haystack: &'h [u8], it: captures::CapturesPatternIter<'c>, } impl<'c, 'h> Iterator for SubCaptureMatches<'c, 'h> { type Item = Option<Match<'h>>; #[inline] fn next(&mut self) -> Option<Option<Match<'h>>> { self.it.next().map(|group| { group.map(|sp| Match::new(self.haystack, sp.start, sp.end)) }) } #[inline] fn size_hint(&self) -> (usize, Option<usize>) { self.it.size_hint() } #[inline] fn count(self) -> usize { self.it.count() } } impl<'c, 'h> ExactSizeIterator for SubCaptureMatches<'c, 'h> {} impl<'c, 'h> core::iter::FusedIterator for SubCaptureMatches<'c, 'h> {} /// A trait for types that can be used to replace matches in a haystack. /// /// In general, users of this crate shouldn't need to implement this trait, /// since implementations are already provided for `&[u8]` along with other /// variants of byte string types, as well as `FnMut(&Captures) -> Vec<u8>` (or /// any `FnMut(&Captures) -> T` where `T: AsRef<[u8]>`). Those cover most use /// cases, but callers can implement this trait directly if necessary. /// /// # Example /// /// This example shows a basic implementation of the `Replacer` trait. This can /// be done much more simply using the replacement byte string interpolation /// support (e.g., `$first $last`), but this approach avoids needing to parse /// the replacement byte string at all. /// /// ``` /// use regex::bytes::{Captures, Regex, Replacer}; /// /// struct NameSwapper; /// /// impl Replacer for NameSwapper { /// fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { /// dst.extend_from_slice(&caps["first"]); /// dst.extend_from_slice(b" "); /// dst.extend_from_slice(&caps["last"]); /// } /// } /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(?<first>\S+)").unwrap(); /// let result = re.replace(b"Springsteen, Bruce", NameSwapper); /// assert_eq!(result, &b"Bruce Springsteen"[..]); /// ``` pub trait Replacer { /// Appends possibly empty data to `dst` to replace the current match. /// /// The current match is represented by `caps`, which is guaranteed to have /// a match at capture group `0`. /// /// For example, a no-op replacement would be /// `dst.extend_from_slice(&caps[0])`. fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>); /// Return a fixed unchanging replacement byte string. /// /// When doing replacements, if access to [`Captures`] is not needed (e.g., /// the replacement byte string does not need `$` expansion), then it can /// be beneficial to avoid finding sub-captures. /// /// In general, this is called once for every call to a replacement routine /// such as [`Regex::replace_all`]. fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, [u8]>> { None } /// Returns a type that implements `Replacer`, but that borrows and wraps /// this `Replacer`. /// /// This is useful when you want to take a generic `Replacer` (which might /// not be cloneable) and use it without consuming it, so it can be used /// more than once. 
/// /// # Example /// /// ``` /// use regex::bytes::{Regex, Replacer}; /// /// fn replace_all_twice<R: Replacer>( /// re: Regex, /// src: &[u8], /// mut rep: R, /// ) -> Vec<u8> { /// let dst = re.replace_all(src, rep.by_ref()); /// let dst = re.replace_all(&dst, rep.by_ref()); /// dst.into_owned() /// } /// ``` fn by_ref<'r>(&'r mut self) -> ReplacerRef<'r, Self> { ReplacerRef(self) } } impl<'a, const N: usize> Replacer for &'a [u8; N] { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(&**self, dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl<const N: usize> Replacer for [u8; N] { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(&*self, dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl<'a> Replacer for &'a [u8] { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(*self, dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl<'a> Replacer for &'a Vec<u8> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(*self, dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl Replacer for Vec<u8> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(self, dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl<'a> Replacer for Cow<'a, [u8]> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(self.as_ref(), dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl<'a> Replacer for &'a Cow<'a, [u8]> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { caps.expand(self.as_ref(), dst); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { no_expansion(self) } } impl<F, T> Replacer for F where F: FnMut(&Captures<'_>) -> T, T: AsRef<[u8]>, { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { dst.extend_from_slice((*self)(caps).as_ref()); } } /// A by-reference adaptor for a [`Replacer`]. /// /// This permits reusing the same `Replacer` value in multiple calls to a /// replacement routine like [`Regex::replace_all`]. /// /// This type is created by [`Replacer::by_ref`]. #[derive(Debug)] pub struct ReplacerRef<'a, R: ?Sized>(&'a mut R); impl<'a, R: Replacer + ?Sized + 'a> Replacer for ReplacerRef<'a, R> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut Vec<u8>) { self.0.replace_append(caps, dst) } fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, [u8]>> { self.0.no_expansion() } } /// A helper type for forcing literal string replacement. /// /// It can be used with routines like [`Regex::replace`] and /// [`Regex::replace_all`] to do a literal string replacement without expanding /// `$name` to their corresponding capture groups. This can be both convenient /// (to avoid escaping `$`, for example) and faster (since capture groups /// don't need to be found). /// /// `'s` is the lifetime of the literal string to use. 
/// /// # Example /// /// ``` /// use regex::bytes::{NoExpand, Regex}; /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(\S+)").unwrap(); /// let result = re.replace(b"Springsteen, Bruce", NoExpand(b"$2 $last")); /// assert_eq!(result, &b"$2 $last"[..]); /// ``` #[derive(Clone, Debug)] pub struct NoExpand<'s>(pub &'s [u8]); impl<'s> Replacer for NoExpand<'s> { fn replace_append(&mut self, _: &Captures<'_>, dst: &mut Vec<u8>) { dst.extend_from_slice(self.0); } fn no_expansion(&mut self) -> Option<Cow<'_, [u8]>> { Some(Cow::Borrowed(self.0)) } } /// Quickly checks the given replacement string for whether interpolation /// should be done on it. It returns `None` if a `$` was found anywhere in the /// given string, which suggests interpolation needs to be done. But if there's /// no `$` anywhere, then interpolation definitely does not need to be done. In /// that case, the given string is returned as a borrowed `Cow`. /// /// This is meant to be used to implement the `Replacer::no_expansion` method /// in its various trait impls. fn no_expansion<T: AsRef<[u8]>>(replacement: &T) -> Option<Cow<'_, [u8]>> { let replacement = replacement.as_ref(); match crate::find_byte::find_byte(b'$', replacement) { Some(_) => None, None => Some(Cow::Borrowed(replacement)), } } #[cfg(test)] mod tests { use super::*; use alloc::format; #[test] fn test_match_properties() { let haystack = b"Hello, world!"; let m = Match::new(haystack, 7, 12); assert_eq!(m.start(), 7); assert_eq!(m.end(), 12); assert_eq!(m.is_empty(), false); assert_eq!(m.len(), 5); assert_eq!(m.as_bytes(), b"world"); } #[test] fn test_empty_match() { let haystack = b""; let m = Match::new(haystack, 0, 0); assert_eq!(m.is_empty(), true); assert_eq!(m.len(), 0); } #[test] fn test_debug_output_valid_utf8() { let haystack = b"Hello, world!"; let m = Match::new(haystack, 7, 12); let debug_str = format!("{m:?}"); assert_eq!( debug_str, r#"Match { start: 7, end: 12, bytes: "world" }"# ); } #[test] fn test_debug_output_invalid_utf8() { let haystack = b"Hello, \xFFworld!"; let m = Match::new(haystack, 7, 13); let debug_str = format!("{m:?}"); assert_eq!( debug_str, r#"Match { start: 7, end: 13, bytes: "\xffworld" }"# ); } #[test] fn test_debug_output_various_unicode() { let haystack = "Hello, ๐Ÿ˜Š world! ์•ˆ๋…•ํ•˜์„ธ์š”? ู…ุฑุญุจุง ุจุงู„ุนุงู„ู…!".as_bytes(); let m = Match::new(haystack, 0, haystack.len()); let debug_str = format!("{m:?}"); assert_eq!( debug_str, r#"Match { start: 0, end: 62, bytes: "Hello, ๐Ÿ˜Š world! ์•ˆ๋…•ํ•˜์„ธ์š”? ู…ุฑุญุจุง ุจุงู„ุนุงู„ู…!" }"# ); } #[test] fn test_debug_output_ascii_escape() { let haystack = b"Hello,\tworld!\nThis is a \x1b[31mtest\x1b[0m."; let m = Match::new(haystack, 0, haystack.len()); let debug_str = format!("{m:?}"); assert_eq!( debug_str, r#"Match { start: 0, end: 38, bytes: "Hello,\tworld!\nThis is a \u{1b}[31mtest\u{1b}[0m." 
}"# ); } #[test] fn test_debug_output_match_in_middle() { let haystack = b"The quick brown fox jumps over the lazy dog."; let m = Match::new(haystack, 16, 19); let debug_str = format!("{m:?}"); assert_eq!(debug_str, r#"Match { start: 16, end: 19, bytes: "fox" }"#); } }
regex-1.12.2/src/regex/mod.rs
pub(crate) mod bytes; pub(crate) mod string;
regex-1.12.2/src/regex/string.rs
use alloc::{borrow::Cow, string::String, sync::Arc}; use regex_automata::{meta, util::captures, Input, PatternID}; use crate::{error::Error, RegexBuilder}; /// A compiled regular expression for searching Unicode haystacks. /// /// A `Regex` can be used to search haystacks, split haystacks into substrings /// or replace substrings in a haystack with a different substring. All /// searching is done with an implicit `(?s:.)*?` at the beginning and end of /// a pattern. To force an expression to match the whole string (or a prefix /// or a suffix), you must use an anchor like `^` or `$` (or `\A` and `\z`). /// /// While this crate will handle Unicode strings (whether in the regular /// expression or in the haystack), all positions returned are **byte /// offsets**. Every byte offset is guaranteed to be at a Unicode code point /// boundary. That is, all offsets returned by the `Regex` API are guaranteed /// to be ranges that can slice a `&str` without panicking. If you want to /// relax this requirement, then you must search `&[u8]` haystacks with a /// [`bytes::Regex`](crate::bytes::Regex). /// /// The only methods that allocate new strings are the string replacement /// methods. All other methods (searching and splitting) return borrowed /// references into the haystack given.
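/// /// For example, because every offset falls on a code point boundary, a
/// match's range is always safe to use as a slice index, even with
/// non-ASCII text in the haystack (a minimal illustration of the
/// guarantee described above):
///
/// ```
/// use regex::Regex;
///
/// let re = Regex::new(r"\w+").unwrap();
/// let hay = "snowman: โ˜ƒ yes";
/// let m = re.find(hay).unwrap();
/// // Slicing with the match's range never panics.
/// assert_eq!(&hay[m.range()], "snowman");
/// ```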
/// /// # Example /// /// Find the offsets of a US phone number: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new("[0-9]{3}-[0-9]{3}-[0-9]{4}").unwrap(); /// let m = re.find("phone: 111-222-3333").unwrap(); /// assert_eq!(7..19, m.range()); /// ``` /// /// # Example: extracting capture groups /// /// A common way to use regexes is with capture groups. That is, instead of /// just looking for matches of an entire regex, parentheses are used to create /// groups that represent part of the match. /// /// For example, consider a haystack with multiple lines, and each line has /// three whitespace delimited fields where the second field is expected to be /// a number and the third field a boolean. To make this convenient, we use /// the [`Captures::extract`] API to put the strings that match each group /// into a fixed size array: /// /// ``` /// use regex::Regex; /// /// let hay = " /// rabbit 54 true /// groundhog 2 true /// does not match /// fox 109 false /// "; /// let re = Regex::new(r"(?m)^\s*(\S+)\s+([0-9]+)\s+(true|false)\s*$").unwrap(); /// let mut fields: Vec<(&str, i64, bool)> = vec![]; /// for (_, [f1, f2, f3]) in re.captures_iter(hay).map(|caps| caps.extract()) { /// fields.push((f1, f2.parse()?, f3.parse()?)); /// } /// assert_eq!(fields, vec![ /// ("rabbit", 54, true), /// ("groundhog", 2, true), /// ("fox", 109, false), /// ]); /// /// # Ok::<(), Box<dyn std::error::Error>>(()) /// ``` /// /// # Example: searching with the `Pattern` trait /// /// **Note**: This section requires that this crate is compiled with the /// `pattern` Cargo feature enabled, which **requires nightly Rust**. /// /// Since `Regex` implements `Pattern` from the standard library, one can /// use regexes with methods defined on `&str`. For example, `is_match`, /// `find`, `find_iter` and `split` can, in some cases, be replaced with /// `str::contains`, `str::find`, `str::match_indices` and `str::split`. /// /// Here are some examples: /// /// ```ignore /// use regex::Regex; /// /// let re = Regex::new(r"\d+").unwrap(); /// let hay = "a111b222c"; /// /// assert!(hay.contains(&re)); /// assert_eq!(hay.find(&re), Some(1)); /// assert_eq!(hay.match_indices(&re).collect::<Vec<_>>(), vec![ /// (1, "111"), /// (5, "222"), /// ]); /// assert_eq!(hay.split(&re).collect::<Vec<_>>(), vec!["a", "b", "c"]); /// ``` #[derive(Clone)] pub struct Regex { pub(crate) meta: meta::Regex, pub(crate) pattern: Arc<str>, } impl core::fmt::Display for Regex { /// Shows the original regular expression. fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { write!(f, "{}", self.as_str()) } } impl core::fmt::Debug for Regex { /// Shows the original regular expression. fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { f.debug_tuple("Regex").field(&self.as_str()).finish() } } impl core::str::FromStr for Regex { type Err = Error; /// Attempts to parse a string into a regular expression fn from_str(s: &str) -> Result<Regex, Error> { Regex::new(s) } } impl TryFrom<&str> for Regex { type Error = Error; /// Attempts to parse a string into a regular expression fn try_from(s: &str) -> Result<Regex, Error> { Regex::new(s) } } impl TryFrom<String> for Regex { type Error = Error; /// Attempts to parse a string into a regular expression fn try_from(s: String) -> Result<Regex, Error> { Regex::new(&s) } } /// Core regular expression methods. impl Regex { /// Compiles a regular expression. Once compiled, it can be used repeatedly /// to search, split or replace substrings in a haystack. 
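/// /// Since compilation is comparatively expensive (see the note below), a
/// common pattern is to compile a regex once and reuse the value
/// everywhere. A minimal sketch of that pattern, assuming
/// `std::sync::LazyLock` from the standard library (Rust 1.80+) is
/// available:
///
/// ```
/// use std::sync::LazyLock;
/// use regex::Regex;
///
/// // Compiled at most once, on first use, and reused thereafter.
/// static WORD: LazyLock<Regex> =
///     LazyLock::new(|| Regex::new(r"\w+").unwrap());
///
/// assert!(WORD.is_match("hello"));
/// ```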
/// /// Note that regex compilation tends to be a somewhat expensive process, /// and unlike higher level environments, compilation is not automatically /// cached for you. One should endeavor to compile a regex once and then /// reuse it. For example, it's a bad idea to compile the same regex /// repeatedly in a loop. /// /// # Errors /// /// If an invalid pattern is given, then an error is returned. /// An error is also returned if the pattern is valid, but would /// produce a regex that is bigger than the configured size limit via /// [`RegexBuilder::size_limit`]. (A reasonable size limit is enabled by /// default.) /// /// # Example /// /// ``` /// use regex::Regex; /// /// // An invalid pattern because of an unclosed parenthesis /// assert!(Regex::new(r"foo(bar").is_err()); /// // An invalid pattern because the regex would be too big /// // because Unicode tends to inflate things. /// assert!(Regex::new(r"\w{1000}").is_err()); /// // Disabling Unicode can make the regex much smaller, /// // potentially by up to or more than an order of magnitude. /// assert!(Regex::new(r"(?-u:\w){1000}").is_ok()); /// ``` pub fn new(re: &str) -> Result<Regex, Error> { RegexBuilder::new(re).build() } /// Returns true if and only if there is a match for the regex anywhere /// in the haystack given. /// /// It is recommended to use this method if all you need to do is test /// whether a match exists, since the underlying matching engine may be /// able to do less work. /// /// # Example /// /// Test if some haystack contains at least one word with exactly 13 /// Unicode word characters: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\b\w{13}\b").unwrap(); /// let hay = "I categorically deny having triskaidekaphobia."; /// assert!(re.is_match(hay)); /// ``` #[inline] pub fn is_match(&self, haystack: &str) -> bool { self.is_match_at(haystack, 0) } /// This routine searches for the first match of this regex in the /// haystack given, and if found, returns a [`Match`]. The `Match` /// provides access to both the byte offsets of the match and the actual /// substring that matched. /// /// Note that this should only be used if you want to find the entire /// match. If instead you just want to test the existence of a match, /// it's potentially faster to use `Regex::is_match(hay)` instead of /// `Regex::find(hay).is_some()`. /// /// # Example /// /// Find the first word with exactly 13 Unicode word characters: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\b\w{13}\b").unwrap(); /// let hay = "I categorically deny having triskaidekaphobia."; /// let mat = re.find(hay).unwrap(); /// assert_eq!(2..15, mat.range()); /// assert_eq!("categorically", mat.as_str()); /// ``` #[inline] pub fn find<'h>(&self, haystack: &'h str) -> Option<Match<'h>> { self.find_at(haystack, 0) } /// Returns an iterator that yields successive non-overlapping matches in /// the given haystack. The iterator yields values of type [`Match`]. /// /// # Time complexity /// /// Note that since `find_iter` runs potentially many searches on the /// haystack and since each search has worst case `O(m * n)` time /// complexity, the overall worst case time complexity for iteration is /// `O(m * n^2)`.
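/// /// If only the number of matches is needed, note that
/// `re.find_iter(hay).count()` can be cheaper than driving the iterator
/// manually, since counting only requires finding the end of each match
/// rather than materializing each `Match`. A small sketch, reusing the
/// 13-character words from the example below:
///
/// ```
/// use regex::Regex;
///
/// let re = Regex::new(r"\b\w{13}\b").unwrap();
/// let hay = "Retroactively relinquishing remunerations";
/// assert_eq!(re.find_iter(hay).count(), 3);
/// ```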
/// /// # Example /// /// Find every word with exactly 13 Unicode word characters: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\b\w{13}\b").unwrap(); /// let hay = "Retroactively relinquishing remunerations is reprehensible."; /// let matches: Vec<_> = re.find_iter(hay).map(|m| m.as_str()).collect(); /// assert_eq!(matches, vec![ /// "Retroactively", /// "relinquishing", /// "remunerations", /// "reprehensible", /// ]); /// ``` #[inline] pub fn find_iter<'r, 'h>(&'r self, haystack: &'h str) -> Matches<'r, 'h> { Matches { haystack, it: self.meta.find_iter(haystack) } } /// This routine searches for the first match of this regex in the haystack /// given, and if found, returns not only the overall match but also the /// matches of each capture group in the regex. If no match is found, then /// `None` is returned. /// /// Capture group `0` always corresponds to an implicit unnamed group that /// includes the entire match. If a match is found, this group is always /// present. Subsequent groups may be named and are numbered, starting /// at 1, by the order in which the opening parenthesis appears in the /// pattern. For example, in the pattern `(?<a>.(?<b>.))(?<c>.)`, `a`, /// `b` and `c` correspond to capture group indices `1`, `2` and `3`, /// respectively. /// /// You should only use `captures` if you need access to the capture group /// matches. Otherwise, [`Regex::find`] is generally faster for discovering /// just the overall match. /// /// # Example /// /// Say you have some haystack with movie names and their release years, /// like "'Citizen Kane' (1941)". It'd be nice if we could search for /// substrings looking like that, while also extracting the movie name and /// its release year separately. The example below shows how to do that. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"'([^']+)'\s+\((\d{4})\)").unwrap(); /// let hay = "Not my favorite movie: 'Citizen Kane' (1941)."; /// let caps = re.captures(hay).unwrap(); /// assert_eq!(caps.get(0).unwrap().as_str(), "'Citizen Kane' (1941)"); /// assert_eq!(caps.get(1).unwrap().as_str(), "Citizen Kane"); /// assert_eq!(caps.get(2).unwrap().as_str(), "1941"); /// // You can also access the groups by index using the Index notation. /// // Note that this will panic on an invalid index. In this case, these /// // accesses are always correct because the overall regex will only /// // match when these capture groups match. /// assert_eq!(&caps[0], "'Citizen Kane' (1941)"); /// assert_eq!(&caps[1], "Citizen Kane"); /// assert_eq!(&caps[2], "1941"); /// ``` /// /// Note that the full match is at capture group `0`. Each subsequent /// capture group is indexed by the order of its opening `(`. /// /// We can make this example a bit clearer by using *named* capture groups: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>\d{4})\)").unwrap(); /// let hay = "Not my favorite movie: 'Citizen Kane' (1941)."; /// let caps = re.captures(hay).unwrap(); /// assert_eq!(caps.get(0).unwrap().as_str(), "'Citizen Kane' (1941)"); /// assert_eq!(caps.name("title").unwrap().as_str(), "Citizen Kane"); /// assert_eq!(caps.name("year").unwrap().as_str(), "1941"); /// // You can also access the groups by name using the Index notation. /// // Note that this will panic on an invalid group name. In this case, /// // these accesses are always correct because the overall regex will /// // only match when these capture groups match. 
/// assert_eq!(&caps[0], "'Citizen Kane' (1941)"); /// assert_eq!(&caps["title"], "Citizen Kane"); /// assert_eq!(&caps["year"], "1941"); /// ``` /// /// Here we name the capture groups, which we can access with the `name` /// method or the `Index` notation with a `&str`. Note that the named /// capture groups are still accessible with `get` or the `Index` notation /// with a `usize`. /// /// The `0`th capture group is always unnamed, so it must always be /// accessed with `get(0)` or `[0]`. /// /// Finally, one other way to get the matched substrings is with the /// [`Captures::extract`] API: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"'([^']+)'\s+\((\d{4})\)").unwrap(); /// let hay = "Not my favorite movie: 'Citizen Kane' (1941)."; /// let (full, [title, year]) = re.captures(hay).unwrap().extract(); /// assert_eq!(full, "'Citizen Kane' (1941)"); /// assert_eq!(title, "Citizen Kane"); /// assert_eq!(year, "1941"); /// ``` #[inline] pub fn captures<'h>(&self, haystack: &'h str) -> Option<Captures<'h>> { self.captures_at(haystack, 0) } /// Returns an iterator that yields successive non-overlapping matches in /// the given haystack. The iterator yields values of type [`Captures`]. /// /// This is the same as [`Regex::find_iter`], but instead of only providing /// access to the overall match, each value yielded includes access to the /// matches of all capture groups in the regex. Reporting this extra match /// data is potentially costly, so callers should only use `captures_iter` /// over `find_iter` when they actually need access to the capture group /// matches. /// /// # Time complexity /// /// Note that since `captures_iter` runs potentially many searches on the /// haystack and since each search has worst case `O(m * n)` time /// complexity, the overall worst case time complexity for iteration is /// `O(m * n^2)`. /// /// # Example /// /// We can use this to find all movie titles and their release years in /// some haystack, where the movie is formatted like "'Title' (xxxx)": /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"'([^']+)'\s+\(([0-9]{4})\)").unwrap(); /// let hay = "'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931)."; /// let mut movies = vec![]; /// for (_, [title, year]) in re.captures_iter(hay).map(|c| c.extract()) { /// movies.push((title, year.parse::<i64>()?)); /// } /// assert_eq!(movies, vec![ /// ("Citizen Kane", 1941), /// ("The Wizard of Oz", 1939), /// ("M", 1931), /// ]); /// # Ok::<(), Box<dyn std::error::Error>>(()) /// ``` /// /// Or with named groups: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>[0-9]{4})\)").unwrap(); /// let hay = "'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931)."; /// let mut it = re.captures_iter(hay); /// /// let caps = it.next().unwrap(); /// assert_eq!(&caps["title"], "Citizen Kane"); /// assert_eq!(&caps["year"], "1941"); /// /// let caps = it.next().unwrap(); /// assert_eq!(&caps["title"], "The Wizard of Oz"); /// assert_eq!(&caps["year"], "1939"); /// /// let caps = it.next().unwrap(); /// assert_eq!(&caps["title"], "M"); /// assert_eq!(&caps["year"], "1931"); /// ``` #[inline] pub fn captures_iter<'r, 'h>( &'r self, haystack: &'h str, ) -> CaptureMatches<'r, 'h> { CaptureMatches { haystack, it: self.meta.captures_iter(haystack) } } /// Returns an iterator of substrings of the haystack given, delimited by a /// match of the regex.
Namely, each element of the iterator corresponds to /// a part of the haystack that *isn't* matched by the regular expression. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// # Example /// /// To split a string delimited by arbitrary amounts of spaces or tabs: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"[ \t]+").unwrap(); /// let hay = "a b \t c\td e"; /// let fields: Vec<&str> = re.split(hay).collect(); /// assert_eq!(fields, vec!["a", "b", "c", "d", "e"]); /// ``` /// /// # Example: more cases /// /// Basic usage: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r" ").unwrap(); /// let hay = "Mary had a little lamb"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["Mary", "had", "a", "little", "lamb"]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = ""; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec![""]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = "lionXXtigerXleopard"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["lion", "", "tiger", "leopard"]); /// /// let re = Regex::new(r"::").unwrap(); /// let hay = "lion::tiger::leopard"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["lion", "tiger", "leopard"]); /// ``` /// /// If a haystack contains multiple contiguous matches, you will end up /// with empty spans yielded by the iterator: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"X").unwrap(); /// let hay = "XXXXaXXbXc"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["", "", "", "", "a", "", "b", "c"]); /// /// let re = Regex::new(r"/").unwrap(); /// let hay = "(///)"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["(", "", "", ")"]); /// ``` /// /// Separators at the start or end of a haystack are neighbored by empty /// substrings. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"0").unwrap(); /// let hay = "010"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["", "1", ""]); /// ``` /// /// When the empty string is used as a regex, it splits at every valid /// UTF-8 boundary by default (which includes the beginning and end of the /// haystack): /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"").unwrap(); /// let hay = "rust"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["", "r", "u", "s", "t", ""]); /// /// // Splitting by an empty string is UTF-8 aware by default! /// let re = Regex::new(r"").unwrap(); /// let hay = "โ˜ƒ"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["", "โ˜ƒ", ""]); /// ``` /// /// Contiguous separators (which commonly show up with whitespace) can lead /// to possibly surprising behavior. For example, this code is correct: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r" ").unwrap(); /// let hay = " a b c"; /// let got: Vec<&str> = re.split(hay).collect(); /// assert_eq!(got, vec!["", "", "", "", "a", "", "b", "c"]); /// ``` /// /// It does *not* give you `["a", "b", "c"]`.
For that behavior, you'd want /// to match contiguous space characters: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r" +").unwrap(); /// let hay = " a b c"; /// let got: Vec<&str> = re.split(hay).collect(); /// // N.B. This does still include a leading empty span because ' +' /// // matches at the beginning of the haystack. /// assert_eq!(got, vec!["", "a", "b", "c"]); /// ``` #[inline] pub fn split<'r, 'h>(&'r self, haystack: &'h str) -> Split<'r, 'h> { Split { haystack, it: self.meta.split(haystack) } } /// Returns an iterator of at most `limit` substrings of the haystack /// given, delimited by a match of the regex. (A `limit` of `0` will return /// no substrings.) Namely, each element of the iterator corresponds to a /// part of the haystack that *isn't* matched by the regular expression. /// The remainder of the haystack that is not split will be the last /// element in the iterator. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// Although note that the worst case time here has an upper bound given /// by the `limit` parameter. /// /// # Example /// /// Get the first two words in some haystack: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\W+").unwrap(); /// let hay = "Hey! How are you?"; /// let fields: Vec<&str> = re.splitn(hay, 3).collect(); /// assert_eq!(fields, vec!["Hey", "How", "are you?"]); /// ``` /// /// # Examples: more cases /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r" ").unwrap(); /// let hay = "Mary had a little lamb"; /// let got: Vec<&str> = re.splitn(hay, 3).collect(); /// assert_eq!(got, vec!["Mary", "had", "a little lamb"]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = ""; /// let got: Vec<&str> = re.splitn(hay, 3).collect(); /// assert_eq!(got, vec![""]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = "lionXXtigerXleopard"; /// let got: Vec<&str> = re.splitn(hay, 3).collect(); /// assert_eq!(got, vec!["lion", "", "tigerXleopard"]); /// /// let re = Regex::new(r"::").unwrap(); /// let hay = "lion::tiger::leopard"; /// let got: Vec<&str> = re.splitn(hay, 2).collect(); /// assert_eq!(got, vec!["lion", "tiger::leopard"]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = "abcXdef"; /// let got: Vec<&str> = re.splitn(hay, 1).collect(); /// assert_eq!(got, vec!["abcXdef"]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = "abcdef"; /// let got: Vec<&str> = re.splitn(hay, 2).collect(); /// assert_eq!(got, vec!["abcdef"]); /// /// let re = Regex::new(r"X").unwrap(); /// let hay = "abcXdef"; /// let got: Vec<&str> = re.splitn(hay, 0).collect(); /// assert!(got.is_empty()); /// ``` #[inline] pub fn splitn<'r, 'h>( &'r self, haystack: &'h str, limit: usize, ) -> SplitN<'r, 'h> { SplitN { haystack, it: self.meta.splitn(haystack, limit) } } /// Replaces the leftmost-first match in the given haystack with the /// replacement provided. The replacement can be a regular string (where /// `$N` and `$name` are expanded to match capture groups) or a function /// that takes a [`Captures`] and returns the replaced string. /// /// If no match is found, then the haystack is returned unchanged. In that /// case, this implementation will likely return a `Cow::Borrowed` value /// such that no allocation is performed.
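/// /// One way to observe this is to inspect the returned `Cow` directly (a
/// small sketch; as noted above, a borrow is only "likely", not part of
/// the API contract):
///
/// ```
/// use std::borrow::Cow;
/// use regex::Regex;
///
/// let re = Regex::new(r"[0-9]+").unwrap();
/// // No digits in the haystack, so no replacement is performed and the
/// // haystack is handed back as a borrow.
/// let result = re.replace("no digits here", "#");
/// assert!(matches!(result, Cow::Borrowed("no digits here")));
/// ```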
/// /// When a `Cow::Borrowed` is returned, the value returned is guaranteed /// to be equivalent to the `haystack` given. /// /// # Replacement string syntax /// /// All instances of `$ref` in the replacement string are replaced with /// the substring corresponding to the capture group identified by `ref`. /// /// `ref` may be an integer corresponding to the index of the capture group /// (counted by order of opening parenthesis where `0` is the entire match) /// or it can be a name (consisting of letters, digits or underscores) /// corresponding to a named capture group. /// /// If `ref` isn't a valid capture group (whether the name doesn't exist or /// isn't a valid index), then it is replaced with the empty string. /// /// The longest possible name is used. For example, `$1a` looks up the /// capture group named `1a` and not the capture group at index `1`. To /// exert more precise control over the name, use braces, e.g., `${1}a`. /// /// To write a literal `$` use `$$`. /// /// # Example /// /// Note that this function is polymorphic with respect to the replacement. /// In typical usage, this can just be a normal string: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"[^01]+").unwrap(); /// assert_eq!(re.replace("1078910", ""), "1010"); /// ``` /// /// But anything satisfying the [`Replacer`] trait will work. For example, /// a closure of type `|&Captures| -> String` provides direct access to the /// captures corresponding to a match. This allows one to access capturing /// group matches easily: /// /// ``` /// use regex::{Captures, Regex}; /// /// let re = Regex::new(r"([^,\s]+),\s+(\S+)").unwrap(); /// let result = re.replace("Springsteen, Bruce", |caps: &Captures| { /// format!("{} {}", &caps[2], &caps[1]) /// }); /// assert_eq!(result, "Bruce Springsteen"); /// ``` /// /// But this is a bit cumbersome to use all the time. Instead, a simple /// syntax is supported (as described above) that expands `$name` into the /// corresponding capture group. Here's the last example, but using this /// expansion technique with named capture groups: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(?<first>\S+)").unwrap(); /// let result = re.replace("Springsteen, Bruce", "$first $last"); /// assert_eq!(result, "Bruce Springsteen"); /// ``` /// /// Note that using `$2` instead of `$first` or `$1` instead of `$last` /// would produce the same result. To write a literal `$` use `$$`. /// /// Sometimes the replacement string requires use of curly braces to /// delineate a capture group replacement when it is adjacent to some other /// literal text. For example, if we wanted to join two words together with /// an underscore: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<second>\w+)").unwrap(); /// let result = re.replace("deep fried", "${first}_$second"); /// assert_eq!(result, "deep_fried"); /// ``` /// /// Without the curly braces, the capture group name `first_` would be /// used, and since it doesn't exist, it would be replaced with the empty /// string. /// /// Finally, sometimes you just want to replace a literal string with no /// regard for capturing group expansion. 
This can be done by wrapping a /// string with [`NoExpand`]: /// /// ``` /// use regex::{NoExpand, Regex}; /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(\S+)").unwrap(); /// let result = re.replace("Springsteen, Bruce", NoExpand("$2 $last")); /// assert_eq!(result, "$2 $last"); /// ``` /// /// Using `NoExpand` may also be faster, since the replacement string won't /// need to be parsed for the `$` syntax. #[inline] pub fn replace<'h, R: Replacer>( &self, haystack: &'h str, rep: R, ) -> Cow<'h, str> { self.replacen(haystack, 1, rep) } /// Replaces all non-overlapping matches in the haystack with the /// replacement provided. This is the same as calling `replacen` with /// `limit` set to `0`. /// /// If no match is found, then the haystack is returned unchanged. In that /// case, this implementation will likely return a `Cow::Borrowed` value /// such that no allocation is performed. /// /// When a `Cow::Borrowed` is returned, the value returned is guaranteed /// to be equivalent to the `haystack` given. /// /// The documentation for [`Regex::replace`] goes into more detail about /// what kinds of replacement strings are supported. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// # Fallibility /// /// If you need to write a replacement routine where any individual /// replacement might "fail," doing so with this API isn't really feasible /// because there's no way to stop the search process if a replacement /// fails. Instead, if you need this functionality, you should consider /// implementing your own replacement routine: /// /// ``` /// use regex::{Captures, Regex}; /// /// fn replace_all<E>( /// re: &Regex, /// haystack: &str, /// replacement: impl Fn(&Captures) -> Result<String, E>, /// ) -> Result<String, E> { /// let mut new = String::with_capacity(haystack.len()); /// let mut last_match = 0; /// for caps in re.captures_iter(haystack) { /// let m = caps.get(0).unwrap(); /// new.push_str(&haystack[last_match..m.start()]); /// new.push_str(&replacement(&caps)?); /// last_match = m.end(); /// } /// new.push_str(&haystack[last_match..]); /// Ok(new) /// } /// /// // Let's replace each word with the number of bytes in that word. /// // But if we see a word that is "too long," we'll give up.
/// let re = Regex::new(r"\w+").unwrap(); /// let replacement = |caps: &Captures| -> Result<String, &'static str> { /// if caps[0].len() >= 5 { /// return Err("word too long"); /// } /// Ok(caps[0].len().to_string()) /// }; /// assert_eq!( /// Ok("2 3 3 3?".to_string()), /// replace_all(&re, "hi how are you?", &replacement), /// ); /// assert!(replace_all(&re, "hi there", &replacement).is_err()); /// ``` /// /// # Example /// /// This example shows how to flip the order of whitespace (excluding line /// terminators) delimited fields, and normalize the whitespace that /// delimits the fields: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?m)^(\S+)[\s--\r\n]+(\S+)$").unwrap(); /// let hay = " /// Greetings 1973 /// Wild\t1973 /// BornToRun\t\t\t\t1975 /// Darkness 1978 /// TheRiver 1980 /// "; /// let new = re.replace_all(hay, "$2 $1"); /// assert_eq!(new, " /// 1973 Greetings /// 1973 Wild /// 1975 BornToRun /// 1978 Darkness /// 1980 TheRiver /// "); /// ``` #[inline] pub fn replace_all<'h, R: Replacer>( &self, haystack: &'h str, rep: R, ) -> Cow<'h, str> { self.replacen(haystack, 0, rep) } /// Replaces at most `limit` non-overlapping matches in the haystack with /// the replacement provided. If `limit` is `0`, then all non-overlapping /// matches are replaced. That is, `Regex::replace_all(hay, rep)` is /// equivalent to `Regex::replacen(hay, 0, rep)`. /// /// If no match is found, then the haystack is returned unchanged. In that /// case, this implementation will likely return a `Cow::Borrowed` value /// such that no allocation is performed. /// /// When a `Cow::Borrowed` is returned, the value returned is guaranteed /// to be equivalent to the `haystack` given. /// /// The documentation for [`Regex::replace`] goes into more detail about /// what kinds of replacement strings are supported. /// /// # Time complexity /// /// Since iterating over all matches requires running potentially many /// searches on the haystack, and since each search has worst case /// `O(m * n)` time complexity, the overall worst case time complexity for /// this routine is `O(m * n^2)`. /// /// Although note that the worst case time here has an upper bound given /// by the `limit` parameter. /// /// # Fallibility /// /// See the corresponding section in the docs for [`Regex::replace_all`] /// for tips on how to deal with a replacement routine that can fail. /// /// # Example /// /// This example shows how to flip the order of whitespace (excluding line /// terminators) delimited fields, and normalize the whitespace that /// delimits the fields. But we only do it for the first two matches. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?m)^(\S+)[\s--\r\n]+(\S+)$").unwrap(); /// let hay = " /// Greetings 1973 /// Wild\t1973 /// BornToRun\t\t\t\t1975 /// Darkness 1978 /// TheRiver 1980 /// "; /// let new = re.replacen(hay, 2, "$2 $1"); /// assert_eq!(new, " /// 1973 Greetings /// 1973 Wild /// BornToRun\t\t\t\t1975 /// Darkness 1978 /// TheRiver 1980 /// "); /// ``` #[inline] pub fn replacen<'h, R: Replacer>( &self, haystack: &'h str, limit: usize, mut rep: R, ) -> Cow<'h, str> { // If we know that the replacement doesn't have any capture expansions, // then we can use the fast path. The fast path can make a tremendous // difference: // // 1) We use `find_iter` instead of `captures_iter`. Not asking for // captures generally makes the regex engines faster. // 2) We don't need to look up all of the capture groups and do // replacements inside the replacement string.
We just push it // at each match and be done with it. if let Some(rep) = rep.no_expansion() { let mut it = self.find_iter(haystack).enumerate().peekable(); if it.peek().is_none() { return Cow::Borrowed(haystack); } let mut new = String::with_capacity(haystack.len()); let mut last_match = 0; for (i, m) in it { new.push_str(&haystack[last_match..m.start()]); new.push_str(&rep); last_match = m.end(); if limit > 0 && i >= limit - 1 { break; } } new.push_str(&haystack[last_match..]); return Cow::Owned(new); } // The slower path, which we use if the replacement may need access to // capture groups. let mut it = self.captures_iter(haystack).enumerate().peekable(); if it.peek().is_none() { return Cow::Borrowed(haystack); } let mut new = String::with_capacity(haystack.len()); let mut last_match = 0; for (i, cap) in it { // unwrap on 0 is OK because captures only reports matches let m = cap.get(0).unwrap(); new.push_str(&haystack[last_match..m.start()]); rep.replace_append(&cap, &mut new); last_match = m.end(); if limit > 0 && i >= limit - 1 { break; } } new.push_str(&haystack[last_match..]); Cow::Owned(new) } } /// A group of advanced or "lower level" search methods. Some methods permit /// starting the search at a position greater than `0` in the haystack. Other /// methods permit reusing allocations, for example, when extracting the /// matches for capture groups. impl Regex { /// Returns the end byte offset of the first match in the haystack given. /// /// This method may have the same performance characteristics as /// `is_match`. Behaviorally, it doesn't just report whether a match /// occurs, but also the end offset for a match. In particular, the offset /// returned *may be shorter* than the proper end of the leftmost-first /// match that you would find via [`Regex::find`]. /// /// Note that it is not guaranteed that this routine finds the shortest or /// "earliest" possible match. Instead, the main idea of this API is that /// it returns the offset at the point at which the internal regex engine /// has determined that a match has occurred. This may vary depending on /// which internal regex engine is used, and thus, the offset itself may /// change based on internal heuristics. /// /// # Example /// /// Typically, `a+` would match the entire first sequence of `a` in some /// haystack, but `shortest_match` *may* give up as soon as it sees the /// first `a`. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"a+").unwrap(); /// let offset = re.shortest_match("aaaaa").unwrap(); /// assert_eq!(offset, 1); /// ``` #[inline] pub fn shortest_match(&self, haystack: &str) -> Option<usize> { self.shortest_match_at(haystack, 0) } /// Returns the same as [`Regex::shortest_match`], but starts the search at /// the given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only match /// when `start == 0`. /// /// If a match is found, the offset returned is relative to the beginning /// of the haystack, not the beginning of the search. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = "eschew"; /// // We get a match here, but it's probably not intended.
/// assert_eq!(re.shortest_match(&hay[2..]), Some(4)); /// // No match because the assertions take the context into account. /// assert_eq!(re.shortest_match_at(hay, 2), None); /// ``` #[inline] pub fn shortest_match_at( &self, haystack: &str, start: usize, ) -> Option<usize> { let input = Input::new(haystack).earliest(true).span(start..haystack.len()); self.meta.search_half(&input).map(|hm| hm.offset()) } /// Returns the same as [`Regex::is_match`], but starts the search at the /// given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = "eschew"; /// // We get a match here, but it's probably not intended. /// assert!(re.is_match(&hay[2..])); /// // No match because the assertions take the context into account. /// assert!(!re.is_match_at(hay, 2)); /// ``` #[inline] pub fn is_match_at(&self, haystack: &str, start: usize) -> bool { let input = Input::new(haystack).earliest(true).span(start..haystack.len()); self.meta.search_half(&input).is_some() } /// Returns the same as [`Regex::find`], but starts the search at the given /// offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = "eschew"; /// // We get a match here, but it's probably not intended. /// assert_eq!(re.find(&hay[2..]).map(|m| m.range()), Some(0..4)); /// // No match because the assertions take the context into account. /// assert_eq!(re.find_at(hay, 2), None); /// ``` #[inline] pub fn find_at<'h>( &self, haystack: &'h str, start: usize, ) -> Option<Match<'h>> { let input = Input::new(haystack).span(start..haystack.len()); self.meta .search(&input) .map(|m| Match::new(haystack, m.start(), m.end())) } /// Returns the same as [`Regex::captures`], but starts the search at the /// given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = "eschew"; /// // We get a match here, but it's probably not intended. /// assert_eq!(&re.captures(&hay[2..]).unwrap()[0], "chew"); /// // No match because the assertions take the context into account. 
/// assert!(re.captures_at(hay, 2).is_none()); /// ``` #[inline] pub fn captures_at<'h>( &self, haystack: &'h str, start: usize, ) -> Option<Captures<'h>> { let input = Input::new(haystack).span(start..haystack.len()); let mut caps = self.meta.create_captures(); self.meta.search_captures(&input, &mut caps); if caps.is_match() { let static_captures_len = self.static_captures_len(); Some(Captures { haystack, caps, static_captures_len }) } else { None } } /// This is like [`Regex::captures`], but writes the byte offsets of each /// capture group match into the locations given. /// /// A [`CaptureLocations`] stores the same byte offsets as a [`Captures`], /// but does *not* store a reference to the haystack. This makes its API /// a bit lower level and less convenient. But in exchange, callers /// may allocate their own `CaptureLocations` and reuse it for multiple /// searches. This may be helpful if allocating a `Captures` shows up in a /// profile as too costly. /// /// To create a `CaptureLocations` value, use the /// [`Regex::capture_locations`] method. /// /// This also returns the overall match if one was found. When a match is /// found, its offsets are also always stored in `locs` at index `0`. /// /// # Panics /// /// This routine may panic if the given `CaptureLocations` was not created /// by this regex. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"^([a-z]+)=(\S*)$").unwrap(); /// let mut locs = re.capture_locations(); /// assert!(re.captures_read(&mut locs, "id=foo123").is_some()); /// assert_eq!(Some((0, 9)), locs.get(0)); /// assert_eq!(Some((0, 2)), locs.get(1)); /// assert_eq!(Some((3, 9)), locs.get(2)); /// ``` #[inline] pub fn captures_read<'h>( &self, locs: &mut CaptureLocations, haystack: &'h str, ) -> Option<Match<'h>> { self.captures_read_at(locs, haystack, 0) } /// Returns the same as [`Regex::captures_read`], but starts the search at /// the given offset. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// This routine may also panic if the given `CaptureLocations` was not /// created by this regex. /// /// # Example /// /// This example shows the significance of `start` by demonstrating how it /// can be used to permit look-around assertions in a regex to take the /// surrounding context into account. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\bchew\b").unwrap(); /// let hay = "eschew"; /// let mut locs = re.capture_locations(); /// // We get a match here, but it's probably not intended. /// assert!(re.captures_read(&mut locs, &hay[2..]).is_some()); /// // No match because the assertions take the context into account. /// assert!(re.captures_read_at(&mut locs, hay, 2).is_none()); /// ``` #[inline] pub fn captures_read_at<'h>( &self, locs: &mut CaptureLocations, haystack: &'h str, start: usize, ) -> Option<Match<'h>> { let input = Input::new(haystack).span(start..haystack.len()); self.meta.search_captures(&input, &mut locs.0); locs.0.get_match().map(|m| Match::new(haystack, m.start(), m.end())) } /// An undocumented alias for `captures_read_at`. /// /// The `regex-capi` crate previously used this routine, so to avoid /// breaking that crate, we continue to provide the name as an undocumented /// alias. 
#[doc(hidden)] #[inline] pub fn read_captures_at<'h>( &self, locs: &mut CaptureLocations, haystack: &'h str, start: usize, ) -> Option<Match<'h>> { self.captures_read_at(locs, haystack, start) } } /// Auxiliary methods. impl Regex { /// Returns the original string of this regex. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"foo\w+bar").unwrap(); /// assert_eq!(re.as_str(), r"foo\w+bar"); /// ``` #[inline] pub fn as_str(&self) -> &str { &self.pattern } /// Returns an iterator over the capture names in this regex. /// /// The iterator returned yields elements of type `Option<&str>`. That is, /// the iterator yields values for all capture groups, even ones that are /// unnamed. The order of the groups corresponds to the order of the group's /// corresponding opening parenthesis. /// /// The first element of the iterator always yields the group corresponding /// to the overall match, and this group is always unnamed. Therefore, the /// iterator always yields at least one group. /// /// # Example /// /// This shows basic usage with a mix of named and unnamed capture groups: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<a>.(?<b>.))(.)(?:.)(?<c>.)").unwrap(); /// let mut names = re.capture_names(); /// assert_eq!(names.next(), Some(None)); /// assert_eq!(names.next(), Some(Some("a"))); /// assert_eq!(names.next(), Some(Some("b"))); /// assert_eq!(names.next(), Some(None)); /// // the '(?:.)' group is non-capturing and so doesn't appear here! /// assert_eq!(names.next(), Some(Some("c"))); /// assert_eq!(names.next(), None); /// ``` /// /// The iterator always yields at least one element, even for regexes with /// no capture groups and even for regexes that can never match: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"").unwrap(); /// let mut names = re.capture_names(); /// assert_eq!(names.next(), Some(None)); /// assert_eq!(names.next(), None); /// /// let re = Regex::new(r"[a&&b]").unwrap(); /// let mut names = re.capture_names(); /// assert_eq!(names.next(), Some(None)); /// assert_eq!(names.next(), None); /// ``` #[inline] pub fn capture_names(&self) -> CaptureNames<'_> { CaptureNames(self.meta.group_info().pattern_names(PatternID::ZERO)) } /// Returns the number of capture groups in this regex. /// /// This includes all named and unnamed groups, including the implicit /// unnamed group that is always present and corresponds to the entire /// match. /// /// Since the implicit unnamed group is always included in this length, the /// length returned is guaranteed to be greater than zero. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"foo").unwrap(); /// assert_eq!(1, re.captures_len()); /// /// let re = Regex::new(r"(foo)").unwrap(); /// assert_eq!(2, re.captures_len()); /// /// let re = Regex::new(r"(?<a>.(?<b>.))(.)(?:.)(?<c>.)").unwrap(); /// assert_eq!(5, re.captures_len()); /// /// let re = Regex::new(r"[a&&b]").unwrap(); /// assert_eq!(1, re.captures_len()); /// ``` #[inline] pub fn captures_len(&self) -> usize { self.meta.group_info().group_len(PatternID::ZERO) } /// Returns the total number of capturing groups that appear in every /// possible match. /// /// If the number of capture groups can vary depending on the match, then /// this returns `None`. That is, a value is only returned when the number /// of matching groups is invariant or "static."
/// /// Note that like [`Regex::captures_len`], this **does** include the /// implicit capturing group corresponding to the entire match. Therefore, /// when a non-None value is returned, it is guaranteed to be at least `1`. /// Stated differently, a return value of `Some(0)` is impossible. /// /// # Example /// /// This shows a few cases where a static number of capture groups is /// available and a few cases where it is not. /// /// ``` /// use regex::Regex; /// /// let len = |pattern| { /// Regex::new(pattern).map(|re| re.static_captures_len()) /// }; /// /// assert_eq!(Some(1), len("a")?); /// assert_eq!(Some(2), len("(a)")?); /// assert_eq!(Some(2), len("(a)|(b)")?); /// assert_eq!(Some(3), len("(a)(b)|(c)(d)")?); /// assert_eq!(None, len("(a)|b")?); /// assert_eq!(None, len("a|(b)")?); /// assert_eq!(None, len("(b)*")?); /// assert_eq!(Some(2), len("(b)+")?); /// /// # Ok::<(), Box<dyn std::error::Error>>(()) /// ``` #[inline] pub fn static_captures_len(&self) -> Option<usize> { self.meta.static_captures_len() } /// Returns a freshly allocated set of capture locations that can /// be reused in multiple calls to [`Regex::captures_read`] or /// [`Regex::captures_read_at`]. /// /// The returned locations can be used for any subsequent search for this /// particular regex. There is no guarantee that it is correct to use for /// other regexes, even if they have the same number of capture groups. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(.)(.)(\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// assert!(re.captures_read(&mut locs, "Padron").is_some()); /// assert_eq!(locs.get(0), Some((0, 6))); /// assert_eq!(locs.get(1), Some((0, 1))); /// assert_eq!(locs.get(2), Some((1, 2))); /// assert_eq!(locs.get(3), Some((2, 6))); /// ``` #[inline] pub fn capture_locations(&self) -> CaptureLocations { CaptureLocations(self.meta.create_captures()) } /// An alias for `capture_locations` to preserve backward compatibility. /// /// The `regex-capi` crate used this method, so to avoid breaking that /// crate, we continue to export it as an undocumented API. #[doc(hidden)] #[inline] pub fn locations(&self) -> CaptureLocations { self.capture_locations() } } /// Represents a single match of a regex in a haystack. /// /// A `Match` contains both the start and end byte offsets of the match and the /// actual substring corresponding to the range of those byte offsets. It is /// guaranteed that `start <= end`. When `start == end`, the match is empty. /// /// Since this `Match` can only be produced by the top-level `Regex` APIs /// that only support searching UTF-8 encoded strings, the byte offsets for a /// `Match` are guaranteed to fall on valid UTF-8 codepoint boundaries. That /// is, slicing a `&str` with [`Match::range`] is guaranteed to never panic. /// /// Values with this type are created by [`Regex::find`] or /// [`Regex::find_iter`]. Other APIs can create `Match` values too. For /// example, [`Captures::get`]. /// /// The lifetime parameter `'h` refers to the lifetime of the haystack that /// this match was produced from. /// /// # Numbering /// /// The byte offsets in a `Match` form a half-open interval. That is, the /// start of the range is inclusive and the end of the range is exclusive. /// For example, given a haystack `abcFOOxyz` and a match of `FOO`, its byte /// offset range starts at `3` and ends at `6`. `3` corresponds to `F` and /// `6` corresponds to `x`, which is one past the end of the match.
This /// corresponds to the same kind of slicing that Rust uses. /// /// For more on why this was chosen over other schemes (aside from being /// consistent with how Rust the language works), see [this discussion] and /// [Dijkstra's note on a related topic][note]. /// /// [this discussion]: https://github.com/rust-lang/regex/discussions/866 /// [note]: https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html /// /// # Example /// /// This example shows the value of each of the methods on `Match` for a /// particular search. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"\p{Greek}+").unwrap(); /// let hay = "Greek: ฮฑฮฒฮณฮด"; /// let m = re.find(hay).unwrap(); /// assert_eq!(7, m.start()); /// assert_eq!(15, m.end()); /// assert!(!m.is_empty()); /// assert_eq!(8, m.len()); /// assert_eq!(7..15, m.range()); /// assert_eq!("ฮฑฮฒฮณฮด", m.as_str()); /// ``` #[derive(Copy, Clone, Eq, PartialEq)] pub struct Match<'h> { haystack: &'h str, start: usize, end: usize, } impl<'h> Match<'h> { /// Returns the byte offset of the start of the match in the haystack. The /// start of the match corresponds to the position where the match begins /// and includes the first byte in the match. /// /// It is guaranteed that `Match::start() <= Match::end()`. /// /// This is guaranteed to fall on a valid UTF-8 codepoint boundary. That /// is, it will never be an offset that appears between the UTF-8 code /// units of a UTF-8 encoded Unicode scalar value. Consequently, it is /// always safe to slice the corresponding haystack using this offset. #[inline] pub fn start(&self) -> usize { self.start } /// Returns the byte offset of the end of the match in the haystack. The /// end of the match corresponds to the byte immediately following the last /// byte in the match. This means that `&slice[start..end]` works as one /// would expect. /// /// It is guaranteed that `Match::start() <= Match::end()`. /// /// This is guaranteed to fall on a valid UTF-8 codepoint boundary. That /// is, it will never be an offset that appears between the UTF-8 code /// units of a UTF-8 encoded Unicode scalar value. Consequently, it is /// always safe to slice the corresponding haystack using this offset. #[inline] pub fn end(&self) -> usize { self.end } /// Returns true if and only if this match has a length of zero. /// /// Note that an empty match can only occur when the regex itself can /// match the empty string. Here are some examples of regexes that can /// all match the empty string: `^`, `^$`, `\b`, `a?`, `a*`, `a{0}`, /// `(foo|\d+|quux)?`. #[inline] pub fn is_empty(&self) -> bool { self.start == self.end } /// Returns the length, in bytes, of this match. #[inline] pub fn len(&self) -> usize { self.end - self.start } /// Returns the range over the starting and ending byte offsets of the /// match in the haystack. /// /// It is always correct to slice the original haystack searched with this /// range. That is, because the offsets are guaranteed to fall on valid /// UTF-8 boundaries, the range returned is always valid. #[inline] pub fn range(&self) -> core::ops::Range<usize> { self.start..self.end } /// Returns the substring of the haystack that matched. #[inline] pub fn as_str(&self) -> &'h str { &self.haystack[self.range()] } /// Creates a new match from the given haystack and byte offsets. 
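/// /// The caller must ensure that `start <= end` and that both offsets fall on valid UTF-8 codepoint boundaries of `haystack`, since the accessor methods above are documented to rely on exactly those invariants.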
#[inline] fn new(haystack: &'h str, start: usize, end: usize) -> Match<'h> { Match { haystack, start, end } } } impl<'h> core::fmt::Debug for Match<'h> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { f.debug_struct("Match") .field("start", &self.start) .field("end", &self.end) .field("string", &self.as_str()) .finish() } } impl<'h> From<Match<'h>> for &'h str { fn from(m: Match<'h>) -> &'h str { m.as_str() } } impl<'h> From<Match<'h>> for core::ops::Range<usize> { fn from(m: Match<'h>) -> core::ops::Range<usize> { m.range() } } /// Represents the capture groups for a single match. /// /// Capture groups refer to parts of a regex enclosed in parentheses. They /// can be optionally named. The purpose of capture groups is to be able to /// reference different parts of a match based on the original pattern. In /// essence, a `Captures` is a container of [`Match`] values for each group /// that participated in a regex match. Each `Match` can be looked up by either /// its capture group index or name (if it has one). /// /// For example, say you want to match the individual letters in a 5-letter /// word: /// /// ```text /// (?<first>\w)(\w)(?:\w)\w(?<last>\w) /// ``` /// /// This regex has 4 capture groups: /// /// * The group at index `0` corresponds to the overall match. It is always /// present in every match and never has a name. /// * The group at index `1` with name `first` corresponding to the first /// letter. /// * The group at index `2` with no name corresponding to the second letter. /// * The group at index `3` with name `last` corresponding to the fifth and /// last letter. /// /// Notice that `(?:\w)` was not listed above as a capture group despite it /// being enclosed in parentheses. That's because `(?:pattern)` is a special /// syntax that permits grouping but *without* capturing. The reason for not /// treating it as a capture is that tracking and reporting capture groups /// requires additional state that may lead to slower searches. So using as few /// capture groups as possible can help performance. (Although the difference /// in performance of a couple of capture groups is likely immaterial.) /// /// Values with this type are created by [`Regex::captures`] or /// [`Regex::captures_iter`]. /// /// `'h` is the lifetime of the haystack that these captures were matched from. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<first>\w)(\w)(?:\w)\w(?<last>\w)").unwrap(); /// let caps = re.captures("toady").unwrap(); /// assert_eq!("toady", &caps[0]); /// assert_eq!("t", &caps["first"]); /// assert_eq!("o", &caps[2]); /// assert_eq!("y", &caps["last"]); /// ``` pub struct Captures<'h> { haystack: &'h str, caps: captures::Captures, static_captures_len: Option<usize>, } impl<'h> Captures<'h> { /// Returns the `Match` associated with the capture group at index `i`. If /// `i` does not correspond to a capture group, or if the capture group did /// not participate in the match, then `None` is returned. /// /// When `i == 0`, this is guaranteed to return a non-`None` value. 
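/// /// See also [`Captures::get_match`], which returns the overall match directly and avoids the `unwrap()` that `caps.get(0)` would otherwise require.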
/// /// # Examples /// /// Get the substring that matched with a default of an empty string if the /// group didn't participate in the match: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"[a-z]+(?:([0-9]+)|([A-Z]+))").unwrap(); /// let caps = re.captures("abc123").unwrap(); /// /// let substr1 = caps.get(1).map_or("", |m| m.as_str()); /// let substr2 = caps.get(2).map_or("", |m| m.as_str()); /// assert_eq!(substr1, "123"); /// assert_eq!(substr2, ""); /// ``` #[inline] pub fn get(&self, i: usize) -> Option<Match<'h>> { self.caps .get_group(i) .map(|sp| Match::new(self.haystack, sp.start, sp.end)) } /// Returns the overall match for the capture. /// /// This returns the match for index `0`. That is, it is equivalent to /// `m.get(0).unwrap()`. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"[a-z]+([0-9]+)").unwrap(); /// let caps = re.captures(" abc123-def").unwrap(); /// /// assert_eq!(caps.get_match().as_str(), "abc123"); /// ``` #[inline] pub fn get_match(&self) -> Match<'h> { self.get(0).unwrap() } /// Returns the `Match` associated with the capture group named `name`. If /// `name` isn't a valid capture group or it refers to a group that didn't /// match, then `None` is returned. /// /// Note that unlike `caps["name"]`, this returns a `Match` whose lifetime /// matches the lifetime of the haystack in this `Captures` value. /// Conversely, the substring returned by `caps["name"]` has a lifetime /// of the `Captures` value, which is likely shorter than the lifetime of /// the haystack. In some cases, it may be necessary to use this method to /// access the matching substring instead of the `caps["name"]` notation. /// /// # Examples /// /// Get the substring that matched with a default of an empty string if the /// group didn't participate in the match: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new( /// r"[a-z]+(?:(?<numbers>[0-9]+)|(?<letters>[A-Z]+))", /// ).unwrap(); /// let caps = re.captures("abc123").unwrap(); /// /// let numbers = caps.name("numbers").map_or("", |m| m.as_str()); /// let letters = caps.name("letters").map_or("", |m| m.as_str()); /// assert_eq!(numbers, "123"); /// assert_eq!(letters, ""); /// ``` #[inline] pub fn name(&self, name: &str) -> Option<Match<'h>> { self.caps .get_group_by_name(name) .map(|sp| Match::new(self.haystack, sp.start, sp.end)) } /// This is a convenience routine for extracting the substrings /// corresponding to matching capture groups. /// /// This returns a tuple where the first element corresponds to the full /// substring of the haystack that matched the regex. The second element is /// an array of substrings, with each corresponding to the substring that /// matched for a particular capture group. /// /// # Panics /// /// This panics if the number of possible matching groups in this /// `Captures` value is not fixed to `N` in all circumstances. /// More precisely, this routine only works when `N` is equivalent to /// [`Regex::static_captures_len`]. /// /// Stated more plainly, if the number of matching capture groups in a /// regex can vary from match to match, then this function always panics. /// /// For example, `(a)(b)|(c)` could produce two matching capture groups /// or one matching capture group for any given match. Therefore, one /// cannot use `extract` with such a pattern.
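/// (With such a pattern, [`Regex::static_captures_len`] is `None`, so a call like `caps.extract::<2>()` panics at runtime rather than failing at compile time.)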
/// /// But a pattern like `(a)(b)|(c)(d)` can be used with `extract` because /// the number of capture groups in every match is always equivalent, /// even if the capture _indices_ in each match are not. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap(); /// let hay = "On 2010-03-14, I became a Tennessee lamb."; /// let Some((full, [year, month, day])) = /// re.captures(hay).map(|caps| caps.extract()) else { return }; /// assert_eq!("2010-03-14", full); /// assert_eq!("2010", year); /// assert_eq!("03", month); /// assert_eq!("14", day); /// ``` /// /// # Example: iteration /// /// This example shows how to use this method when iterating over all /// `Captures` matches in a haystack. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap(); /// let hay = "1973-01-05, 1975-08-25 and 1980-10-18"; /// /// let mut dates: Vec<(&str, &str, &str)> = vec![]; /// for (_, [y, m, d]) in re.captures_iter(hay).map(|c| c.extract()) { /// dates.push((y, m, d)); /// } /// assert_eq!(dates, vec![ /// ("1973", "01", "05"), /// ("1975", "08", "25"), /// ("1980", "10", "18"), /// ]); /// ``` /// /// # Example: parsing different formats /// /// This API is particularly useful when you need to extract a particular /// value that might occur in a different format. Consider, for example, /// an identifier that might be in double quotes or single quotes: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r#"id:(?:"([^"]+)"|'([^']+)')"#).unwrap(); /// let hay = r#"The first is id:"foo" and the second is id:'bar'."#; /// let mut ids = vec![]; /// for (_, [id]) in re.captures_iter(hay).map(|c| c.extract()) { /// ids.push(id); /// } /// assert_eq!(ids, vec!["foo", "bar"]); /// ``` pub fn extract<const N: usize>(&self) -> (&'h str, [&'h str; N]) { let len = self .static_captures_len .expect("number of capture groups can vary in a match") .checked_sub(1) .expect("number of groups is always greater than zero"); assert_eq!(N, len, "asked for {N} groups, but must ask for {len}"); // The regex-automata variant of extract is a bit more permissive. // It doesn't require the number of matching capturing groups to be // static, and you can even request fewer groups than what's there. So // this is guaranteed to never panic because we've asserted above that // the user has requested precisely the number of groups that must be // present in any match for this regex. self.caps.extract(self.haystack) } /// Expands all instances of `$ref` in `replacement` to the corresponding /// capture group, and writes them to the `dst` buffer given. A `ref` can /// be a capture group index or a name. If `ref` doesn't refer to a capture /// group that participated in the match, then it is replaced with the /// empty string. /// /// # Format /// /// The format of the replacement string supports two different kinds of /// capture references: unbraced and braced. /// /// For the unbraced format, the format supported is `$ref` where `ref` /// can be any sequence of characters in the class `[0-9A-Za-z_]`. `ref` is always /// the longest possible parse. So for example, `$1a` corresponds to the /// capture group named `1a` and not the capture group at index `1`. If /// `ref` matches `^[0-9]+$`, then it is treated as a capture group index /// itself and not a name. /// /// For the braced format, the format supported is `${ref}` where `ref` can /// be any sequence of bytes except for `}`.
If no closing brace occurs, /// then it is not considered a capture reference. As with the unbraced /// format, if `ref` matches `^[0-9]+$`, then it is treated as a capture /// group index and not a name. /// /// The braced format is useful for exerting precise control over the name /// of the capture reference. For example, `${1}a` corresponds to the /// capture group reference `1` followed by the letter `a`, whereas `$1a` /// (as mentioned above) corresponds to the capture group reference `1a`. /// The braced format is also useful for expressing capture group names /// that use characters not supported by the unbraced format. For example, /// `${foo[bar].baz}` refers to the capture group named `foo[bar].baz`. /// /// If a capture group reference is found and it does not refer to a valid /// capture group, then it will be replaced with the empty string. /// /// To write a literal `$`, use `$$`. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new( /// r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})", /// ).unwrap(); /// let hay = "On 14-03-2010, I became a Tennessee lamb."; /// let caps = re.captures(hay).unwrap(); /// /// let mut dst = String::new(); /// caps.expand("year=$year, month=$month, day=$day", &mut dst); /// assert_eq!(dst, "year=2010, month=03, day=14"); /// ``` #[inline] pub fn expand(&self, replacement: &str, dst: &mut String) { self.caps.interpolate_string_into(self.haystack, replacement, dst); } /// Returns an iterator over all capture groups. This includes both /// matching and non-matching groups. /// /// The iterator always yields at least one matching group: the first group /// (at index `0`) with no name. Subsequent groups are returned in the order /// of their opening parenthesis in the regex. /// /// The elements yielded have type `Option<Match<'h>>`, where a non-`None` /// value is present if the capture group matches. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(\w)(\d)?(\w)").unwrap(); /// let caps = re.captures("AZ").unwrap(); /// /// let mut it = caps.iter(); /// assert_eq!(it.next().unwrap().map(|m| m.as_str()), Some("AZ")); /// assert_eq!(it.next().unwrap().map(|m| m.as_str()), Some("A")); /// assert_eq!(it.next().unwrap().map(|m| m.as_str()), None); /// assert_eq!(it.next().unwrap().map(|m| m.as_str()), Some("Z")); /// assert_eq!(it.next(), None); /// ``` #[inline] pub fn iter<'c>(&'c self) -> SubCaptureMatches<'c, 'h> { SubCaptureMatches { haystack: self.haystack, it: self.caps.iter() } } /// Returns the total number of capture groups. This includes both /// matching and non-matching groups. /// /// The length returned is always equivalent to the number of elements /// yielded by [`Captures::iter`]. Consequently, the length is always /// greater than zero since every `Captures` value always includes the /// match for the entire regex. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(\w)(\d)?(\w)").unwrap(); /// let caps = re.captures("AZ").unwrap(); /// assert_eq!(caps.len(), 4); /// ``` #[inline] pub fn len(&self) -> usize { self.caps.group_len() } } impl<'h> core::fmt::Debug for Captures<'h> { fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { /// A little helper type to provide a nice map-like debug /// representation for our capturing group spans. /// /// regex-automata has something similar, but it includes the pattern /// ID in its debug output, which is confusing.
It also doesn't include /// the strings that match because a regex-automata `Captures` doesn't /// borrow the haystack. struct CapturesDebugMap<'a> { caps: &'a Captures<'a>, } impl<'a> core::fmt::Debug for CapturesDebugMap<'a> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { let mut map = f.debug_map(); let names = self.caps.caps.group_info().pattern_names(PatternID::ZERO); for (group_index, maybe_name) in names.enumerate() { let key = Key(group_index, maybe_name); match self.caps.get(group_index) { None => map.entry(&key, &None::<()>), Some(mat) => map.entry(&key, &Value(mat)), }; } map.finish() } } struct Key<'a>(usize, Option<&'a str>); impl<'a> core::fmt::Debug for Key<'a> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { write!(f, "{}", self.0)?; if let Some(name) = self.1 { write!(f, "/{name:?}")?; } Ok(()) } } struct Value<'a>(Match<'a>); impl<'a> core::fmt::Debug for Value<'a> { fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result { write!( f, "{}..{}/{:?}", self.0.start(), self.0.end(), self.0.as_str() ) } } f.debug_tuple("Captures") .field(&CapturesDebugMap { caps: self }) .finish() } } /// Get a matching capture group's haystack substring by index. /// /// The haystack substring returned can't outlive the `Captures` object if this /// method is used, because of how `Index` is defined (normally `a[i]` is part /// of `a` and can't outlive it). To work around this limitation, use /// [`Captures::get`] instead. /// /// `'h` is the lifetime of the matched haystack, but the lifetime of the /// `&str` returned by this implementation is the lifetime of the `Captures` /// value itself. /// /// # Panics /// /// If there is no matching group at the given index. impl<'h> core::ops::Index<usize> for Captures<'h> { type Output = str; // The lifetime is written out to make it clear that the &str returned // does NOT have a lifetime equivalent to 'h. fn index<'a>(&'a self, i: usize) -> &'a str { self.get(i) .map(|m| m.as_str()) .unwrap_or_else(|| panic!("no group at index '{i}'")) } } /// Get a matching capture group's haystack substring by name. /// /// The haystack substring returned can't outlive the `Captures` object if this /// method is used, because of how `Index` is defined (normally `a[i]` is part /// of `a` and can't outlive it). To work around this limitation, use /// [`Captures::name`] instead. /// /// `'h` is the lifetime of the matched haystack, but the lifetime of the /// `&str` returned by this implementation is the lifetime of the `Captures` /// value itself. /// /// `'n` is the lifetime of the group name used to index the `Captures` value. /// /// # Panics /// /// If there is no matching group with the given name. impl<'h, 'n> core::ops::Index<&'n str> for Captures<'h> { type Output = str; fn index<'a>(&'a self, name: &'n str) -> &'a str { self.name(name) .map(|m| m.as_str()) .unwrap_or_else(|| panic!("no group named '{name}'")) } } /// A low level representation of the byte offsets of each capture group. /// /// You can think of this as a lower level [`Captures`], where this type does /// not support named capturing groups directly and it does not borrow the /// haystack that these offsets were matched on. /// /// Primarily, this type is useful when using the lower level `Regex` APIs such /// as [`Regex::captures_read`], which permits amortizing the allocation in /// which capture match offsets are stored.
/// /// In order to build a value of this type, you'll need to call the /// [`Regex::capture_locations`] method. The value returned can then be reused /// in subsequent searches for that regex. Using it for other regexes may /// result in a panic or otherwise incorrect results. /// /// # Example /// /// This example shows how to create and use `CaptureLocations` in a search. /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<last>\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// let m = re.captures_read(&mut locs, "Bruce Springsteen").unwrap(); /// assert_eq!(0..17, m.range()); /// assert_eq!(Some((0, 17)), locs.get(0)); /// assert_eq!(Some((0, 5)), locs.get(1)); /// assert_eq!(Some((6, 17)), locs.get(2)); /// /// // Asking for an invalid capture group always returns None. /// assert_eq!(None, locs.get(3)); /// # // literals are too big for 32-bit usize: #1041 /// # #[cfg(target_pointer_width = "64")] /// assert_eq!(None, locs.get(34973498648)); /// # #[cfg(target_pointer_width = "64")] /// assert_eq!(None, locs.get(9944060567225171988)); /// ``` #[derive(Clone, Debug)] pub struct CaptureLocations(captures::Captures); /// A type alias for `CaptureLocations` for backwards compatibility. /// /// Previously, we exported `CaptureLocations` as `Locations` in an /// undocumented API. To prevent breaking that code (e.g., in `regex-capi`), /// we continue re-exporting the same undocumented API. #[doc(hidden)] pub type Locations = CaptureLocations; impl CaptureLocations { /// Returns the start and end byte offsets of the capture group at index /// `i`. This returns `None` if `i` is not a valid capture group or if the /// capture group did not match. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<last>\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// re.captures_read(&mut locs, "Bruce Springsteen").unwrap(); /// assert_eq!(Some((0, 17)), locs.get(0)); /// assert_eq!(Some((0, 5)), locs.get(1)); /// assert_eq!(Some((6, 17)), locs.get(2)); /// ``` #[inline] pub fn get(&self, i: usize) -> Option<(usize, usize)> { self.0.get_group(i).map(|sp| (sp.start, sp.end)) } /// Returns the total number of capture groups (even if they didn't match). /// That is, the length returned is unaffected by the result of a search. /// /// This is always at least `1` since every regex has at least `1` /// capturing group that corresponds to the entire match. /// /// # Example /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(?<first>\w+)\s+(?<last>\w+)").unwrap(); /// let mut locs = re.capture_locations(); /// assert_eq!(3, locs.len()); /// re.captures_read(&mut locs, "Bruce Springsteen").unwrap(); /// assert_eq!(3, locs.len()); /// ``` /// /// Notice that the length is always at least `1`, regardless of the regex: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"").unwrap(); /// let locs = re.capture_locations(); /// assert_eq!(1, locs.len()); /// /// // [a&&b] is a regex that never matches anything. /// let re = Regex::new(r"[a&&b]").unwrap(); /// let locs = re.capture_locations(); /// assert_eq!(1, locs.len()); /// ``` #[inline] pub fn len(&self) -> usize { // self.0.group_len() returns 0 if the underlying captures doesn't // represent a match, but the behavior guaranteed for this method is // that the length doesn't change based on a match or not. self.0.group_info().group_len(PatternID::ZERO) } /// An alias for the `get` method for backwards compatibility. 
/// /// Previously, we exported `get` as `pos` in an undocumented API. To /// prevent breaking that code (e.g., in `regex-capi`), we continue /// re-exporting the same undocumented API. #[doc(hidden)] #[inline] pub fn pos(&self, i: usize) -> Option<(usize, usize)> { self.get(i) } } /// An iterator over all non-overlapping matches in a haystack. /// /// This iterator yields [`Match`] values. The iterator stops when no more /// matches can be found. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the haystack. /// /// This iterator is created by [`Regex::find_iter`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. #[derive(Debug)] pub struct Matches<'r, 'h> { haystack: &'h str, it: meta::FindMatches<'r, 'h>, } impl<'r, 'h> Iterator for Matches<'r, 'h> { type Item = Match<'h>; #[inline] fn next(&mut self) -> Option<Match<'h>> { self.it .next() .map(|sp| Match::new(self.haystack, sp.start(), sp.end())) } #[inline] fn count(self) -> usize { // This can actually be up to 2x faster than calling `next()` until // completion, because counting matches when using a DFA only requires // finding the end of each match. But returning a `Match` via `next()` // requires the start of each match which, with a DFA, requires a // separate reverse scan to find it. self.it.count() } } impl<'r, 'h> core::iter::FusedIterator for Matches<'r, 'h> {} /// An iterator over all non-overlapping capture matches in a haystack. /// /// This iterator yields [`Captures`] values. The iterator stops when no more /// matches can be found. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the matched string. /// /// This iterator is created by [`Regex::captures_iter`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. #[derive(Debug)] pub struct CaptureMatches<'r, 'h> { haystack: &'h str, it: meta::CapturesMatches<'r, 'h>, } impl<'r, 'h> Iterator for CaptureMatches<'r, 'h> { type Item = Captures<'h>; #[inline] fn next(&mut self) -> Option<Captures<'h>> { let static_captures_len = self.it.regex().static_captures_len(); self.it.next().map(|caps| Captures { haystack: self.haystack, caps, static_captures_len, }) } #[inline] fn count(self) -> usize { // This can actually be up to 2x faster than calling `next()` until // completion, because counting matches when using a DFA only requires // finding the end of each match. But returning a `Match` via `next()` // requires the start of each match which, with a DFA, requires a // separate reverse scan to find it. self.it.count() } } impl<'r, 'h> core::iter::FusedIterator for CaptureMatches<'r, 'h> {} /// An iterator over all substrings delimited by a regex match. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the string being split. /// /// This iterator is created by [`Regex::split`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`.
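/// /// # Example /// /// A brief sketch of how this iterator is typically constructed and consumed via [`Regex::split`]; the pattern and haystack here are only illustrative: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"[,;]\s*").unwrap(); /// let fields: Vec<&str> = re.split("a, b;c").collect(); /// assert_eq!(fields, vec!["a", "b", "c"]); /// ```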
#[derive(Debug)] pub struct Split<'r, 'h> { haystack: &'h str, it: meta::Split<'r, 'h>, } impl<'r, 'h> Iterator for Split<'r, 'h> { type Item = &'h str; #[inline] fn next(&mut self) -> Option<&'h str> { self.it.next().map(|span| &self.haystack[span]) } } impl<'r, 'h> core::iter::FusedIterator for Split<'r, 'h> {} /// An iterator over at most `N` substrings delimited by a regex match. /// /// The last substring yielded by this iterator will be whatever remains after /// `N-1` splits. /// /// `'r` is the lifetime of the compiled regular expression and `'h` is the /// lifetime of the string being split. /// /// This iterator is created by [`Regex::splitn`]. /// /// # Time complexity /// /// Note that since an iterator runs potentially many searches on the haystack /// and since each search has worst case `O(m * n)` time complexity, the /// overall worst case time complexity for iteration is `O(m * n^2)`. /// /// Although note that the worst case time here has an upper bound given /// by the `limit` parameter to [`Regex::splitn`]. #[derive(Debug)] pub struct SplitN<'r, 'h> { haystack: &'h str, it: meta::SplitN<'r, 'h>, } impl<'r, 'h> Iterator for SplitN<'r, 'h> { type Item = &'h str; #[inline] fn next(&mut self) -> Option<&'h str> { self.it.next().map(|span| &self.haystack[span]) } #[inline] fn size_hint(&self) -> (usize, Option<usize>) { self.it.size_hint() } } impl<'r, 'h> core::iter::FusedIterator for SplitN<'r, 'h> {} /// An iterator over the names of all capture groups in a regex. /// /// This iterator yields values of type `Option<&str>` in order of the opening /// capture group parenthesis in the regex pattern. `None` is yielded for /// groups with no name. The first element always corresponds to the implicit /// and unnamed group for the overall match. /// /// `'r` is the lifetime of the compiled regular expression. /// /// This iterator is created by [`Regex::capture_names`]. #[derive(Clone, Debug)] pub struct CaptureNames<'r>(captures::GroupInfoPatternNames<'r>); impl<'r> Iterator for CaptureNames<'r> { type Item = Option<&'r str>; #[inline] fn next(&mut self) -> Option<Option<&'r str>> { self.0.next() } #[inline] fn size_hint(&self) -> (usize, Option<usize>) { self.0.size_hint() } #[inline] fn count(self) -> usize { self.0.count() } } impl<'r> ExactSizeIterator for CaptureNames<'r> {} impl<'r> core::iter::FusedIterator for CaptureNames<'r> {} /// An iterator over all group matches in a [`Captures`] value. /// /// This iterator yields values of type `Option<Match<'h>>`, where `'h` is the /// lifetime of the haystack that the matches are for. The order of elements /// yielded corresponds to the order of the opening parenthesis for the group /// in the regex pattern. `None` is yielded for groups that did not participate /// in the match. /// /// The first element always corresponds to the implicit group for the overall /// match. Since this iterator is created by a [`Captures`] value, and a /// `Captures` value is only created when a match occurs, it follows that the /// first element yielded by this iterator is guaranteed to be non-`None`. /// /// The lifetime `'c` corresponds to the lifetime of the `Captures` value that /// created this iterator, and the lifetime `'h` corresponds to the originally /// matched haystack.
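/// /// # Example /// /// A short sketch of iterating over the groups of a single match via [`Captures::iter`]; the pattern and haystack are only illustrative: /// /// ``` /// use regex::Regex; /// /// let re = Regex::new(r"(\w)(\d)?").unwrap(); /// let caps = re.captures("a1").unwrap(); /// let groups: Vec<Option<&str>> = caps /// .iter() /// .map(|group| group.map(|m| m.as_str())) /// .collect(); /// // The first element is the overall match and is always non-`None`. /// assert_eq!(groups, vec![Some("a1"), Some("a"), Some("1")]); /// ```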
#[derive(Clone, Debug)] pub struct SubCaptureMatches<'c, 'h> { haystack: &'h str, it: captures::CapturesPatternIter<'c>, } impl<'c, 'h> Iterator for SubCaptureMatches<'c, 'h> { type Item = Option<Match<'h>>; #[inline] fn next(&mut self) -> Option<Option<Match<'h>>> { self.it.next().map(|group| { group.map(|sp| Match::new(self.haystack, sp.start, sp.end)) }) } #[inline] fn size_hint(&self) -> (usize, Option<usize>) { self.it.size_hint() } #[inline] fn count(self) -> usize { self.it.count() } } impl<'c, 'h> ExactSizeIterator for SubCaptureMatches<'c, 'h> {} impl<'c, 'h> core::iter::FusedIterator for SubCaptureMatches<'c, 'h> {} /// A trait for types that can be used to replace matches in a haystack. /// /// In general, users of this crate shouldn't need to implement this trait, /// since implementations are already provided for `&str` along with other /// variants of string types, as well as `FnMut(&Captures) -> String` (or any /// `FnMut(&Captures) -> T` where `T: AsRef<str>`). Those cover most use cases, /// but callers can implement this trait directly if necessary. /// /// # Example /// /// This example shows a basic implementation of the `Replacer` trait. This /// can be done much more simply using the replacement string interpolation /// support (e.g., `$first $last`), but this approach avoids needing to parse /// the replacement string at all. /// /// ``` /// use regex::{Captures, Regex, Replacer}; /// /// struct NameSwapper; /// /// impl Replacer for NameSwapper { /// fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { /// dst.push_str(&caps["first"]); /// dst.push_str(" "); /// dst.push_str(&caps["last"]); /// } /// } /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(?<first>\S+)").unwrap(); /// let result = re.replace("Springsteen, Bruce", NameSwapper); /// assert_eq!(result, "Bruce Springsteen"); /// ``` pub trait Replacer { /// Appends possibly empty data to `dst` to replace the current match. /// /// The current match is represented by `caps`, which is guaranteed to /// have a match at capture group `0`. /// /// For example, a no-op replacement would be `dst.push_str(&caps[0])`. fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String); /// Return a fixed unchanging replacement string. /// /// When doing replacements, if access to [`Captures`] is not needed (e.g., /// the replacement string does not need `$` expansion), then it can be /// beneficial to avoid finding sub-captures. /// /// In general, this is called once for every call to a replacement routine /// such as [`Regex::replace_all`]. fn no_expansion<'r>(&'r mut self) -> Option<Cow<'r, str>> { None } /// Returns a type that implements `Replacer`, but that borrows and wraps /// this `Replacer`. /// /// This is useful when you want to take a generic `Replacer` (which might /// not be cloneable) and use it without consuming it, so it can be used /// more than once. 
/// /// # Example /// /// ``` /// use regex::{Regex, Replacer}; /// /// fn replace_all_twice<R: Replacer>( /// re: Regex, /// src: &str, /// mut rep: R, /// ) -> String { /// let dst = re.replace_all(src, rep.by_ref()); /// let dst = re.replace_all(&dst, rep.by_ref()); /// dst.into_owned() /// } /// ``` fn by_ref<'r>(&'r mut self) -> ReplacerRef<'r, Self> { ReplacerRef(self) } } impl<'a> Replacer for &'a str { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { caps.expand(*self, dst); } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { no_expansion(self) } } impl<'a> Replacer for &'a String { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { self.as_str().replace_append(caps, dst) } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { no_expansion(self) } } impl Replacer for String { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { self.as_str().replace_append(caps, dst) } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { no_expansion(self) } } impl<'a> Replacer for Cow<'a, str> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { self.as_ref().replace_append(caps, dst) } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { no_expansion(self) } } impl<'a> Replacer for &'a Cow<'a, str> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { self.as_ref().replace_append(caps, dst) } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { no_expansion(self) } } impl<F, T> Replacer for F where F: FnMut(&Captures<'_>) -> T, T: AsRef<str>, { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { dst.push_str((*self)(caps).as_ref()); } } /// A by-reference adaptor for a [`Replacer`]. /// /// This permits reusing the same `Replacer` value in multiple calls to a /// replacement routine like [`Regex::replace_all`]. /// /// This type is created by [`Replacer::by_ref`]. #[derive(Debug)] pub struct ReplacerRef<'a, R: ?Sized>(&'a mut R); impl<'a, R: Replacer + ?Sized + 'a> Replacer for ReplacerRef<'a, R> { fn replace_append(&mut self, caps: &Captures<'_>, dst: &mut String) { self.0.replace_append(caps, dst) } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { self.0.no_expansion() } } /// A helper type for forcing literal string replacement. /// /// It can be used with routines like [`Regex::replace`] and /// [`Regex::replace_all`] to do a literal string replacement without expanding /// `$name` to their corresponding capture groups. This can be both convenient /// (to avoid escaping `$`, for example) and faster (since capture groups /// don't need to be found). /// /// `'s` is the lifetime of the literal string to use. /// /// # Example /// /// ``` /// use regex::{NoExpand, Regex}; /// /// let re = Regex::new(r"(?<last>[^,\s]+),\s+(\S+)").unwrap(); /// let result = re.replace("Springsteen, Bruce", NoExpand("$2 $last")); /// assert_eq!(result, "$2 $last"); /// ``` #[derive(Clone, Debug)] pub struct NoExpand<'s>(pub &'s str); impl<'s> Replacer for NoExpand<'s> { fn replace_append(&mut self, _: &Captures<'_>, dst: &mut String) { dst.push_str(self.0); } fn no_expansion(&mut self) -> Option<Cow<'_, str>> { Some(Cow::Borrowed(self.0)) } } /// Quickly checks the given replacement string for whether interpolation /// should be done on it. It returns `None` if a `$` was found anywhere in the /// given string, which suggests interpolation needs to be done. But if there's /// no `$` anywhere, then interpolation definitely does not need to be done. 
In /// that case, the given string is returned as a borrowed `Cow`. /// /// This is meant to be used to implement the [`Replacer::no_expansion`] method /// in its various trait impls. fn no_expansion<T: AsRef<str>>(replacement: &T) -> Option<Cow<'_, str>> { let replacement = replacement.as_ref(); match crate::find_byte::find_byte(b'$', replacement.as_bytes()) { Some(_) => None, None => Some(Cow::Borrowed(replacement)), } } regex-1.12.2/src/regexset/bytes.rs 0000644 0000000 0000000 00000057472 10461020230 0015141 0 ustar 0000000 0000000 use alloc::string::String; use regex_automata::{meta, Input, PatternID, PatternSet, PatternSetIter}; use crate::{bytes::RegexSetBuilder, Error}; /// Match multiple, possibly overlapping, regexes in a single search. /// /// A regex set corresponds to the union of zero or more regular expressions. /// That is, a regex set will match a haystack when at least one of its /// constituent regexes matches. A regex set as it's formulated here provides a /// touch more power: it will also report *which* regular expressions in the /// set match. Indeed, this is the key difference between regex sets and a /// single `Regex` with many alternates, since only one alternate can match at /// a time. /// /// For example, consider regular expressions to match email addresses and /// domains: `[a-z]+@[a-z]+\.(com|org|net)` and `[a-z]+\.(com|org|net)`. If a /// regex set is constructed from those regexes, then searching the haystack /// `foo@example.com` will report both regexes as matching. Of course, one /// could accomplish this by compiling each regex on its own and doing two /// searches over the haystack. The key advantage of using a regex set is /// that it will report the matching regexes using a *single pass through the /// haystack*. If one has hundreds or thousands of regexes to match repeatedly /// (like a URL router for a complex web application or a user agent matcher), /// then a regex set *can* realize huge performance gains. /// /// Unlike the top-level [`RegexSet`](crate::RegexSet), this `RegexSet` /// searches haystacks with type `&[u8]` instead of `&str`. Consequently, this /// `RegexSet` is permitted to match invalid UTF-8. /// /// # Limitations /// /// Regex sets are limited to answering the following two questions: /// /// 1. Does any regex in the set match? /// 2. If so, which regexes in the set match? /// /// As with the main [`Regex`][crate::bytes::Regex] type, it is cheaper to ask /// (1) instead of (2) since the matching engines can stop after the first /// match is found. /// /// You cannot directly extract [`Match`][crate::bytes::Match] or /// [`Captures`][crate::bytes::Captures] objects from a regex set.
If you need /// these operations, the recommended approach is to compile each pattern in /// the set independently and scan the exact same haystack a second time with /// those independently compiled patterns: /// /// ``` /// use regex::bytes::{Regex, RegexSet}; /// /// let patterns = ["foo", "bar"]; /// // Both patterns will match different ranges of this string. /// let hay = b"barfoo"; /// /// // Compile a set matching any of our patterns. /// let set = RegexSet::new(patterns).unwrap(); /// // Compile each pattern independently. /// let regexes: Vec<_> = set /// .patterns() /// .iter() /// .map(|pat| Regex::new(pat).unwrap()) /// .collect(); /// /// // Match against the whole set first and identify the individual /// // matching patterns. /// let matches: Vec<&[u8]> = set /// .matches(hay) /// .into_iter() /// // Dereference the match index to get the corresponding /// // compiled pattern. /// .map(|index| &regexes[index]) /// // To get match locations or any other info, we then have to search the /// // exact same haystack again, using our separately-compiled pattern. /// .map(|re| re.find(hay).unwrap().as_bytes()) /// .collect(); /// /// // Matches arrive in the order the constituent patterns were declared, /// // not the order they appear in the haystack. /// assert_eq!(vec![&b"foo"[..], &b"bar"[..]], matches); /// ``` /// /// # Performance /// /// A `RegexSet` has the same performance characteristics as `Regex`. Namely, /// search takes `O(m * n)` time, where `m` is proportional to the size of the /// regex set and `n` is proportional to the length of the haystack. /// /// # Trait implementations /// /// The `Default` trait is implemented for `RegexSet`. The default value /// is an empty set. An empty set can also be explicitly constructed via /// [`RegexSet::empty`]. /// /// # Example /// /// This shows how the above two regexes (for matching email addresses and /// domains) might work: /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new(&[ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// /// // Ask whether any regexes in the set match. /// assert!(set.is_match(b"foo@example.com")); /// /// // Identify which regexes in the set match. /// let matches: Vec<_> = set.matches(b"foo@example.com").into_iter().collect(); /// assert_eq!(vec![0, 1], matches); /// /// // Try again, but with a haystack that only matches one of the regexes. /// let matches: Vec<_> = set.matches(b"example.com").into_iter().collect(); /// assert_eq!(vec![1], matches); /// /// // Try again, but with a haystack that doesn't match any regex in the set. /// let matches: Vec<_> = set.matches(b"example").into_iter().collect(); /// assert!(matches.is_empty()); /// ``` /// /// Note that it would be possible to adapt the above example to using `Regex` /// with an expression like: /// /// ```text /// (?P<email>[a-z]+@(?P<email_domain>[a-z]+[.](com|org|net)))|(?P<domain>[a-z]+[.](com|org|net)) /// ``` /// /// After a match, one could then inspect the capture groups to figure out /// which alternates matched. The problem is that it is hard to make this /// approach scale when there are many regexes since the overlap between each /// alternate isn't always obvious to reason about. #[derive(Clone)] pub struct RegexSet { pub(crate) meta: meta::Regex, pub(crate) patterns: alloc::sync::Arc<[String]>, } impl RegexSet { /// Create a new regex set with the given regular expressions. /// /// This takes an iterator of `S`, where `S` is something that can produce /// a `&str`.
If any of the strings in the iterator are not valid regular /// expressions, then an error is returned. /// /// # Example /// /// Create a new regex set from an iterator of strings: /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([r"\w+", r"\d+"]).unwrap(); /// assert!(set.is_match(b"foo")); /// ``` pub fn new<I, S>(exprs: I) -> Result<RegexSet, Error> where S: AsRef<str>, I: IntoIterator<Item = S>, { RegexSetBuilder::new(exprs).build() } /// Create a new empty regex set. /// /// An empty regex set never matches anything. /// /// This is a convenience function for `RegexSet::new([])`, but doesn't /// require one to specify the type of the input. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::empty(); /// assert!(set.is_empty()); /// // an empty set matches nothing /// assert!(!set.is_match(b"")); /// ``` pub fn empty() -> RegexSet { let empty: [&str; 0] = []; RegexSetBuilder::new(empty).build().unwrap() } /// Returns true if and only if one of the regexes in this set matches /// the haystack given. /// /// This method should be preferred if you only need to test whether any /// of the regexes in the set match, but don't care about *which* /// regexes matched. This is because the underlying matching engine will /// quit immediately after seeing the first match instead of continuing to /// find all matches. /// /// Note that as with searches using [`Regex`](crate::bytes::Regex), the /// expression is unanchored by default. That is, if the regex does not /// start with `^` or `\A`, or end with `$` or `\z`, then it is permitted /// to match anywhere in the haystack. /// /// # Example /// /// Tests whether a set matches somewhere in a haystack: /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([r"\w+", r"\d+"]).unwrap(); /// assert!(set.is_match(b"foo")); /// assert!(!set.is_match("☃".as_bytes())); /// ``` #[inline] pub fn is_match(&self, haystack: &[u8]) -> bool { self.is_match_at(haystack, 0) } /// Returns true if and only if one of the regexes in this set matches the /// haystack given, with the search starting at the offset given. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start`. Namely, consider a /// haystack `foobar` and a desire to execute a search starting at offset /// `3`. You could search a substring explicitly, but then the look-around /// assertions won't work correctly. Instead, you can use this method to /// specify the start position of a search. /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([r"\bbar\b", r"(?m)^bar$"]).unwrap(); /// let hay = b"foobar"; /// // We get a match here, but it's probably not intended. /// assert!(set.is_match(&hay[3..])); /// // No match because the assertions take the context into account. /// assert!(!set.is_match_at(hay, 3)); /// ``` #[inline] pub fn is_match_at(&self, haystack: &[u8], start: usize) -> bool { self.meta.is_match(Input::new(haystack).span(start..haystack.len())) } /// Returns the set of regexes that match in the given haystack. /// /// The set returned contains the index of each regex that matches in /// the given haystack. The index is in correspondence with the order of /// regular expressions given to `RegexSet`'s constructor.
/// /// The set can also be used to iterate over the matched indices. The order /// of iteration is always ascending with respect to the matching indices. /// /// Note that as with searches using [`Regex`](crate::bytes::Regex), the /// expression is unanchored by default. That is, if the regex does not /// start with `^` or `\A`, or end with `$` or `\z`, then it is permitted /// to match anywhere in the haystack. /// /// # Example /// /// Tests which regular expressions match the given haystack: /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([ /// r"\w+", /// r"\d+", /// r"\pL+", /// r"foo", /// r"bar", /// r"barfoo", /// r"foobar", /// ]).unwrap(); /// let matches: Vec<_> = set.matches(b"foobar").into_iter().collect(); /// assert_eq!(matches, vec![0, 2, 3, 4, 6]); /// /// // You can also test whether a particular regex matched: /// let matches = set.matches(b"foobar"); /// assert!(!matches.matched(5)); /// assert!(matches.matched(6)); /// ``` #[inline] pub fn matches(&self, haystack: &[u8]) -> SetMatches { self.matches_at(haystack, 0) } /// Returns the set of regexes that match in the given haystack. /// /// The set returned contains the index of each regex that matches in /// the given haystack. The index is in correspondence with the order of /// regular expressions given to `RegexSet`'s constructor. /// /// The set can also be used to iterate over the matched indices. The order /// of iteration is always ascending with respect to the matching indices. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// Tests which regular expressions match the given haystack: /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([r"\bbar\b", r"(?m)^bar$"]).unwrap(); /// let hay = b"foobar"; /// // We get matches here, but it's probably not intended. /// let matches: Vec<_> = set.matches(&hay[3..]).into_iter().collect(); /// assert_eq!(matches, vec![0, 1]); /// // No matches because the assertions take the context into account. /// let matches: Vec<_> = set.matches_at(hay, 3).into_iter().collect(); /// assert_eq!(matches, vec![]); /// ``` #[inline] pub fn matches_at(&self, haystack: &[u8], start: usize) -> SetMatches { let input = Input::new(haystack).span(start..haystack.len()); let mut patset = PatternSet::new(self.meta.pattern_len()); self.meta.which_overlapping_matches(&input, &mut patset); SetMatches(patset) } /// Returns the same as matches, but starts the search at the given /// offset and stores the matches into the slice given. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// `matches` must have a length that is at least the number of regexes /// in this set. /// /// This method returns true if and only if at least one member of /// `matches` is true after executing the set against `haystack`. #[doc(hidden)] #[inline] pub fn matches_read_at( &self, matches: &mut [bool], haystack: &[u8], start: usize, ) -> bool { // This is pretty dumb. We should try to fix this, but the // regex-automata API doesn't provide a way to store matches in an // arbitrary &mut [bool]. Thankfully, this API is doc(hidden) and // thus not public... But regex-capi currently uses it. 
We should // fix regex-capi to use a PatternSet, maybe? Not sure... PatternSet // is in regex-automata, not regex. So maybe we should just accept a // 'SetMatches', which is basically just a newtype around PatternSet. let mut patset = PatternSet::new(self.meta.pattern_len()); let mut input = Input::new(haystack); input.set_start(start); self.meta.which_overlapping_matches(&input, &mut patset); for pid in patset.iter() { matches[pid] = true; } !patset.is_empty() } /// An alias for `matches_read_at` to preserve backward compatibility. /// /// The `regex-capi` crate used this method, so to avoid breaking that /// crate, we continue to export it as an undocumented API. #[doc(hidden)] #[inline] pub fn read_matches_at( &self, matches: &mut [bool], haystack: &[u8], start: usize, ) -> bool { self.matches_read_at(matches, haystack, start) } /// Returns the total number of regexes in this set. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// assert_eq!(0, RegexSet::empty().len()); /// assert_eq!(1, RegexSet::new([r"[0-9]"]).unwrap().len()); /// assert_eq!(2, RegexSet::new([r"[0-9]", r"[a-z]"]).unwrap().len()); /// ``` #[inline] pub fn len(&self) -> usize { self.meta.pattern_len() } /// Returns `true` if this set contains no regexes. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// assert!(RegexSet::empty().is_empty()); /// assert!(!RegexSet::new([r"[0-9]"]).unwrap().is_empty()); /// ``` #[inline] pub fn is_empty(&self) -> bool { self.meta.pattern_len() == 0 } /// Returns the regex patterns that this regex set was constructed from. /// /// This function can be used to determine the pattern for a match. The /// slice returned has exactly as many patterns as were given to this regex /// set, and the order of the slice is the same as the order of the patterns /// provided to the set. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new(&[ /// r"\w+", /// r"\d+", /// r"\pL+", /// r"foo", /// r"bar", /// r"barfoo", /// r"foobar", /// ]).unwrap(); /// let matches: Vec<_> = set /// .matches(b"foobar") /// .into_iter() /// .map(|index| &set.patterns()[index]) /// .collect(); /// assert_eq!(matches, vec![r"\w+", r"\pL+", r"foo", r"bar", r"foobar"]); /// ``` #[inline] pub fn patterns(&self) -> &[String] { &self.patterns } } impl Default for RegexSet { fn default() -> Self { RegexSet::empty() } } /// A set of matches returned by a regex set. /// /// Values of this type are constructed by [`RegexSet::matches`]. #[derive(Clone, Debug)] pub struct SetMatches(PatternSet); impl SetMatches { /// Whether this set contains any matches. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new(&[ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// let matches = set.matches(b"foo@example.com"); /// assert!(matches.matched_any()); /// ``` #[inline] pub fn matched_any(&self) -> bool { !self.0.is_empty() } /// Whether all patterns in this set matched. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new(&[ /// r"^foo", /// r"[a-z]+\.com", /// ]).unwrap(); /// let matches = set.matches(b"foo.example.com"); /// assert!(matches.matched_all()); /// ``` pub fn matched_all(&self) -> bool { self.0.is_full() } /// Whether the regex at the given index matched. /// /// The index for a regex is determined by its insertion order upon the /// initial construction of a `RegexSet`, starting at `0`.
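///
/// Asking about an index that is out of range panics (see `Panics` below).
/// When the index is not known to be in range, it can be guarded with
/// [`SetMatches::len`]. A minimal sketch, where the patterns and the
/// out-of-range index are purely illustrative:
///
/// ```
/// use regex::bytes::RegexSet;
///
/// let set = RegexSet::new([r"[0-9]", r"[a-z]"]).unwrap();
/// let matches = set.matches(b"a1");
/// // Guard against out-of-range indices before calling `matched`. The
/// // short-circuiting `&&` ensures `matched` is never called with a bad
/// // index, so this does not panic.
/// let index = 5;
/// let ok = index < matches.len() && matches.matched(index);
/// assert!(!ok);
/// ```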
/// /// # Panics /// /// If `index` is greater than or equal to the number of regexes in the /// original set that produced these matches. Equivalently, when `index` /// is greater than or equal to [`SetMatches::len`]. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// let matches = set.matches(b"example.com"); /// assert!(!matches.matched(0)); /// assert!(matches.matched(1)); /// ``` #[inline] pub fn matched(&self, index: usize) -> bool { self.0.contains(PatternID::new_unchecked(index)) } /// The total number of regexes in the set that created these matches. /// /// **WARNING:** This always returns the same value as [`RegexSet::len`]. /// In particular, it does *not* return the number of elements yielded by /// [`SetMatches::iter`]. The only way to determine the total number of /// matched regexes is to iterate over them. /// /// # Example /// /// Notice that this method returns the total number of regexes in the /// original set, and *not* the total number of regexes that matched. /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// let matches = set.matches(b"example.com"); /// // Total number of patterns that matched. /// assert_eq!(1, matches.iter().count()); /// // Total number of patterns in the set. /// assert_eq!(2, matches.len()); /// ``` #[inline] pub fn len(&self) -> usize { self.0.capacity() } /// Returns an iterator over the indices of the regexes that matched. /// /// This will always produce matches in ascending order, where the index /// yielded corresponds to the index of the regex that matched with respect /// to its position when initially building the set. /// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([ /// r"[0-9]", /// r"[a-z]", /// r"[A-Z]", /// r"\p{Greek}", /// ]).unwrap(); /// let hay = "ฮฒa1".as_bytes(); /// let matches: Vec<_> = set.matches(hay).iter().collect(); /// assert_eq!(matches, vec![0, 1, 3]); /// ``` /// /// Note that `SetMatches` also implements the `IntoIterator` trait, so /// this method is not always needed. For example: /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([ /// r"[0-9]", /// r"[a-z]", /// r"[A-Z]", /// r"\p{Greek}", /// ]).unwrap(); /// let hay = "ฮฒa1".as_bytes(); /// let mut matches = vec![]; /// for index in set.matches(hay) { /// matches.push(index); /// } /// assert_eq!(matches, vec![0, 1, 3]); /// ``` #[inline] pub fn iter(&self) -> SetMatchesIter<'_> { SetMatchesIter(self.0.iter()) } } impl IntoIterator for SetMatches { type IntoIter = SetMatchesIntoIter; type Item = usize; fn into_iter(self) -> Self::IntoIter { let it = 0..self.0.capacity(); SetMatchesIntoIter { patset: self.0, it } } } impl<'a> IntoIterator for &'a SetMatches { type IntoIter = SetMatchesIter<'a>; type Item = usize; fn into_iter(self) -> Self::IntoIter { self.iter() } } /// An owned iterator over the set of matches from a regex set. /// /// This will always produce matches in ascending order of index, where the /// index corresponds to the index of the regex that matched with respect to /// its position when initially building the set. /// /// This iterator is created by calling `SetMatches::into_iter` via the /// `IntoIterator` trait. This is automatically done in `for` loops.
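///
/// Since this iterator implements `DoubleEndedIterator` (see the trait
/// impls below), the matched indices can also be drained from the back
/// with `rev()`. A minimal sketch, with illustrative patterns:
///
/// ```
/// use regex::bytes::RegexSet;
///
/// let set = RegexSet::new([r"[0-9]", r"[a-z]"]).unwrap();
/// // Both patterns match "a1", so reversing yields the indices in
/// // descending order.
/// let matches: Vec<_> = set.matches(b"a1").into_iter().rev().collect();
/// assert_eq!(matches, vec![1, 0]);
/// ```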
/// /// # Example /// /// ``` /// use regex::bytes::RegexSet; /// /// let set = RegexSet::new([ /// r"[0-9]", /// r"[a-z]", /// r"[A-Z]", /// r"\p{Greek}", /// ]).unwrap(); /// let hay = "ฮฒa1".as_bytes(); /// let mut matches = vec![]; /// for index in set.matches(hay) { /// matches.push(index); /// } /// assert_eq!(matches, vec![0, 1, 3]); /// ``` #[derive(Debug)] pub struct SetMatchesIntoIter { patset: PatternSet, it: core::ops::Range<usize>, } impl Iterator for SetMatchesIntoIter { type Item = usize; fn next(&mut self) -> Option<usize> { loop { let id = self.it.next()?; if self.patset.contains(PatternID::new_unchecked(id)) { return Some(id); } } } fn size_hint(&self) -> (usize, Option<usize>) { self.it.size_hint() } } impl DoubleEndedIterator for SetMatchesIntoIter { fn next_back(&mut self) -> Option<usize> { loop { let id = self.it.next_back()?; if self.patset.contains(PatternID::new_unchecked(id)) { return Some(id); } } } } impl core::iter::FusedIterator for SetMatchesIntoIter {} /// A borrowed iterator over the set of matches from a regex set. /// /// The lifetime `'a` refers to the lifetime of the [`SetMatches`] value that /// created this iterator. /// /// This will always produce matches in ascending order, where the index /// corresponds to the index of the regex that matched with respect to its /// position when initially building the set. /// /// This iterator is created by the [`SetMatches::iter`] method. #[derive(Clone, Debug)] pub struct SetMatchesIter<'a>(PatternSetIter<'a>); impl<'a> Iterator for SetMatchesIter<'a> { type Item = usize; fn next(&mut self) -> Option<usize> { self.0.next().map(|pid| pid.as_usize()) } fn size_hint(&self) -> (usize, Option<usize>) { self.0.size_hint() } } impl<'a> DoubleEndedIterator for SetMatchesIter<'a> { fn next_back(&mut self) -> Option<usize> { self.0.next_back().map(|pid| pid.as_usize()) } } impl<'a> core::iter::FusedIterator for SetMatchesIter<'a> {} impl core::fmt::Debug for RegexSet { fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { write!(f, "RegexSet({:?})", self.patterns()) } }
regex-1.12.2/src/regexset/mod.rs
pub(crate) mod bytes; pub(crate) mod string;
regex-1.12.2/src/regexset/string.rs
use alloc::string::String; use regex_automata::{meta, Input, PatternID, PatternSet, PatternSetIter}; use crate::{Error, RegexSetBuilder}; /// Match multiple, possibly overlapping, regexes in a single search. /// /// A regex set corresponds to the union of zero or more regular expressions. /// That is, a regex set will match a haystack when at least one of its /// constituent regexes matches. A regex set as it's formulated here provides a /// touch more power: it will also report *which* regular expressions in the /// set match. Indeed, this is the key difference between regex sets and a /// single `Regex` with many alternates, since only one alternate can match at /// a time. /// /// For example, consider regular expressions to match email addresses and /// domains: `[a-z]+@[a-z]+\.(com|org|net)` and `[a-z]+\.(com|org|net)`. If a /// regex set is constructed from those regexes, then searching the haystack /// `foo@example.com` will report both regexes as matching. Of course, one /// could accomplish this by compiling each regex on its own and doing two /// searches over the haystack. The key advantage of using a regex set is /// that it will report the matching regexes using a *single pass through the /// haystack*. If one has hundreds or thousands of regexes to match repeatedly /// (like a URL router for a complex web application or a user agent matcher), /// then a regex set *can* realize huge performance gains. /// /// # Limitations /// /// Regex sets are limited to answering the following two questions: /// /// 1. Does any regex in the set match? /// 2. If so, which regexes in the set match? /// /// As with the main [`Regex`][crate::Regex] type, it is cheaper to ask (1) /// instead of (2) since the matching engines can stop after the first match /// is found. /// /// You cannot directly extract [`Match`][crate::Match] or /// [`Captures`][crate::Captures] objects from a regex set. If you need these /// operations, the recommended approach is to compile each pattern in the set /// independently and scan the exact same haystack a second time with those /// independently compiled patterns: /// /// ``` /// use regex::{Regex, RegexSet}; /// /// let patterns = ["foo", "bar"]; /// // Both patterns will match different ranges of this string. /// let hay = "barfoo"; /// /// // Compile a set matching any of our patterns. /// let set = RegexSet::new(patterns).unwrap(); /// // Compile each pattern independently.
/// let regexes: Vec<_> = set /// .patterns() /// .iter() /// .map(|pat| Regex::new(pat).unwrap()) /// .collect(); /// /// // Match against the whole set first and identify the individual /// // matching patterns. /// let matches: Vec<&str> = set /// .matches(hay) /// .into_iter() /// // Dereference the match index to get the corresponding /// // compiled pattern. /// .map(|index| &regexes[index]) /// // To get match locations or any other info, we then have to search the /// // exact same haystack again, using our separately-compiled pattern. /// .map(|re| re.find(hay).unwrap().as_str()) /// .collect(); /// /// // Matches arrive in the order the constituent patterns were declared, /// // not the order they appear in the haystack. /// assert_eq!(vec!["foo", "bar"], matches); /// ``` /// /// # Performance /// /// A `RegexSet` has the same performance characteristics as `Regex`. Namely, /// search takes `O(m * n)` time, where `m` is proportional to the size of the /// regex set and `n` is proportional to the length of the haystack. /// /// # Trait implementations /// /// The `Default` trait is implemented for `RegexSet`. The default value /// is an empty set. An empty set can also be explicitly constructed via /// [`RegexSet::empty`]. /// /// # Example /// /// This shows how the above two regexes (for matching email addresses and /// domains) might work: /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new(&[ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// /// // Ask whether any regexes in the set match. /// assert!(set.is_match("foo@example.com")); /// /// // Identify which regexes in the set match. /// let matches: Vec<_> = set.matches("foo@example.com").into_iter().collect(); /// assert_eq!(vec![0, 1], matches); /// /// // Try again, but with a haystack that only matches one of the regexes. /// let matches: Vec<_> = set.matches("example.com").into_iter().collect(); /// assert_eq!(vec![1], matches); /// /// // Try again, but with a haystack that doesn't match any regex in the set. /// let matches: Vec<_> = set.matches("example").into_iter().collect(); /// assert!(matches.is_empty()); /// ``` /// /// Note that it would be possible to adapt the above example to using `Regex` /// with an expression like: /// /// ```text /// (?P<email>[a-z]+@(?P<email_domain>[a-z]+[.](com|org|net)))|(?P<domain>[a-z]+[.](com|org|net)) /// ``` /// /// After a match, one could then inspect the capture groups to figure out /// which alternates matched. The problem is that it is hard to make this /// approach scale when there are many regexes since the overlap between each /// alternate isn't always obvious to reason about. #[derive(Clone)] pub struct RegexSet { pub(crate) meta: meta::Regex, pub(crate) patterns: alloc::sync::Arc<[String]>, } impl RegexSet { /// Create a new regex set with the given regular expressions. /// /// This takes an iterator of `S`, where `S` is something that can produce /// a `&str`. If any of the strings in the iterator are not valid regular /// expressions, then an error is returned. /// /// # Example /// /// Create a new regex set from an iterator of strings: /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([r"\w+", r"\d+"]).unwrap(); /// assert!(set.is_match("foo")); /// ``` pub fn new<I, S>(exprs: I) -> Result<RegexSet, Error> where S: AsRef<str>, I: IntoIterator<Item = S>, { RegexSetBuilder::new(exprs).build() } /// Create a new empty regex set. /// /// An empty regex set never matches anything.
/// /// This is a convenience function for `RegexSet::new([])`, but doesn't /// require one to specify the type of the input. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::empty(); /// assert!(set.is_empty()); /// // an empty set matches nothing /// assert!(!set.is_match("")); /// ``` pub fn empty() -> RegexSet { let empty: [&str; 0] = []; RegexSetBuilder::new(empty).build().unwrap() } /// Returns true if and only if one of the regexes in this set matches /// the haystack given. /// /// This method should be preferred if you only need to test whether any /// of the regexes in the set match, but don't care about *which* /// regexes matched. This is because the underlying matching engine will /// quit immediately after seeing the first match instead of continuing to /// find all matches. /// /// Note that as with searches using [`Regex`](crate::Regex), the /// expression is unanchored by default. That is, if the regex does not /// start with `^` or `\A`, or end with `$` or `\z`, then it is permitted /// to match anywhere in the haystack. /// /// # Example /// /// Tests whether a set matches somewhere in a haystack: /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([r"\w+", r"\d+"]).unwrap(); /// assert!(set.is_match("foo")); /// assert!(!set.is_match("โ˜ƒ")); /// ``` #[inline] pub fn is_match(&self, haystack: &str) -> bool { self.is_match_at(haystack, 0) } /// Returns true if and only if one of the regexes in this set matches the /// haystack given, with the search starting at the offset given. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// This example shows the significance of `start`. Namely, consider a /// haystack `foobar` and a desire to execute a search starting at offset /// `3`. You could search a substring explicitly, but then the look-around /// assertions won't work correctly. Instead, you can use this method to /// specify the start position of a search. /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([r"\bbar\b", r"(?m)^bar$"]).unwrap(); /// let hay = "foobar"; /// // We get a match here, but it's probably not intended. /// assert!(set.is_match(&hay[3..])); /// // No match because the assertions take the context into account. /// assert!(!set.is_match_at(hay, 3)); /// ``` #[inline] pub fn is_match_at(&self, haystack: &str, start: usize) -> bool { self.meta.is_match(Input::new(haystack).span(start..haystack.len())) } /// Returns the set of regexes that match in the given haystack. /// /// The set returned contains the index of each regex that matches in /// the given haystack. The index is in correspondence with the order of /// regular expressions given to `RegexSet`'s constructor. /// /// The set can also be used to iterate over the matched indices. The order /// of iteration is always ascending with respect to the matching indices. /// /// Note that as with searches using [`Regex`](crate::Regex), the /// expression is unanchored by default. That is, if the regex does not /// start with `^` or `\A`, or end with `$` or `\z`, then it is permitted /// to match anywhere in the haystack.
/// /// # Example /// /// Tests which regular expressions match the given haystack: /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([ /// r"\w+", /// r"\d+", /// r"\pL+", /// r"foo", /// r"bar", /// r"barfoo", /// r"foobar", /// ]).unwrap(); /// let matches: Vec<_> = set.matches("foobar").into_iter().collect(); /// assert_eq!(matches, vec![0, 2, 3, 4, 6]); /// /// // You can also test whether a particular regex matched: /// let matches = set.matches("foobar"); /// assert!(!matches.matched(5)); /// assert!(matches.matched(6)); /// ``` #[inline] pub fn matches(&self, haystack: &str) -> SetMatches { self.matches_at(haystack, 0) } /// Returns the set of regexes that match in the given haystack. /// /// The set returned contains the index of each regex that matches in /// the given haystack. The index is in correspondence with the order of /// regular expressions given to `RegexSet`'s constructor. /// /// The set can also be used to iterate over the matched indices. The order /// of iteration is always ascending with respect to the matching indices. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// # Panics /// /// This panics when `start >= haystack.len() + 1`. /// /// # Example /// /// Tests which regular expressions match the given haystack: /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([r"\bbar\b", r"(?m)^bar$"]).unwrap(); /// let hay = "foobar"; /// // We get matches here, but it's probably not intended. /// let matches: Vec<_> = set.matches(&hay[3..]).into_iter().collect(); /// assert_eq!(matches, vec![0, 1]); /// // No matches because the assertions take the context into account. /// let matches: Vec<_> = set.matches_at(hay, 3).into_iter().collect(); /// assert_eq!(matches, vec![]); /// ``` #[inline] pub fn matches_at(&self, haystack: &str, start: usize) -> SetMatches { let input = Input::new(haystack).span(start..haystack.len()); let mut patset = PatternSet::new(self.meta.pattern_len()); self.meta.which_overlapping_matches(&input, &mut patset); SetMatches(patset) } /// Returns the same as matches, but starts the search at the given /// offset and stores the matches into the slice given. /// /// The significance of the starting point is that it takes the surrounding /// context into consideration. For example, the `\A` anchor can only /// match when `start == 0`. /// /// `matches` must have a length that is at least the number of regexes /// in this set. /// /// This method returns true if and only if at least one member of /// `matches` is true after executing the set against `haystack`. #[doc(hidden)] #[inline] pub fn matches_read_at( &self, matches: &mut [bool], haystack: &str, start: usize, ) -> bool { // This is pretty dumb. We should try to fix this, but the // regex-automata API doesn't provide a way to store matches in an // arbitrary &mut [bool]. Thankfully, this API is doc(hidden) and // thus not public... But regex-capi currently uses it. We should // fix regex-capi to use a PatternSet, maybe? Not sure... PatternSet // is in regex-automata, not regex. So maybe we should just accept a // 'SetMatches', which is basically just a newtype around PatternSet. 
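// In the meantime, the fallback below runs the overlapping search into a
// PatternSet and then copies the results into the caller's `&mut [bool]`.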
let mut patset = PatternSet::new(self.meta.pattern_len()); let mut input = Input::new(haystack); input.set_start(start); self.meta.which_overlapping_matches(&input, &mut patset); for pid in patset.iter() { matches[pid] = true; } !patset.is_empty() } /// An alias for `matches_read_at` to preserve backward compatibility. /// /// The `regex-capi` crate used this method, so to avoid breaking that /// crate, we continue to export it as an undocumented API. #[doc(hidden)] #[inline] pub fn read_matches_at( &self, matches: &mut [bool], haystack: &str, start: usize, ) -> bool { self.matches_read_at(matches, haystack, start) } /// Returns the total number of regexes in this set. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// assert_eq!(0, RegexSet::empty().len()); /// assert_eq!(1, RegexSet::new([r"[0-9]"]).unwrap().len()); /// assert_eq!(2, RegexSet::new([r"[0-9]", r"[a-z]"]).unwrap().len()); /// ``` #[inline] pub fn len(&self) -> usize { self.meta.pattern_len() } /// Returns `true` if this set contains no regexes. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// assert!(RegexSet::empty().is_empty()); /// assert!(!RegexSet::new([r"[0-9]"]).unwrap().is_empty()); /// ``` #[inline] pub fn is_empty(&self) -> bool { self.meta.pattern_len() == 0 } /// Returns the regex patterns that this regex set was constructed from. /// /// This function can be used to determine the pattern for a match. The /// slice returned has exactly as many patterns as were given to this regex /// set, and the order of the slice is the same as the order of the patterns /// provided to the set. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new(&[ /// r"\w+", /// r"\d+", /// r"\pL+", /// r"foo", /// r"bar", /// r"barfoo", /// r"foobar", /// ]).unwrap(); /// let matches: Vec<_> = set /// .matches("foobar") /// .into_iter() /// .map(|index| &set.patterns()[index]) /// .collect(); /// assert_eq!(matches, vec![r"\w+", r"\pL+", r"foo", r"bar", r"foobar"]); /// ``` #[inline] pub fn patterns(&self) -> &[String] { &self.patterns } } impl Default for RegexSet { fn default() -> Self { RegexSet::empty() } } /// A set of matches returned by a regex set. /// /// Values of this type are constructed by [`RegexSet::matches`]. #[derive(Clone, Debug)] pub struct SetMatches(PatternSet); impl SetMatches { /// Whether this set contains any matches. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new(&[ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// let matches = set.matches("foo@example.com"); /// assert!(matches.matched_any()); /// ``` #[inline] pub fn matched_any(&self) -> bool { !self.0.is_empty() } /// Whether all patterns in this set matched. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new(&[ /// r"^foo", /// r"[a-z]+\.com", /// ]).unwrap(); /// let matches = set.matches("foo.example.com"); /// assert!(matches.matched_all()); /// ``` pub fn matched_all(&self) -> bool { self.0.is_full() } /// Whether the regex at the given index matched. /// /// The index for a regex is determined by its insertion order upon the /// initial construction of a `RegexSet`, starting at `0`. /// /// # Panics /// /// If `index` is greater than or equal to the number of regexes in the /// original set that produced these matches. Equivalently, when `index` /// is greater than or equal to [`SetMatches::len`].
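///
/// When the index is not known to be in range, the panic can be avoided by
/// guarding on [`SetMatches::len`]. A minimal sketch, where the patterns
/// and the out-of-range index are purely illustrative:
///
/// ```
/// use regex::RegexSet;
///
/// let set = RegexSet::new([r"[0-9]", r"[a-z]"]).unwrap();
/// let matches = set.matches("a1");
/// // Out of range, but the short-circuiting `&&` means `matched` is
/// // never called with a bad index, so this does not panic.
/// let index = 5;
/// let ok = index < matches.len() && matches.matched(index);
/// assert!(!ok);
/// ```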
/// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// let matches = set.matches("example.com"); /// assert!(!matches.matched(0)); /// assert!(matches.matched(1)); /// ``` #[inline] pub fn matched(&self, index: usize) -> bool { self.0.contains(PatternID::new_unchecked(index)) } /// The total number of regexes in the set that created these matches. /// /// **WARNING:** This always returns the same value as [`RegexSet::len`]. /// In particular, it does *not* return the number of elements yielded by /// [`SetMatches::iter`]. The only way to determine the total number of /// matched regexes is to iterate over them. /// /// # Example /// /// Notice that this method returns the total number of regexes in the /// original set, and *not* the total number of regexes that matched. /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([ /// r"[a-z]+@[a-z]+\.(com|org|net)", /// r"[a-z]+\.(com|org|net)", /// ]).unwrap(); /// let matches = set.matches("example.com"); /// // Total number of patterns that matched. /// assert_eq!(1, matches.iter().count()); /// // Total number of patterns in the set. /// assert_eq!(2, matches.len()); /// ``` #[inline] pub fn len(&self) -> usize { self.0.capacity() } /// Returns an iterator over the indices of the regexes that matched. /// /// This will always produce matches in ascending order, where the index /// yielded corresponds to the index of the regex that matched with respect /// to its position when initially building the set. /// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([ /// r"[0-9]", /// r"[a-z]", /// r"[A-Z]", /// r"\p{Greek}", /// ]).unwrap(); /// let hay = "ฮฒa1"; /// let matches: Vec<_> = set.matches(hay).iter().collect(); /// assert_eq!(matches, vec![0, 1, 3]); /// ``` /// /// Note that `SetMatches` also implements the `IntoIterator` trait, so /// this method is not always needed. For example: /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([ /// r"[0-9]", /// r"[a-z]", /// r"[A-Z]", /// r"\p{Greek}", /// ]).unwrap(); /// let hay = "ฮฒa1"; /// let mut matches = vec![]; /// for index in set.matches(hay) { /// matches.push(index); /// } /// assert_eq!(matches, vec![0, 1, 3]); /// ``` #[inline] pub fn iter(&self) -> SetMatchesIter<'_> { SetMatchesIter(self.0.iter()) } } impl IntoIterator for SetMatches { type IntoIter = SetMatchesIntoIter; type Item = usize; fn into_iter(self) -> Self::IntoIter { let it = 0..self.0.capacity(); SetMatchesIntoIter { patset: self.0, it } } } impl<'a> IntoIterator for &'a SetMatches { type IntoIter = SetMatchesIter<'a>; type Item = usize; fn into_iter(self) -> Self::IntoIter { self.iter() } } /// An owned iterator over the set of matches from a regex set. /// /// This will always produce matches in ascending order of index, where the /// index corresponds to the index of the regex that matched with respect to /// its position when initially building the set. /// /// This iterator is created by calling `SetMatches::into_iter` via the /// `IntoIterator` trait. This is automatically done in `for` loops.
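///
/// Since this iterator implements `DoubleEndedIterator` (see the trait
/// impls below), the matched indices can also be drained from the back
/// with `rev()`. A minimal sketch, with illustrative patterns:
///
/// ```
/// use regex::RegexSet;
///
/// let set = RegexSet::new([r"[0-9]", r"[a-z]"]).unwrap();
/// // Both patterns match "a1", so reversing yields the indices in
/// // descending order.
/// let matches: Vec<_> = set.matches("a1").into_iter().rev().collect();
/// assert_eq!(matches, vec![1, 0]);
/// ```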
/// /// # Example /// /// ``` /// use regex::RegexSet; /// /// let set = RegexSet::new([ /// r"[0-9]", /// r"[a-z]", /// r"[A-Z]", /// r"\p{Greek}", /// ]).unwrap(); /// let hay = "ฮฒa1"; /// let mut matches = vec![]; /// for index in set.matches(hay) { /// matches.push(index); /// } /// assert_eq!(matches, vec![0, 1, 3]); /// ``` #[derive(Debug)] pub struct SetMatchesIntoIter { patset: PatternSet, it: core::ops::Range<usize>, } impl Iterator for SetMatchesIntoIter { type Item = usize; fn next(&mut self) -> Option<usize> { loop { let id = self.it.next()?; if self.patset.contains(PatternID::new_unchecked(id)) { return Some(id); } } } fn size_hint(&self) -> (usize, Option<usize>) { self.it.size_hint() } } impl DoubleEndedIterator for SetMatchesIntoIter { fn next_back(&mut self) -> Option<usize> { loop { let id = self.it.next_back()?; if self.patset.contains(PatternID::new_unchecked(id)) { return Some(id); } } } } impl core::iter::FusedIterator for SetMatchesIntoIter {} /// A borrowed iterator over the set of matches from a regex set. /// /// The lifetime `'a` refers to the lifetime of the [`SetMatches`] value that /// created this iterator. /// /// This will always produce matches in ascending order, where the index /// corresponds to the index of the regex that matched with respect to its /// position when initially building the set. /// /// This iterator is created by the [`SetMatches::iter`] method. #[derive(Clone, Debug)] pub struct SetMatchesIter<'a>(PatternSetIter<'a>); impl<'a> Iterator for SetMatchesIter<'a> { type Item = usize; fn next(&mut self) -> Option<usize> { self.0.next().map(|pid| pid.as_usize()) } fn size_hint(&self) -> (usize, Option<usize>) { self.0.size_hint() } } impl<'a> DoubleEndedIterator for SetMatchesIter<'a> { fn next_back(&mut self) -> Option<usize> { self.0.next_back().map(|pid| pid.as_usize()) } } impl<'a> core::iter::FusedIterator for SetMatchesIter<'a> {} impl core::fmt::Debug for RegexSet { fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result { write!(f, "RegexSet({:?})", self.patterns()) } }
regex-1.12.2/test
#!/bin/bash set -e # cd to the directory containing this crate's Cargo.toml so that we don't need # to pass --manifest-path to every `cargo` command. cd "$(dirname "$0")" # This is a convenience script for running a broad swath of tests across # features. We don't test the complete space, since the complete space is quite # large. Hopefully once we migrate the test suite to better infrastructure # (like regex-automata), we'll be able to test more of the space. echo "===== DEFAULT FEATURES =====" cargo test # no-std mode is annoyingly difficult to test. Currently, the integration tests # don't run. So for now, we just test that library tests run. (There aren't # many because `regex` is just a wrapper crate.)
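# A phase banner, mirroring the others in this script, makes the no-std run
# easier to spot in the output. (The banner text here is illustrative; only
# the `cargo test` invocation below matters.)
echo "===== NO-STD (LIB TESTS ONLY) ====="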
cargo test --no-default-features --lib echo "===== DOC TESTS =====" cargo test --doc features=( "std" "std unicode" "std unicode-perl" "std perf" "std perf-cache" "std perf-dfa" "std perf-inline" "std perf-literal" "std perf-dfa-full" "std perf-onepass" "std perf-backtrack" ) for f in "${features[@]}"; do echo "===== FEATURE: $f =====" cargo test --test integration --no-default-features --features "$f" done # And test the probably-forever-nightly-only 'pattern' feature... if rustc --version | grep -q nightly; then echo "===== FEATURE: std,pattern,unicode-perl =====" cargo test --test integration --no-default-features --features std,pattern,unicode-perl fi
regex-1.12.2/testdata/README.md
This directory contains a large suite of regex tests defined in a TOML format. They are used to drive tests in `tests/lib.rs`, `regex-automata/tests/lib.rs` and `regex-lite/tests/lib.rs`. See the [`regex-test`][regex-test] crate documentation for an explanation of the format and how it generates tests. The basic idea here is that we have many different regex engines but generally one set of tests. We want to be able to run those tests (or most of them) on every engine. Prior to `regex 1.9`, we used to do this with a hodge podge soup of macros and a different test executable for each engine. Overall, it took longer to compile, was harder to maintain, and made the test definitions themselves less clear. In `regex 1.9`, when we moved over to `regex-automata`, the situation got a lot worse because of an increase in the number of engines. So I devised an engine-independent format for testing regex patterns and their semantics. Note: the naming scheme used in these tests isn't terribly consistent. It would be great to fix that. [regex-test]: https://docs.rs/regex-test
regex-1.12.2/testdata/anchored.toml
# These tests are specifically geared toward searches with 'anchored = true'. # While they are interesting in their own right, they are particularly # important for testing the one-pass DFA since the one-pass DFA can't work in # unanchored contexts.
# # Note that "anchored" in this context does not mean "^". Anchored searches are # searches whose matches must begin at the start of the search, which may not # be at the start of the haystack. That's why anchored searches---and there are # some examples below---can still report multiple matches. This occurs when the # matches are adjacent to one another. [[test]] name = "greedy" regex = '(abc)+' haystack = "abcabcabc" matches = [ [[0, 9], [6, 9]], ] anchored = true # When an "earliest" search is used, greediness doesn't really exist because # matches are reported as soon as they are known. [[test]] name = "greedy-earliest" regex = '(abc)+' haystack = "abcabcabc" matches = [ [[0, 3], [0, 3]], [[3, 6], [3, 6]], [[6, 9], [6, 9]], ] anchored = true search-kind = "earliest" [[test]] name = "nongreedy" regex = '(abc)+?' haystack = "abcabcabc" matches = [ [[0, 3], [0, 3]], [[3, 6], [3, 6]], [[6, 9], [6, 9]], ] anchored = true # When "all" semantics are used, non-greediness doesn't exist since the longest # possible match is always taken. [[test]] name = "nongreedy-all" regex = '(abc)+?' haystack = "abcabcabc" matches = [ [[0, 9], [6, 9]], ] anchored = true match-kind = "all" [[test]] name = "word-boundary-unicode-01" regex = '\b\w+\b' haystack = 'ฮฒฮฒฮฒโ˜ƒ' matches = [[0, 6]] anchored = true [[test]] name = "word-boundary-nounicode-01" regex = '\b\w+\b' haystack = 'abcฮฒ' matches = [[0, 3]] anchored = true unicode = false # Tests that '.c' doesn't match 'abc' when performing an anchored search from # the beginning of the haystack. This test found two different bugs in the # PikeVM and the meta engine. [[test]] name = "no-match-at-start" regex = '.c' haystack = 'abc' matches = [] anchored = true # Like above, but at a non-zero start offset. [[test]] name = "no-match-at-start-bounds" regex = '.c' haystack = 'aabc' bounds = [1, 4] matches = [] anchored = true # This is like no-match-at-start, but hits the "reverse inner" optimization # inside the meta engine. (no-match-at-start hits the "reverse suffix" # optimization.) [[test]] name = "no-match-at-start-reverse-inner" regex = '.c[a-z]' haystack = 'abcz' matches = [] anchored = true # Like above, but at a non-zero start offset. [[test]] name = "no-match-at-start-reverse-inner-bounds" regex = '.c[a-z]' haystack = 'aabcz' bounds = [1, 5] matches = [] anchored = true # Same as no-match-at-start, but applies to the meta engine's "reverse # anchored" optimization. [[test]] name = "no-match-at-start-reverse-anchored" regex = '.c[a-z]$' haystack = 'abcz' matches = [] anchored = true # Like above, but at a non-zero start offset. [[test]] name = "no-match-at-start-reverse-anchored-bounds" regex = '.c[a-z]$' haystack = 'aabcz' bounds = [1, 5] matches = [] anchored = true
regex-1.12.2/testdata/bytes.toml
# These are tests specifically crafted for regexes that can match arbitrary # bytes. In some cases, we also test the Unicode variant, just because # it's good sense to do so.
But also, these tests aren't really about Unicode, # but whether matches are only reported at valid UTF-8 boundaries. For most # tests in this entire collection, utf8 = true. But for these tests, we use # utf8 = false. [[test]] name = "word-boundary-ascii" regex = ' \b' haystack = " ฮด" matches = [] unicode = false utf8 = false [[test]] name = "word-boundary-unicode" regex = ' \b' haystack = " ฮด" matches = [[0, 1]] unicode = true utf8 = false [[test]] name = "word-boundary-ascii-not" regex = ' \B' haystack = " ฮด" matches = [[0, 1]] unicode = false utf8 = false [[test]] name = "word-boundary-unicode-not" regex = ' \B' haystack = " ฮด" matches = [] unicode = true utf8 = false [[test]] name = "perl-word-ascii" regex = '\w+' haystack = "aฮด" matches = [[0, 1]] unicode = false utf8 = false [[test]] name = "perl-word-unicode" regex = '\w+' haystack = "aฮด" matches = [[0, 3]] unicode = true utf8 = false [[test]] name = "perl-decimal-ascii" regex = '\d+' haystack = "1เฅจเฅฉ9" matches = [[0, 1], [7, 8]] unicode = false utf8 = false [[test]] name = "perl-decimal-unicode" regex = '\d+' haystack = "1เฅจเฅฉ9" matches = [[0, 8]] unicode = true utf8 = false [[test]] name = "perl-whitespace-ascii" regex = '\s+' haystack = " \u1680" matches = [[0, 1]] unicode = false utf8 = false [[test]] name = "perl-whitespace-unicode" regex = '\s+' haystack = " \u1680" matches = [[0, 4]] unicode = true utf8 = false # The first `(.+)` matches two Unicode codepoints, but can't match the 5th # byte, which isn't valid UTF-8. The second (byte based) `(.+)` takes over and # matches. [[test]] name = "mixed-dot" regex = '(.+)(?-u)(.+)' haystack = '\xCE\x93\xCE\x94\xFF' matches = [ [[0, 5], [0, 4], [4, 5]], ] unescape = true unicode = true utf8 = false [[test]] name = "case-one-ascii" regex = 'a' haystack = "A" matches = [[0, 1]] case-insensitive = true unicode = false utf8 = false [[test]] name = "case-one-unicode" regex = 'a' haystack = "A" matches = [[0, 1]] case-insensitive = true unicode = true utf8 = false [[test]] name = "case-class-simple-ascii" regex = '[a-z]+' haystack = "AaAaA" matches = [[0, 5]] case-insensitive = true unicode = false utf8 = false [[test]] name = "case-class-ascii" regex = '[a-z]+' haystack = "aA\u212AaA" matches = [[0, 2], [5, 7]] case-insensitive = true unicode = false utf8 = false [[test]] name = "case-class-unicode" regex = '[a-z]+' haystack = "aA\u212AaA" matches = [[0, 7]] case-insensitive = true unicode = true utf8 = false [[test]] name = "negate-ascii" regex = '[^a]' haystack = "ฮด" matches = [[0, 1], [1, 2]] unicode = false utf8 = false [[test]] name = "negate-unicode" regex = '[^a]' haystack = "ฮด" matches = [[0, 2]] unicode = true utf8 = false # When utf8=true, this won't match, because the implicit '.*?' prefix is # Unicode aware and will refuse to match through invalid UTF-8 bytes. 
[[test]] name = "dotstar-prefix-ascii" regex = 'a' haystack = '\xFFa' matches = [[1, 2]] unescape = true unicode = false utf8 = false [[test]] name = "dotstar-prefix-unicode" regex = 'a' haystack = '\xFFa' matches = [[1, 2]] unescape = true unicode = true utf8 = false [[test]] name = "null-bytes" regex = '(?P<cstr>[^\x00]+)\x00' haystack = 'foo\x00' matches = [ [[0, 4], [0, 3]], ] unescape = true unicode = false utf8 = false [[test]] name = "invalid-utf8-anchor-100" regex = '\xCC?^' haystack = '\x8d#;\x1a\xa4s3\x05foobarX\\\x0f0t\xe4\x9b\xa4' matches = [[0, 0]] unescape = true unicode = false utf8 = false [[test]] name = "invalid-utf8-anchor-200" regex = '^\xf7|4\xff\d\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a##########[] d\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a\x8a##########\[] #####\x80\S7|$' haystack = '\x8d#;\x1a\xa4s3\x05foobarX\\\x0f0t\xe4\x9b\xa4' matches = [[22, 22]] unescape = true unicode = false utf8 = false [[test]] name = "invalid-utf8-anchor-300" regex = '^|ddp\xff\xffdddddlQd@\x80' haystack = '\x8d#;\x1a\xa4s3\x05foobarX\\\x0f0t\xe4\x9b\xa4' matches = [[0, 0]] unescape = true unicode = false utf8 = false [[test]] name = "word-boundary-ascii-100" regex = '\Bx\B' haystack = "รกxฮฒ" matches = [] unicode = false utf8 = false [[test]] name = "word-boundary-ascii-200" regex = '\B' haystack = "0\U0007EF5E" matches = [[2, 2], [3, 3], [4, 4], [5, 5]] unicode = false utf8 = false
regex-1.12.2/testdata/crazy.toml
[[test]] name = "nothing-empty" regex = [] haystack = "" matches = [] [[test]] name = "nothing-something" regex = [] haystack = "wat" matches = [] [[test]] name = "ranges" regex = '(?-u)\b(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b' haystack = "num: 255" matches = [[5, 8]] [[test]] name = "ranges-not" regex = '(?-u)\b(?:[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\b' haystack = "num: 256" matches = [] [[test]] name = "float1" regex = '[-+]?[0-9]*\.?[0-9]+' haystack = "0.1" matches = [[0, 3]] [[test]] name = "float2" regex = '[-+]?[0-9]*\.?[0-9]+' haystack = "0.1.2" matches = [[0, 3]] match-limit = 1 [[test]] name = "float3" regex = '[-+]?[0-9]*\.?[0-9]+' haystack = "a1.2" matches = [[1, 4]] [[test]] name = "float4" regex = '[-+]?[0-9]*\.?[0-9]+' haystack = "1.a" matches = [[0, 1]] [[test]] name = "float5" regex = '^[-+]?[0-9]*\.?[0-9]+$' haystack = "1.a" matches = [] [[test]] name = "email" regex = '(?i-u)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' haystack = "mine is jam.slam@gmail.com " matches = [[8, 26]] [[test]] name = "email-not" regex = '(?i-u)\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' haystack = "mine is jam.slam@gmail " matches = [] [[test]] name = "email-big" regex = '''[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?''' haystack = "mine is jam.slam@gmail.com " matches = [[8, 26]] [[test]] name = "date1" regex = '^(?:19|20)\d\d[- /.](?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])$' haystack = "1900-01-01" matches = [[0, 10]] unicode = false [[test]] name = "date2" regex = '^(?:19|20)\d\d[- /.](?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])$' haystack = "1900-00-01" matches = [] unicode = false [[test]] name = "date3" regex = '^(?:19|20)\d\d[- /.](?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])$' haystack = "1900-13-01" matches = [] unicode = false [[test]] name = "start-end-empty" regex = '^$' haystack = "" matches = [[0, 0]] [[test]] name = "start-end-empty-rev" regex = '$^' haystack = "" matches = [[0, 0]] [[test]] name = "start-end-empty-many-1" regex = '^$^$^$' haystack = "" matches = [[0, 0]] [[test]] name = "start-end-empty-many-2" regex = '^^^$$$' haystack = "" matches = [[0, 0]] [[test]] name = "start-end-empty-rep" regex = '(?:^$)*' haystack = "a\nb\nc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "start-end-empty-rep-rev" regex = '(?:$^)*' haystack = "a\nb\nc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "neg-class-letter" regex = '[^ac]' haystack = "acx" matches = [[2, 3]] [[test]] name = "neg-class-letter-comma" regex = '[^a,]' haystack = "a,x" matches = [[2, 3]] [[test]] name = "neg-class-letter-space" regex = '[^a[:space:]]' haystack = "a x" matches = [[2, 3]] [[test]] name = "neg-class-comma" regex = '[^,]' haystack = ",,x" matches = [[2, 3]] [[test]] name = "neg-class-space" regex = '[^[:space:]]' haystack = " a" matches = [[1, 2]] [[test]] name = "neg-class-space-comma" regex = '[^,[:space:]]' haystack = ", a" matches = [[2, 3]] [[test]] name = "neg-class-comma-space" regex = '[^[:space:],]' haystack = " ,a" matches = [[2, 3]] [[test]] name = "neg-class-ascii" regex = '[^[:alpha:]Z]' haystack = "A1" matches = [[1, 2]] [[test]] name = "lazy-many-many" regex = '(?:(?:.*)*?)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "lazy-many-optional" regex = '(?:(?:.?)*?)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "lazy-one-many-many" regex = '(?:(?:.*)+?)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "lazy-one-many-optional" regex = '(?:(?:.?)+?)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "lazy-range-min-many" regex = '(?:(?:.*){1,}?)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "lazy-range-many" regex = '(?:(?:.*){1,2}?)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "greedy-many-many" regex = '(?:(?:.*)*)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "greedy-many-optional" regex = '(?:(?:.?)*)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "greedy-one-many-many" regex = '(?:(?:.*)+)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "greedy-one-many-optional" regex = '(?:(?:.?)+)=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "greedy-range-min-many" regex = '(?:(?:.*){1,})=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "greedy-range-many" regex = '(?:(?:.*){1,2})=' haystack = "a=b" matches = [[0, 2]] [[test]] name = "empty1" regex = '' haystack = "" matches = [[0, 0]] [[test]] name = "empty2" regex = '' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty3" regex = '(?:)' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty4" regex = '(?:)*' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty5" regex = '(?:)+' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty6" regex = '(?:)?'
haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty7" regex = '(?:)(?:)' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty8" regex = '(?:)+|z' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty9" regex = 'z|(?:)+' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty10" regex = '(?:)+|b' haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty11" regex = 'b|(?:)+' haystack = "abc" matches = [[0, 0], [1, 2], [3, 3]]
regex-1.12.2/testdata/crlf.toml
# This is a basic test that checks ^ and $ treat \r\n as a single line # terminator. If ^ and $ only treated \n as a line terminator, then this would # only match 'xyz' at the end of the haystack. [[test]] name = "basic" regex = '(?mR)^[a-z]+$' haystack = "abc\r\ndef\r\nxyz" matches = [[0, 3], [5, 8], [10, 13]] # Tests that a CRLF-aware '^$' assertion does not match between CR and LF. [[test]] name = "start-end-non-empty" regex = '(?mR)^$' haystack = "abc\r\ndef\r\nxyz" matches = [] # Tests that a CRLF-aware '^$' assertion matches the empty string, just like # a non-CRLF-aware '^$' assertion. [[test]] name = "start-end-empty" regex = '(?mR)^$' haystack = "" matches = [[0, 0]] # Tests that a CRLF-aware '^$' assertion matches the empty string preceding # and following a line terminator. [[test]] name = "start-end-before-after" regex = '(?mR)^$' haystack = "\r\n" matches = [[0, 0], [2, 2]] # Tests that a CRLF-aware '^' assertion does not split a line terminator. [[test]] name = "start-no-split" regex = '(?mR)^' haystack = "abc\r\ndef\r\nxyz" matches = [[0, 0], [5, 5], [10, 10]] # Same as above, but with adjacent runs of line terminators. [[test]] name = "start-no-split-adjacent" regex = '(?mR)^' haystack = "\r\n\r\n\r\n" matches = [[0, 0], [2, 2], [4, 4], [6, 6]] # Same as above, but with adjacent runs of just carriage returns. [[test]] name = "start-no-split-adjacent-cr" regex = '(?mR)^' haystack = "\r\r\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] # Same as above, but with adjacent runs of just line feeds. [[test]] name = "start-no-split-adjacent-lf" regex = '(?mR)^' haystack = "\n\n\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] # Tests that a CRLF-aware '$' assertion does not split a line terminator. [[test]] name = "end-no-split" regex = '(?mR)$' haystack = "abc\r\ndef\r\nxyz" matches = [[3, 3], [8, 8], [13, 13]] # Same as above, but with adjacent runs of line terminators. [[test]] name = "end-no-split-adjacent" regex = '(?mR)$' haystack = "\r\n\r\n\r\n" matches = [[0, 0], [2, 2], [4, 4], [6, 6]] # Same as above, but with adjacent runs of just carriage returns. [[test]] name = "end-no-split-adjacent-cr" regex = '(?mR)$' haystack = "\r\r\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] # Same as above, but with adjacent runs of just line feeds.
[[test]] name = "end-no-split-adjacent-lf" regex = '(?mR)$' haystack = "\n\n\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] # Tests that '.' does not match either \r or \n when CRLF mode is enabled. Note # that this doesn't require multi-line mode to be enabled. [[test]] name = "dot-no-crlf" regex = '(?R).' haystack = "\r\n\r\n\r\n" matches = [] # This is a test that caught a bug in the one-pass DFA where it (amazingly) was # using 'is_end_lf' instead of 'is_end_crlf' here. It was probably a copy & # paste bug. We insert an empty capture group here because it provokes the meta # regex engine to first find a match and then trip over a panic because the # one-pass DFA erroneously says there is no match. [[test]] name = "onepass-wrong-crlf-with-capture" regex = '(?Rm:().$)' haystack = "ZZ\r" matches = [[[1, 2], [1, 1]]] # This is like onepass-wrong-crlf-with-capture above, except it sets up the # test so that it can be run by the one-pass DFA directly. (i.e., Make it # anchored and start the search at the right place.) [[test]] name = "onepass-wrong-crlf-anchored" regex = '(?Rm:.$)' haystack = "ZZ\r" matches = [[1, 2]] anchored = true bounds = [1, 3] ��������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/earliest.toml�����������������������������������������������������������������0000644�0000000�0000000�00000001534�10461020230�0015342�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������[[test]] name = "no-greedy-100" regex = 'a+' haystack = "aaa" matches = [[0, 1], [1, 2], [2, 3]] search-kind = "earliest" [[test]] name = "no-greedy-200" regex = 'abc+' haystack = "zzzabccc" matches = [[3, 6]] search-kind = "earliest" [[test]] name = "is-ungreedy" regex = 'a+?' 
haystack = "aaa" matches = [[0, 1], [1, 2], [2, 3]] search-kind = "earliest" [[test]] name = "look-start-test" regex = '^(abc|a)' haystack = "abc" matches = [ [[0, 1], [0, 1]], ] search-kind = "earliest" [[test]] name = "look-end-test" regex = '(abc|a)$' haystack = "abc" matches = [ [[0, 3], [0, 3]], ] search-kind = "earliest" [[test]] name = "no-leftmost-first-100" regex = 'abc|a' haystack = "abc" matches = [[0, 1]] search-kind = "earliest" [[test]] name = "no-leftmost-first-200" regex = 'aba|a' haystack = "aba" matches = [[0, 1], [2, 3]] search-kind = "earliest" ��������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/empty.toml��������������������������������������������������������������������0000644�0000000�0000000�00000003345�10461020230�0014672�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������[[test]] name = "100" regex = "|b" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "110" regex = "b|" haystack = "abc" matches = [[0, 0], [1, 2], [3, 3]] [[test]] name = "120" regex = "|z" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "130" regex = "z|" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "200" regex = "|" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "210" regex = "||" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "220" regex = "||b" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "230" regex = "b||" haystack = "abc" matches = [[0, 0], [1, 2], [3, 3]] [[test]] name = "240" regex = "||z" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "300" regex = "(?:)|b" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "310" regex = "b|(?:)" haystack = "abc" matches = [[0, 0], [1, 2], [3, 3]] [[test]] name = "320" regex = "(?:|)" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "330" regex = "(?:|)|z" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "400" regex = "a(?:)|b" haystack = "abc" matches = [[0, 1], [1, 2]] [[test]] name = "500" regex = "" haystack = "" matches = [[0, 0]] [[test]] name = "510" regex = "" haystack = "a" matches = [[0, 0], [1, 1]] [[test]] name = "520" regex = "" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "600" regex = '(?:|a)*' haystack = "aaa" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "610" regex = '(?:|a)+' haystack = "aaa" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/expensive.toml����������������������������������������������������������������0000644�0000000�0000000�00000002162�10461020230�0015536�0����������������������������������������������������������������������������������������������������ustar 
regex-1.12.2/testdata/expensive.toml

# This file represents tests that may be expensive to run on some regex
# engines. For example, tests that build a full DFA ahead of time and minimize
# it can take a horrendously long time on regexes that are large (or result in
# an explosion in the number of states). We group these tests together so that
# such engines can simply skip these tests.

# See: https://github.com/rust-lang/regex/issues/98
[[test]]
name = "regression-many-repeat-no-stack-overflow"
regex = '^.{1,2500}'
haystack = "a"
matches = [[0, 1]]

# This test is meant to blow the bounded backtracker's visited capacity. In
# order to do that, we need a somewhat sizeable regex. The purpose of this
# is to make sure there's at least one test that exercises this path in the
# backtracker. All other tests (at time of writing) are small enough that the
# backtracker can handle them fine.
[[test]]
name = "backtrack-blow-visited-capacity"
regex = '\pL{50}'
haystack = "abcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyabcdefghijklmnopqrstuvwxyZZ"
matches = [[0, 50], [50, 100], [100, 150]]
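The "visited capacity" mentioned above is the bounded backtracker's table of (state, haystack position) pairs; it caps how long a haystack a given regex can be searched with. A rough sketch of how that limit surfaces, assuming the `regex-automata` 0.4 API as a dependency (the exact error handling here is an assumption, not taken from this test data):

// A minimal sketch, assuming `regex-automata` is available.
use regex_automata::nfa::thompson::backtrack::BoundedBacktracker;

fn main() {
    // The backtracker's supported haystack length shrinks as the regex
    // grows; a search past that length fails with an error instead of
    // silently misbehaving.
    let re = BoundedBacktracker::new(r"\pL{50}").unwrap();
    let mut cache = re.create_cache();
    let haystack = "abcdefghijklmnopqrstuvwxy".repeat(6) + "ZZ";
    match re.try_find(&mut cache, haystack.as_str()) {
        Ok(Some(m)) => println!("match: {:?}", m.range()),
        Ok(None) => println!("no match"),
        Err(err) => println!("haystack too long for this regex: {}", err),
    }
}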
haystack = "aa" matches = [[0, 2]] [[test]] name = "10" regex = "(?U)(?-U)a+" haystack = "aa" matches = [[0, 2]] [[test]] name = "11" regex = '(?m)(?:^\d+$\n?)+' haystack = "123\n456\n789" matches = [[0, 11]] unicode = false �������������������������������������������������������������������������������������������������regex-1.12.2/testdata/fowler/basic.toml�������������������������������������������������������������0000644�0000000�0000000�00000071336�10461020230�0016120�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# !!! DO NOT EDIT !!! # Automatically generated by 'regex-cli generate fowler'. # Numbers in the test names correspond to the line number of the test from # the original dat file. [[test]] name = "basic3" regex = '''abracadabra$''' haystack = '''abracadabracadabra''' matches = [[[7, 18]]] match-limit = 1 [[test]] name = "basic4" regex = '''a...b''' haystack = '''abababbb''' matches = [[[2, 7]]] match-limit = 1 [[test]] name = "basic5" regex = '''XXXXXX''' haystack = '''..XXXXXX''' matches = [[[2, 8]]] match-limit = 1 [[test]] name = "basic6" regex = '''\)''' haystack = '''()''' matches = [[[1, 2]]] match-limit = 1 [[test]] name = "basic7" regex = '''a]''' haystack = '''a]a''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic9" regex = '''\}''' haystack = '''}''' matches = [[[0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic10" regex = '''\]''' haystack = ''']''' matches = [[[0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic12" regex = ''']''' haystack = ''']''' matches = [[[0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic15" regex = '''^a''' haystack = '''ax''' matches = [[[0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic16" regex = '''\^a''' haystack = '''a^a''' matches = [[[1, 3]]] match-limit = 1 [[test]] name = "basic17" regex = '''a\^''' haystack = '''a^''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic18" regex = '''a$''' haystack = '''aa''' matches = [[[1, 2]]] match-limit = 1 [[test]] name = "basic19" regex = '''a\$''' haystack = '''a$''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic20" regex = '''^$''' haystack = '''''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic21" regex = '''$^''' haystack = '''''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic22" regex = '''a($)''' haystack = '''aa''' matches = [[[1, 2], [2, 2]]] match-limit = 1 [[test]] name = "basic23" regex = '''a*(^a)''' haystack = '''aa''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic24" regex = '''(..)*(...)*''' haystack = '''a''' matches = [[[0, 0], [], []]] match-limit = 1 anchored = true [[test]] name = "basic25" regex = '''(..)*(...)*''' haystack = '''abcd''' matches = [[[0, 4], [2, 4], []]] match-limit = 1 anchored = true [[test]] name = "basic26" regex = '''(ab|a)(bc|c)''' haystack = '''abc''' matches = [[[0, 3], [0, 2], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "basic27" regex = '''(ab)c|abc''' haystack = '''abc''' matches = [[[0, 3], [0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic28" regex = '''a{0}b''' haystack = '''ab''' matches = 
[[[1, 2]]] match-limit = 1 [[test]] name = "basic29" regex = '''(a*)(b?)(b+)b{3}''' haystack = '''aaabbbbbbb''' matches = [[[0, 10], [0, 3], [3, 4], [4, 7]]] match-limit = 1 anchored = true [[test]] name = "basic30" regex = '''(a*)(b{0,1})(b{1,})b{3}''' haystack = '''aaabbbbbbb''' matches = [[[0, 10], [0, 3], [3, 4], [4, 7]]] match-limit = 1 anchored = true [[test]] name = "basic32" regex = '''((a|a)|a)''' haystack = '''a''' matches = [[[0, 1], [0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic33" regex = '''(a*)(a|aa)''' haystack = '''aaaa''' matches = [[[0, 4], [0, 3], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "basic34" regex = '''a*(a.|aa)''' haystack = '''aaaa''' matches = [[[0, 4], [2, 4]]] match-limit = 1 anchored = true [[test]] name = "basic35" regex = '''a(b)|c(d)|a(e)f''' haystack = '''aef''' matches = [[[0, 3], [], [], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "basic36" regex = '''(a|b)?.*''' haystack = '''b''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic37" regex = '''(a|b)c|a(b|c)''' haystack = '''ac''' matches = [[[0, 2], [0, 1], []]] match-limit = 1 anchored = true [[test]] name = "basic38" regex = '''(a|b)c|a(b|c)''' haystack = '''ab''' matches = [[[0, 2], [], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "basic39" regex = '''(a|b)*c|(a|ab)*c''' haystack = '''abc''' matches = [[[0, 3], [1, 2], []]] match-limit = 1 anchored = true [[test]] name = "basic40" regex = '''(a|b)*c|(a|ab)*c''' haystack = '''xc''' matches = [[[1, 2], [], []]] match-limit = 1 [[test]] name = "basic41" regex = '''(.a|.b).*|.*(.a|.b)''' haystack = '''xa''' matches = [[[0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "basic42" regex = '''a?(ab|ba)ab''' haystack = '''abab''' matches = [[[0, 4], [0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic43" regex = '''a?(ac{0}b|ba)ab''' haystack = '''abab''' matches = [[[0, 4], [0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic44" regex = '''ab|abab''' haystack = '''abbabab''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic45" regex = '''aba|bab|bba''' haystack = '''baaabbbaba''' matches = [[[5, 8]]] match-limit = 1 [[test]] name = "basic46" regex = '''aba|bab''' haystack = '''baaabbbaba''' matches = [[[6, 9]]] match-limit = 1 [[test]] name = "basic47" regex = '''(aa|aaa)*|(a|aaaaa)''' haystack = '''aa''' matches = [[[0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "basic48" regex = '''(a.|.a.)*|(a|.a...)''' haystack = '''aa''' matches = [[[0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "basic49" regex = '''ab|a''' haystack = '''xabc''' matches = [[[1, 3]]] match-limit = 1 [[test]] name = "basic50" regex = '''ab|a''' haystack = '''xxabc''' matches = [[[2, 4]]] match-limit = 1 [[test]] name = "basic51" regex = '''(Ab|cD)*''' haystack = '''aBcD''' matches = [[[0, 4], [2, 4]]] match-limit = 1 anchored = true case-insensitive = true [[test]] name = "basic52" regex = '''[^-]''' haystack = '''--a''' matches = [[[2, 3]]] match-limit = 1 [[test]] name = "basic53" regex = '''[a-]*''' haystack = '''--a''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic54" regex = '''[a-m-]*''' haystack = '''--amoma--''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic55" regex = ''':::1:::0:|:::1:1:0:''' haystack = ''':::0:::1:::1:::0:''' matches = [[[8, 17]]] match-limit = 1 [[test]] name = "basic56" regex = 
''':::1:::0:|:::1:1:1:''' haystack = ''':::0:::1:::1:::0:''' matches = [[[8, 17]]] match-limit = 1 [[test]] name = "basic57" regex = '''[[:upper:]]''' haystack = '''A''' matches = [[[0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic58" regex = '''[[:lower:]]+''' haystack = '''`az{''' matches = [[[1, 3]]] match-limit = 1 [[test]] name = "basic59" regex = '''[[:upper:]]+''' haystack = '''@AZ[''' matches = [[[1, 3]]] match-limit = 1 [[test]] name = "basic65" regex = '''\n''' haystack = '''\n''' matches = [[[0, 1]]] match-limit = 1 anchored = true unescape = true [[test]] name = "basic66" regex = '''\n''' haystack = '''\n''' matches = [[[0, 1]]] match-limit = 1 anchored = true unescape = true [[test]] name = "basic67" regex = '''[^a]''' haystack = '''\n''' matches = [[[0, 1]]] match-limit = 1 anchored = true unescape = true [[test]] name = "basic68" regex = '''\na''' haystack = '''\na''' matches = [[[0, 2]]] match-limit = 1 anchored = true unescape = true [[test]] name = "basic69" regex = '''(a)(b)(c)''' haystack = '''abc''' matches = [[[0, 3], [0, 1], [1, 2], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "basic70" regex = '''xxx''' haystack = '''xxx''' matches = [[[0, 3]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "basic72" regex = '''(?:^|[ (,;])(?:(?:(?:[Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))(?:[^0-9]|$)''' haystack = '''feb 6,''' matches = [[[0, 6]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "basic74" regex = '''(?:^|[ (,;])(?:(?:(?:[Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))(?:[^0-9]|$)''' haystack = '''2/7''' matches = [[[0, 3]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "basic76" regex = '''(?:^|[ (,;])(?:(?:(?:[Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))(?:[^0-9]|$)''' haystack = '''feb 1,Feb 6''' matches = [[[5, 11]]] match-limit = 1 # Test added by Rust regex project. [[test]] name = "basic78" regex = '''(((?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:x))))))))))))))))))))))))))))))''' haystack = '''x''' matches = [[[0, 1], [0, 1], [0, 1]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "basic80" regex = '''(((?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:x))))))))))))))))))))))))))))))*''' haystack = '''xx''' matches = [[[0, 2], [1, 2], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "basic81" regex = '''a?(ab|ba)*''' haystack = '''ababababababababababababababababababababababababababababababababababababababababa''' matches = [[[0, 81], [79, 81]]] match-limit = 1 anchored = true [[test]] name = "basic82" regex = '''abaa|abbaa|abbbaa|abbbbaa''' haystack = '''ababbabbbabbbabbbbabbbbaa''' matches = [[[18, 25]]] match-limit = 1 [[test]] name = "basic83" regex = '''abaa|abbaa|abbbaa|abbbbaa''' haystack = '''ababbabbbabbbabbbbabaa''' matches = [[[18, 22]]] match-limit = 1 [[test]] name = "basic84" regex = '''aaac|aabc|abac|abbc|baac|babc|bbac|bbbc''' haystack = '''baaabbbabac''' matches = [[[7, 11]]] match-limit = 1 # Test added by Rust regex project. 
[[test]] name = "basic86" regex = '''.*''' haystack = '''\x01\x7f''' matches = [[[0, 2]]] match-limit = 1 anchored = true unescape = true [[test]] name = "basic87" regex = '''aaaa|bbbb|cccc|ddddd|eeeeee|fffffff|gggg|hhhh|iiiii|jjjjj|kkkkk|llll''' haystack = '''XaaaXbbbXcccXdddXeeeXfffXgggXhhhXiiiXjjjXkkkXlllXcbaXaaaa''' matches = [[[53, 57]]] match-limit = 1 [[test]] name = "basic89" regex = '''a*a*a*a*a*b''' haystack = '''aaaaaaaaab''' matches = [[[0, 10]]] match-limit = 1 anchored = true [[test]] name = "basic90" regex = '''^''' haystack = '''''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic91" regex = '''$''' haystack = '''''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic92" regex = '''^$''' haystack = '''''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic93" regex = '''^a$''' haystack = '''a''' matches = [[[0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic94" regex = '''abc''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic95" regex = '''abc''' haystack = '''xabcy''' matches = [[[1, 4]]] match-limit = 1 [[test]] name = "basic96" regex = '''abc''' haystack = '''ababc''' matches = [[[2, 5]]] match-limit = 1 [[test]] name = "basic97" regex = '''ab*c''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic98" regex = '''ab*bc''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic99" regex = '''ab*bc''' haystack = '''abbc''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic100" regex = '''ab*bc''' haystack = '''abbbbc''' matches = [[[0, 6]]] match-limit = 1 anchored = true [[test]] name = "basic101" regex = '''ab+bc''' haystack = '''abbc''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic102" regex = '''ab+bc''' haystack = '''abbbbc''' matches = [[[0, 6]]] match-limit = 1 anchored = true [[test]] name = "basic103" regex = '''ab?bc''' haystack = '''abbc''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic104" regex = '''ab?bc''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic105" regex = '''ab?c''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic106" regex = '''^abc$''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic107" regex = '''^abc''' haystack = '''abcc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic108" regex = '''abc$''' haystack = '''aabc''' matches = [[[1, 4]]] match-limit = 1 [[test]] name = "basic109" regex = '''^''' haystack = '''abc''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic110" regex = '''$''' haystack = '''abc''' matches = [[[3, 3]]] match-limit = 1 [[test]] name = "basic111" regex = '''a.c''' haystack = '''abc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic112" regex = '''a.c''' haystack = '''axc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic113" regex = '''a.*c''' haystack = '''axyzc''' matches = [[[0, 5]]] match-limit = 1 anchored = true [[test]] name = "basic114" regex = '''a[bc]d''' haystack = '''abd''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic115" regex = '''a[b-d]e''' haystack = '''ace''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name 
= "basic116" regex = '''a[b-d]''' haystack = '''aac''' matches = [[[1, 3]]] match-limit = 1 [[test]] name = "basic117" regex = '''a[-b]''' haystack = '''a-''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic118" regex = '''a[b-]''' haystack = '''a-''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic119" regex = '''a]''' haystack = '''a]''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic120" regex = '''a[]]b''' haystack = '''a]b''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic121" regex = '''a[^bc]d''' haystack = '''aed''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic122" regex = '''a[^-b]c''' haystack = '''adc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic123" regex = '''a[^]b]c''' haystack = '''adc''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic124" regex = '''ab|cd''' haystack = '''abc''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic125" regex = '''ab|cd''' haystack = '''abcd''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic126" regex = '''a\(b''' haystack = '''a(b''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic127" regex = '''a\(*b''' haystack = '''ab''' matches = [[[0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic128" regex = '''a\(*b''' haystack = '''a((b''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic129" regex = '''((a))''' haystack = '''abc''' matches = [[[0, 1], [0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic130" regex = '''(a)b(c)''' haystack = '''abc''' matches = [[[0, 3], [0, 1], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "basic131" regex = '''a+b+c''' haystack = '''aabbabc''' matches = [[[4, 7]]] match-limit = 1 [[test]] name = "basic132" regex = '''a*''' haystack = '''aaa''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic133" regex = '''(a*)*''' haystack = '''-''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic134" regex = '''(a*)+''' haystack = '''-''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic135" regex = '''(a*|b)*''' haystack = '''-''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic136" regex = '''(a+|b)*''' haystack = '''ab''' matches = [[[0, 2], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "basic137" regex = '''(a+|b)+''' haystack = '''ab''' matches = [[[0, 2], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "basic138" regex = '''(a+|b)?''' haystack = '''ab''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic139" regex = '''[^ab]*''' haystack = '''cde''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic140" regex = '''(^)*''' haystack = '''-''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic141" regex = '''a*''' haystack = '''''' matches = [[[0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic142" regex = '''([abc])*d''' haystack = '''abbbcd''' matches = [[[0, 6], [4, 5]]] match-limit = 1 anchored = true [[test]] name = "basic143" regex = '''([abc])*bcd''' haystack = '''abcd''' matches = [[[0, 4], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic144" regex = '''a|b|c|d|e''' haystack = '''e''' matches = [[[0, 1]]] 
match-limit = 1 anchored = true [[test]] name = "basic145" regex = '''(a|b|c|d|e)f''' haystack = '''ef''' matches = [[[0, 2], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic146" regex = '''((a*|b))*''' haystack = '''-''' matches = [[[0, 0], [0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "basic147" regex = '''abcd*efg''' haystack = '''abcdefg''' matches = [[[0, 7]]] match-limit = 1 anchored = true [[test]] name = "basic148" regex = '''ab*''' haystack = '''xabyabbbz''' matches = [[[1, 3]]] match-limit = 1 [[test]] name = "basic149" regex = '''ab*''' haystack = '''xayabbbz''' matches = [[[1, 2]]] match-limit = 1 [[test]] name = "basic150" regex = '''(ab|cd)e''' haystack = '''abcde''' matches = [[[2, 5], [2, 4]]] match-limit = 1 [[test]] name = "basic151" regex = '''[abhgefdc]ij''' haystack = '''hij''' matches = [[[0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic152" regex = '''(a|b)c*d''' haystack = '''abcd''' matches = [[[1, 4], [1, 2]]] match-limit = 1 [[test]] name = "basic153" regex = '''(ab|ab*)bc''' haystack = '''abc''' matches = [[[0, 3], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic154" regex = '''a([bc]*)c*''' haystack = '''abc''' matches = [[[0, 3], [1, 3]]] match-limit = 1 anchored = true [[test]] name = "basic155" regex = '''a([bc]*)(c*d)''' haystack = '''abcd''' matches = [[[0, 4], [1, 3], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "basic156" regex = '''a([bc]+)(c*d)''' haystack = '''abcd''' matches = [[[0, 4], [1, 3], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "basic157" regex = '''a([bc]*)(c+d)''' haystack = '''abcd''' matches = [[[0, 4], [1, 2], [2, 4]]] match-limit = 1 anchored = true [[test]] name = "basic158" regex = '''a[bcd]*dcdcde''' haystack = '''adcdcde''' matches = [[[0, 7]]] match-limit = 1 anchored = true [[test]] name = "basic159" regex = '''(ab|a)b*c''' haystack = '''abc''' matches = [[[0, 3], [0, 2]]] match-limit = 1 anchored = true [[test]] name = "basic160" regex = '''((a)(b)c)(d)''' haystack = '''abcd''' matches = [[[0, 4], [0, 3], [0, 1], [1, 2], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "basic161" regex = '''[A-Za-z_][A-Za-z0-9_]*''' haystack = '''alpha''' matches = [[[0, 5]]] match-limit = 1 anchored = true [[test]] name = "basic162" regex = '''^a(bc+|b[eh])g|.h$''' haystack = '''abh''' matches = [[[1, 3], []]] match-limit = 1 [[test]] name = "basic163" regex = '''(bc+d$|ef*g.|h?i(j|k))''' haystack = '''effgz''' matches = [[[0, 5], [0, 5], []]] match-limit = 1 anchored = true [[test]] name = "basic164" regex = '''(bc+d$|ef*g.|h?i(j|k))''' haystack = '''ij''' matches = [[[0, 2], [0, 2], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "basic165" regex = '''(bc+d$|ef*g.|h?i(j|k))''' haystack = '''reffgz''' matches = [[[1, 6], [1, 6], []]] match-limit = 1 [[test]] name = "basic166" regex = '''(((((((((a)))))))))''' haystack = '''a''' matches = [[[0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "basic167" regex = '''multiple words''' haystack = '''multiple words yeah''' matches = [[[0, 14]]] match-limit = 1 anchored = true [[test]] name = "basic168" regex = '''(.*)c(.*)''' haystack = '''abcde''' matches = [[[0, 5], [0, 2], [3, 5]]] match-limit = 1 anchored = true [[test]] name = "basic169" regex = '''abcd''' haystack = '''abcd''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic170" regex = '''a(bc)d''' haystack = '''abcd''' matches = [[[0, 
4], [1, 3]]] match-limit = 1 anchored = true [[test]] name = "basic171" regex = '''a[\x01-\x03]?c''' haystack = '''a\x02c''' matches = [[[0, 3]]] match-limit = 1 anchored = true unescape = true [[test]] name = "basic172" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Qaddafi''' matches = [[[0, 15], [], [10, 12]]] match-limit = 1 anchored = true [[test]] name = "basic173" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Mo'ammar Gadhafi''' matches = [[[0, 16], [], [11, 13]]] match-limit = 1 anchored = true [[test]] name = "basic174" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Kaddafi''' matches = [[[0, 15], [], [10, 12]]] match-limit = 1 anchored = true [[test]] name = "basic175" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Qadhafi''' matches = [[[0, 15], [], [10, 12]]] match-limit = 1 anchored = true [[test]] name = "basic176" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Gadafi''' matches = [[[0, 14], [], [10, 11]]] match-limit = 1 anchored = true [[test]] name = "basic177" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Mu'ammar Qadafi''' matches = [[[0, 15], [], [11, 12]]] match-limit = 1 anchored = true [[test]] name = "basic178" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Moamar Gaddafi''' matches = [[[0, 14], [], [9, 11]]] match-limit = 1 anchored = true [[test]] name = "basic179" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Mu'ammar Qadhdhafi''' matches = [[[0, 18], [], [13, 15]]] match-limit = 1 anchored = true [[test]] name = "basic180" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Khaddafi''' matches = [[[0, 16], [], [11, 13]]] match-limit = 1 anchored = true [[test]] name = "basic181" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Ghaddafy''' matches = [[[0, 16], [], [11, 13]]] match-limit = 1 anchored = true [[test]] name = "basic182" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Ghadafi''' matches = [[[0, 15], [], [11, 12]]] match-limit = 1 anchored = true [[test]] name = "basic183" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Ghaddafi''' matches = [[[0, 16], [], [11, 13]]] match-limit = 1 anchored = true [[test]] name = "basic184" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muamar Kaddafi''' matches = [[[0, 14], [], [9, 11]]] match-limit = 1 anchored = true [[test]] name = "basic185" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Quathafi''' matches = [[[0, 16], [], [11, 13]]] match-limit = 1 anchored = true [[test]] name = "basic186" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Muammar Gheddafi''' matches = [[[0, 16], [], [11, 13]]] match-limit = 1 anchored = true [[test]] name = "basic187" regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Moammar Khadafy''' matches = [[[0, 15], [], [11, 12]]] match-limit = 1 anchored = true [[test]] name = "basic188" 
regex = '''M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy]''' haystack = '''Moammar Qudhafi''' matches = [[[0, 15], [], [10, 12]]] match-limit = 1 anchored = true [[test]] name = "basic189" regex = '''a+(b|c)*d+''' haystack = '''aabcdd''' matches = [[[0, 6], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "basic190" regex = '''^.+$''' haystack = '''vivi''' matches = [[[0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic191" regex = '''^(.+)$''' haystack = '''vivi''' matches = [[[0, 4], [0, 4]]] match-limit = 1 anchored = true [[test]] name = "basic192" regex = '''^([^!.]+).att.com!(.+)$''' haystack = '''gryphon.att.com!eby''' matches = [[[0, 19], [0, 7], [16, 19]]] match-limit = 1 anchored = true [[test]] name = "basic193" regex = '''^([^!]+!)?([^!]+)$''' haystack = '''bas''' matches = [[[0, 3], [], [0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic194" regex = '''^([^!]+!)?([^!]+)$''' haystack = '''bar!bas''' matches = [[[0, 7], [0, 4], [4, 7]]] match-limit = 1 anchored = true [[test]] name = "basic195" regex = '''^([^!]+!)?([^!]+)$''' haystack = '''foo!bas''' matches = [[[0, 7], [0, 4], [4, 7]]] match-limit = 1 anchored = true [[test]] name = "basic196" regex = '''^.+!([^!]+!)([^!]+)$''' haystack = '''foo!bar!bas''' matches = [[[0, 11], [4, 8], [8, 11]]] match-limit = 1 anchored = true [[test]] name = "basic197" regex = '''((foo)|(bar))!bas''' haystack = '''bar!bas''' matches = [[[0, 7], [0, 3], [], [0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic198" regex = '''((foo)|(bar))!bas''' haystack = '''foo!bar!bas''' matches = [[[4, 11], [4, 7], [], [4, 7]]] match-limit = 1 [[test]] name = "basic199" regex = '''((foo)|(bar))!bas''' haystack = '''foo!bas''' matches = [[[0, 7], [0, 3], [0, 3], []]] match-limit = 1 anchored = true [[test]] name = "basic200" regex = '''((foo)|bar)!bas''' haystack = '''bar!bas''' matches = [[[0, 7], [0, 3], []]] match-limit = 1 anchored = true [[test]] name = "basic201" regex = '''((foo)|bar)!bas''' haystack = '''foo!bar!bas''' matches = [[[4, 11], [4, 7], []]] match-limit = 1 [[test]] name = "basic202" regex = '''((foo)|bar)!bas''' haystack = '''foo!bas''' matches = [[[0, 7], [0, 3], [0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic203" regex = '''(foo|(bar))!bas''' haystack = '''bar!bas''' matches = [[[0, 7], [0, 3], [0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic204" regex = '''(foo|(bar))!bas''' haystack = '''foo!bar!bas''' matches = [[[4, 11], [4, 7], [4, 7]]] match-limit = 1 [[test]] name = "basic205" regex = '''(foo|(bar))!bas''' haystack = '''foo!bas''' matches = [[[0, 7], [0, 3], []]] match-limit = 1 anchored = true [[test]] name = "basic206" regex = '''(foo|bar)!bas''' haystack = '''bar!bas''' matches = [[[0, 7], [0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic207" regex = '''(foo|bar)!bas''' haystack = '''foo!bar!bas''' matches = [[[4, 11], [4, 7]]] match-limit = 1 [[test]] name = "basic208" regex = '''(foo|bar)!bas''' haystack = '''foo!bas''' matches = [[[0, 7], [0, 3]]] match-limit = 1 anchored = true [[test]] name = "basic209" regex = '''^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$''' haystack = '''foo!bar!bas''' matches = [[[0, 11], [0, 11], [], [], [4, 8], [8, 11]]] match-limit = 1 anchored = true [[test]] name = "basic210" regex = '''^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$''' haystack = '''bas''' matches = [[[0, 3], [], [0, 3], [], []]] match-limit = 1 anchored = true [[test]] name = "basic211" regex = 
'''^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$'''
haystack = '''bar!bas'''
matches = [[[0, 7], [0, 4], [4, 7], [], []]]
match-limit = 1
anchored = true

[[test]]
name = "basic212"
regex = '''^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$'''
haystack = '''foo!bar!bas'''
matches = [[[0, 11], [], [], [4, 8], [8, 11]]]
match-limit = 1
anchored = true

[[test]]
name = "basic213"
regex = '''^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$'''
haystack = '''foo!bas'''
matches = [[[0, 7], [0, 4], [4, 7], [], []]]
match-limit = 1
anchored = true

[[test]]
name = "basic214"
regex = '''^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$'''
haystack = '''bas'''
matches = [[[0, 3], [0, 3], [], [0, 3], [], []]]
match-limit = 1
anchored = true

[[test]]
name = "basic215"
regex = '''^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$'''
haystack = '''bar!bas'''
matches = [[[0, 7], [0, 7], [0, 4], [4, 7], [], []]]
match-limit = 1
anchored = true

[[test]]
name = "basic216"
regex = '''^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$'''
haystack = '''foo!bar!bas'''
matches = [[[0, 11], [0, 11], [], [], [4, 8], [8, 11]]]
match-limit = 1
anchored = true

[[test]]
name = "basic217"
regex = '''^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$'''
haystack = '''foo!bas'''
matches = [[[0, 7], [0, 7], [0, 4], [4, 7], [], []]]
match-limit = 1
anchored = true

[[test]]
name = "basic218"
regex = '''.*(/XXX).*'''
haystack = '''/XXX'''
matches = [[[0, 4], [0, 4]]]
match-limit = 1
anchored = true

[[test]]
name = "basic219"
regex = '''.*(\\XXX).*'''
haystack = '''\XXX'''
matches = [[[0, 4], [0, 4]]]
match-limit = 1
anchored = true

[[test]]
name = "basic220"
regex = '''\\XXX'''
haystack = '''\XXX'''
matches = [[[0, 4]]]
match-limit = 1
anchored = true

[[test]]
name = "basic221"
regex = '''.*(/000).*'''
haystack = '''/000'''
matches = [[[0, 4], [0, 4]]]
match-limit = 1
anchored = true

[[test]]
name = "basic222"
regex = '''.*(\\000).*'''
haystack = '''\000'''
matches = [[[0, 4], [0, 4]]]
match-limit = 1
anchored = true

[[test]]
name = "basic223"
regex = '''\\000'''
haystack = '''\000'''
matches = [[[0, 4]]]
match-limit = 1
anchored = true
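The triple-nested `matches` arrays above encode one span per capture group: the overall match first, then each numbered group, with `[]` for a group that takes no part in the match. A minimal sketch of how the `basic26` case looks through the `regex` crate's capture API (assuming `regex` as a dependency; illustrative only):

use regex::Regex; // assumed dependency

fn main() {
    // basic26: '(ab|a)(bc|c)' on "abc" => [[0, 3], [0, 2], [2, 3]].
    // Leftmost-first semantics pick 'ab' for group 1, forcing 'c' for
    // group 2.
    let re = Regex::new(r"(ab|a)(bc|c)").unwrap();
    let caps = re.captures("abc").unwrap();
    assert_eq!(caps.get(0).map(|m| (m.start(), m.end())), Some((0, 3)));
    assert_eq!(caps.get(1).map(|m| (m.start(), m.end())), Some((0, 2)));
    assert_eq!(caps.get(2).map(|m| (m.start(), m.end())), Some((2, 3)));
}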
To re-generate the TOML files, run the following from the root of this repository: regex-cli generate fowler tests/data/fowler tests/data/fowler/dat/*.dat This assumes that you have 'regex-cli' installed. See 'regex-cli/README.md' from the root of the repository for more information. This brings the Fowler tests into a more "sensible" structured format in which other tests can be written such that they aren't write-only. �������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/fowler/dat/basic.dat����������������������������������������������������������0000644�0000000�0000000�00000021744�10461020230�0016463�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������NOTE all standard compliant implementations should pass these : 2002-05-31 BE abracadabra$ abracadabracadabra (7,18) BE a...b abababbb (2,7) BE XXXXXX ..XXXXXX (2,8) E \) () (1,2) BE a] a]a (0,2) B } } (0,1) E \} } (0,1) BE \] ] (0,1) B ] ] (0,1) E ] ] (0,1) B { { (0,1) B } } (0,1) BE ^a ax (0,1) BE \^a a^a (1,3) BE a\^ a^ (0,2) BE a$ aa (1,2) BE a\$ a$ (0,2) BE ^$ NULL (0,0) E $^ NULL (0,0) E a($) aa (1,2)(2,2) E a*(^a) aa (0,1)(0,1) E (..)*(...)* a (0,0) E (..)*(...)* abcd (0,4)(2,4) E (ab|a)(bc|c) abc (0,3)(0,2)(2,3) E (ab)c|abc abc (0,3)(0,2) E a{0}b ab (1,2) E (a*)(b?)(b+)b{3} aaabbbbbbb (0,10)(0,3)(3,4)(4,7) E (a*)(b{0,1})(b{1,})b{3} aaabbbbbbb (0,10)(0,3)(3,4)(4,7) E a{9876543210} NULL BADBR E ((a|a)|a) a (0,1)(0,1)(0,1) E (a*)(a|aa) aaaa (0,4)(0,3)(3,4) E a*(a.|aa) aaaa (0,4)(2,4) E a(b)|c(d)|a(e)f aef (0,3)(?,?)(?,?)(1,2) E (a|b)?.* b (0,1)(0,1) E (a|b)c|a(b|c) ac (0,2)(0,1) E (a|b)c|a(b|c) ab (0,2)(?,?)(1,2) E (a|b)*c|(a|ab)*c abc (0,3)(1,2) E (a|b)*c|(a|ab)*c xc (1,2) E (.a|.b).*|.*(.a|.b) xa (0,2)(0,2) E a?(ab|ba)ab abab (0,4)(0,2) E a?(ac{0}b|ba)ab abab (0,4)(0,2) E ab|abab abbabab (0,2) E aba|bab|bba baaabbbaba (5,8) E aba|bab baaabbbaba (6,9) E (aa|aaa)*|(a|aaaaa) aa (0,2)(0,2) E (a.|.a.)*|(a|.a...) 
aa (0,2)(0,2) E ab|a xabc (1,3) E ab|a xxabc (2,4) Ei (Ab|cD)* aBcD (0,4)(2,4) BE [^-] --a (2,3) BE [a-]* --a (0,3) BE [a-m-]* --amoma-- (0,4) E :::1:::0:|:::1:1:0: :::0:::1:::1:::0: (8,17) E :::1:::0:|:::1:1:1: :::0:::1:::1:::0: (8,17) {E [[:upper:]] A (0,1) [[<element>]] not supported E [[:lower:]]+ `az{ (1,3) E [[:upper:]]+ @AZ[ (1,3) # No collation in Go #BE [[-]] [[-]] (2,4) #BE [[.NIL.]] NULL ECOLLATE #BE [[=aleph=]] NULL ECOLLATE } BE$ \n \n (0,1) BEn$ \n \n (0,1) BE$ [^a] \n (0,1) BE$ \na \na (0,2) E (a)(b)(c) abc (0,3)(0,1)(1,2)(2,3) BE xxx xxx (0,3) #E1 (^|[ (,;])((([Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))([^0-9]|$) feb 6, (0,6) E (?:^|[ (,;])(?:(?:(?:[Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))(?:[^0-9]|$) feb 6, (0,6) Rust #E1 (^|[ (,;])((([Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))([^0-9]|$) 2/7 (0,3) E (?:^|[ (,;])(?:(?:(?:[Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))(?:[^0-9]|$) 2/7 (0,3) Rust #E1 (^|[ (,;])((([Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))([^0-9]|$) feb 1,Feb 6 (5,11) E (?:^|[ (,;])(?:(?:(?:[Ff]eb[^ ]* *|0*2/|\* */?)0*[6-7]))(?:[^0-9]|$) feb 1,Feb 6 (5,11) Rust #E3 ((((((((((((((((((((((((((((((x)))))))))))))))))))))))))))))) x (0,1)(0,1)(0,1) E (((?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:x)))))))))))))))))))))))))))))) x (0,1)(0,1)(0,1) Rust #E3 ((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))* xx (0,2)(1,2)(1,2) E (((?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:(?:x))))))))))))))))))))))))))))))* xx (0,2)(1,2)(1,2) Rust E a?(ab|ba)* ababababababababababababababababababababababababababababababababababababababababa (0,81)(79,81) E abaa|abbaa|abbbaa|abbbbaa ababbabbbabbbabbbbabbbbaa (18,25) E abaa|abbaa|abbbaa|abbbbaa ababbabbbabbbabbbbabaa (18,22) E aaac|aabc|abac|abbc|baac|babc|bbac|bbbc baaabbbabac (7,11) #BE$ .* \x01\xff (0,2) BE$ .* \x01\x7f (0,2) Rust E aaaa|bbbb|cccc|ddddd|eeeeee|fffffff|gggg|hhhh|iiiii|jjjjj|kkkkk|llll XaaaXbbbXcccXdddXeeeXfffXgggXhhhXiiiXjjjXkkkXlllXcbaXaaaa (53,57) L aaaa\nbbbb\ncccc\nddddd\neeeeee\nfffffff\ngggg\nhhhh\niiiii\njjjjj\nkkkkk\nllll XaaaXbbbXcccXdddXeeeXfffXgggXhhhXiiiXjjjXkkkXlllXcbaXaaaa NOMATCH E a*a*a*a*a*b aaaaaaaaab (0,10) BE ^ NULL (0,0) BE $ NULL (0,0) BE ^$ NULL (0,0) BE ^a$ a (0,1) BE abc abc (0,3) BE abc xabcy (1,4) BE abc ababc (2,5) BE ab*c abc (0,3) BE ab*bc abc (0,3) BE ab*bc abbc (0,4) BE ab*bc abbbbc (0,6) E ab+bc abbc (0,4) E ab+bc abbbbc (0,6) E ab?bc abbc (0,4) E ab?bc abc (0,3) E ab?c abc (0,3) BE ^abc$ abc (0,3) BE ^abc abcc (0,3) BE abc$ aabc (1,4) BE ^ abc (0,0) BE $ abc (3,3) BE a.c abc (0,3) BE a.c axc (0,3) BE a.*c axyzc (0,5) BE a[bc]d abd (0,3) BE a[b-d]e ace (0,3) BE a[b-d] aac (1,3) BE a[-b] a- (0,2) BE a[b-] a- (0,2) BE a] a] (0,2) BE a[]]b a]b (0,3) BE a[^bc]d aed (0,3) BE a[^-b]c adc (0,3) BE a[^]b]c adc (0,3) E ab|cd abc (0,2) E ab|cd abcd (0,2) E a\(b a(b (0,3) E a\(*b ab (0,2) E a\(*b a((b (0,4) E ((a)) abc (0,1)(0,1)(0,1) E (a)b(c) abc (0,3)(0,1)(2,3) E a+b+c aabbabc (4,7) E a* aaa (0,3) E (a*)* - (0,0)(0,0) E (a*)+ - (0,0)(0,0) E (a*|b)* - (0,0)(0,0) E (a+|b)* ab (0,2)(1,2) E (a+|b)+ ab (0,2)(1,2) E (a+|b)? 
ab (0,1)(0,1) BE [^ab]* cde (0,3) E (^)* - (0,0)(0,0) BE a* NULL (0,0) E ([abc])*d abbbcd (0,6)(4,5) E ([abc])*bcd abcd (0,4)(0,1) E a|b|c|d|e e (0,1) E (a|b|c|d|e)f ef (0,2)(0,1) E ((a*|b))* - (0,0)(0,0)(0,0) BE abcd*efg abcdefg (0,7) BE ab* xabyabbbz (1,3) BE ab* xayabbbz (1,2) E (ab|cd)e abcde (2,5)(2,4) BE [abhgefdc]ij hij (0,3) E (a|b)c*d abcd (1,4)(1,2) E (ab|ab*)bc abc (0,3)(0,1) E a([bc]*)c* abc (0,3)(1,3) E a([bc]*)(c*d) abcd (0,4)(1,3)(3,4) E a([bc]+)(c*d) abcd (0,4)(1,3)(3,4) E a([bc]*)(c+d) abcd (0,4)(1,2)(2,4) E a[bcd]*dcdcde adcdcde (0,7) E (ab|a)b*c abc (0,3)(0,2) E ((a)(b)c)(d) abcd (0,4)(0,3)(0,1)(1,2)(3,4) BE [A-Za-z_][A-Za-z0-9_]* alpha (0,5) E ^a(bc+|b[eh])g|.h$ abh (1,3) E (bc+d$|ef*g.|h?i(j|k)) effgz (0,5)(0,5) E (bc+d$|ef*g.|h?i(j|k)) ij (0,2)(0,2)(1,2) E (bc+d$|ef*g.|h?i(j|k)) reffgz (1,6)(1,6) E (((((((((a))))))))) a (0,1)(0,1)(0,1)(0,1)(0,1)(0,1)(0,1)(0,1)(0,1)(0,1) BE multiple words multiple words yeah (0,14) E (.*)c(.*) abcde (0,5)(0,2)(3,5) BE abcd abcd (0,4) E a(bc)d abcd (0,4)(1,3) E a[-]?c ac (0,3) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Qaddafi (0,15)(?,?)(10,12) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Mo'ammar Gadhafi (0,16)(?,?)(11,13) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Kaddafi (0,15)(?,?)(10,12) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Qadhafi (0,15)(?,?)(10,12) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Gadafi (0,14)(?,?)(10,11) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Mu'ammar Qadafi (0,15)(?,?)(11,12) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Moamar Gaddafi (0,14)(?,?)(9,11) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Mu'ammar Qadhdhafi (0,18)(?,?)(13,15) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Khaddafi (0,16)(?,?)(11,13) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Ghaddafy (0,16)(?,?)(11,13) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Ghadafi (0,15)(?,?)(11,12) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Ghaddafi (0,16)(?,?)(11,13) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muamar Kaddafi (0,14)(?,?)(9,11) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Quathafi (0,16)(?,?)(11,13) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Muammar Gheddafi (0,16)(?,?)(11,13) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Moammar Khadafy (0,15)(?,?)(11,12) E M[ou]'?am+[ae]r .*([AEae]l[- ])?[GKQ]h?[aeu]+([dtz][dhz]?)+af[iy] Moammar Qudhafi (0,15)(?,?)(10,12) E a+(b|c)*d+ aabcdd (0,6)(3,4) E ^.+$ vivi (0,4) E ^(.+)$ vivi (0,4)(0,4) E ^([^!.]+).att.com!(.+)$ gryphon.att.com!eby (0,19)(0,7)(16,19) E ^([^!]+!)?([^!]+)$ bas (0,3)(?,?)(0,3) E ^([^!]+!)?([^!]+)$ bar!bas (0,7)(0,4)(4,7) E ^([^!]+!)?([^!]+)$ foo!bas (0,7)(0,4)(4,7) E ^.+!([^!]+!)([^!]+)$ foo!bar!bas (0,11)(4,8)(8,11) E ((foo)|(bar))!bas bar!bas (0,7)(0,3)(?,?)(0,3) E ((foo)|(bar))!bas foo!bar!bas (4,11)(4,7)(?,?)(4,7) E ((foo)|(bar))!bas foo!bas (0,7)(0,3)(0,3) E ((foo)|bar)!bas bar!bas (0,7)(0,3) E ((foo)|bar)!bas foo!bar!bas (4,11)(4,7) E ((foo)|bar)!bas foo!bas (0,7)(0,3)(0,3) E (foo|(bar))!bas bar!bas (0,7)(0,3)(0,3) E (foo|(bar))!bas foo!bar!bas (4,11)(4,7)(4,7) E (foo|(bar))!bas foo!bas (0,7)(0,3) E (foo|bar)!bas bar!bas (0,7)(0,3) E 
(foo|bar)!bas foo!bar!bas (4,11)(4,7)
E (foo|bar)!bas foo!bas (0,7)(0,3)
E ^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$ foo!bar!bas (0,11)(0,11)(?,?)(?,?)(4,8)(8,11)
E ^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$ bas (0,3)(?,?)(0,3)
E ^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$ bar!bas (0,7)(0,4)(4,7)
E ^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$ foo!bar!bas (0,11)(?,?)(?,?)(4,8)(8,11)
E ^([^!]+!)?([^!]+)$|^.+!([^!]+!)([^!]+)$ foo!bas (0,7)(0,4)(4,7)
E ^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$ bas (0,3)(0,3)(?,?)(0,3)
E ^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$ bar!bas (0,7)(0,7)(0,4)(4,7)
E ^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$ foo!bar!bas (0,11)(0,11)(?,?)(?,?)(4,8)(8,11)
E ^(([^!]+!)?([^!]+)|.+!([^!]+!)([^!]+))$ foo!bas (0,7)(0,7)(0,4)(4,7)
E .*(/XXX).* /XXX (0,4)(0,4)
E .*(\\XXX).* \XXX (0,4)(0,4)
E \\XXX \XXX (0,4)
E .*(/000).* /000 (0,4)(0,4)
E .*(\\000).* \000 (0,4)(0,4)
E \\000 \000 (0,4)
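In the dat format above, `(?,?)` marks a capture group that takes no part in the match; the generated TOML renders the same thing as `[]`. A minimal sketch of that distinction for the `a(b)|c(d)|a(e)f` entry, assuming the `regex` crate as a dependency (illustrative only):

use regex::Regex; // assumed dependency

fn main() {
    // 'a(b)|c(d)|a(e)f' on "aef" => (0,3)(?,?)(?,?)(1,2): only group 3
    // participates in the match, so groups 1 and 2 are None.
    let re = Regex::new(r"a(b)|c(d)|a(e)f").unwrap();
    let caps = re.captures("aef").unwrap();
    assert!(caps.get(1).is_none());
    assert!(caps.get(2).is_none());
    assert_eq!(caps.get(3).map(|m| (m.start(), m.end())), Some((1, 2)));
}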
regex-1.12.2/testdata/fowler/dat/nullsubexpr.dat

NOTE null subexpression matches : 2002-06-06

E (a*)* a (0,1)(0,1)
E SAME x (0,0)(0,0)
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaax (0,6)(0,6)
E (a*)+ a (0,1)(0,1)
E SAME x (0,0)(0,0)
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaax (0,6)(0,6)
E (a+)* a (0,1)(0,1)
E SAME x (0,0)
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaax (0,6)(0,6)
E (a+)+ a (0,1)(0,1)
E SAME x NOMATCH
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaax (0,6)(0,6)
E ([a]*)* a (0,1)(0,1)
E SAME x (0,0)(0,0)
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaax (0,6)(0,6)
E ([a]*)+ a (0,1)(0,1)
E SAME x (0,0)(0,0)
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaax (0,6)(0,6)
E ([^b]*)* a (0,1)(0,1)
E SAME b (0,0)(0,0)
E SAME aaaaaa (0,6)(0,6)
E SAME aaaaaab (0,6)(0,6)
E ([ab]*)* a (0,1)(0,1)
E SAME aaaaaa (0,6)(0,6)
E SAME ababab (0,6)(0,6)
E SAME bababa (0,6)(0,6)
E SAME b (0,1)(0,1)
E SAME bbbbbb (0,6)(0,6)
E SAME aaaabcde (0,5)(0,5)
E ([^a]*)* b (0,1)(0,1)
E SAME bbbbbb (0,6)(0,6)
E SAME aaaaaa (0,0)(0,0)
E ([^ab]*)* ccccxx (0,6)(0,6)
E SAME ababab (0,0)(0,0)
#E ((z)+|a)* zabcde (0,2)(1,2)
E ((z)+|a)* zabcde (0,2)(1,2)(0,1) Rust
#{E a+? aaaaaa (0,1) no *? +? minimal match ops
#E (a) aaa (0,1)(0,1)
#E (a*?) aaa (0,0)(0,0)
#E (a)*? aaa (0,0)
#E (a*?)*? aaa (0,0)
#}
B \(a*\)*\(x\) x (0,1)(0,0)(0,1)
B \(a*\)*\(x\) ax (0,2)(0,1)(1,2)
B \(a*\)*\(x\) axa (0,2)(0,1)(1,2)
B \(a*\)*\(x\)\(\1\) x (0,1)(0,0)(0,1)(1,1)
B \(a*\)*\(x\)\(\1\) ax (0,2)(1,1)(1,2)(2,2)
B \(a*\)*\(x\)\(\1\) axa (0,3)(0,1)(1,2)(2,3)
B \(a*\)*\(x\)\(\1\)\(x\) axax (0,4)(0,1)(1,2)(2,3)(3,4)
B \(a*\)*\(x\)\(\1\)\(x\) axxa (0,3)(1,1)(1,2)(2,2)(2,3)
E (a*)*(x) x (0,1)(0,0)(0,1)
E (a*)*(x) ax (0,2)(0,1)(1,2)
E (a*)*(x) axa (0,2)(0,1)(1,2)
E (a*)+(x) x (0,1)(0,0)(0,1)
E (a*)+(x) ax (0,2)(0,1)(1,2)
E (a*)+(x) axa (0,2)(0,1)(1,2)
E (a*){2}(x) x (0,1)(0,0)(0,1)
E (a*){2}(x) ax (0,2)(1,1)(1,2)
E (a*){2}(x) axa (0,2)(1,1)(1,2)

regex-1.12.2/testdata/fowler/dat/repetition.dat

NOTE implicit vs. explicit repetitions : 2009-02-02

# Glenn Fowler <gsf@research.att.com>
# conforming matches (column 4) must match one of the following BREs
# NOMATCH
# (0,.)\((\(.\),\(.\))(?,?)(\2,\3)\)*
# (0,.)\((\(.\),\(.\))(\2,\3)(?,?)\)*
# i.e., each 3-tuple has two identical elements and one (?,?)

E ((..)|(.)) NULL NOMATCH
E ((..)|(.))((..)|(.)) NULL NOMATCH
E ((..)|(.))((..)|(.))((..)|(.)) NULL NOMATCH
E ((..)|(.)){1} NULL NOMATCH
E ((..)|(.)){2} NULL NOMATCH
E ((..)|(.)){3} NULL NOMATCH
E ((..)|(.))* NULL (0,0)
E ((..)|(.)) a (0,1)(0,1)(?,?)(0,1)
E ((..)|(.))((..)|(.)) a NOMATCH
E ((..)|(.))((..)|(.))((..)|(.)) a NOMATCH
E ((..)|(.)){1} a (0,1)(0,1)(?,?)(0,1)
E ((..)|(.)){2} a NOMATCH
E ((..)|(.)){3} a NOMATCH
E ((..)|(.))* a (0,1)(0,1)(?,?)(0,1)
E ((..)|(.)) aa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.))((..)|(.)) aa (0,2)(0,1)(?,?)(0,1)(1,2)(?,?)(1,2)
E ((..)|(.))((..)|(.))((..)|(.)) aa NOMATCH
E ((..)|(.)){1} aa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.)){2} aa (0,2)(1,2)(?,?)(1,2)
E ((..)|(.)){3} aa NOMATCH
E ((..)|(.))* aa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.)) aaa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.))((..)|(.)) aaa (0,3)(0,2)(0,2)(?,?)(2,3)(?,?)(2,3)
E ((..)|(.))((..)|(.))((..)|(.)) aaa (0,3)(0,1)(?,?)(0,1)(1,2)(?,?)(1,2)(2,3)(?,?)(2,3)
E ((..)|(.)){1} aaa (0,2)(0,2)(0,2)(?,?)
#E ((..)|(.)){2} aaa (0,3)(2,3)(?,?)(2,3)
E ((..)|(.)){2} aaa (0,3)(2,3)(0,2)(2,3) RE2/Go
E ((..)|(.)){3} aaa (0,3)(2,3)(?,?)(2,3)
#E ((..)|(.))* aaa (0,3)(2,3)(?,?)(2,3)
E ((..)|(.))* aaa (0,3)(2,3)(0,2)(2,3) RE2/Go
E ((..)|(.)) aaaa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.))((..)|(.)) aaaa (0,4)(0,2)(0,2)(?,?)(2,4)(2,4)(?,?)
E ((..)|(.))((..)|(.))((..)|(.)) aaaa (0,4)(0,2)(0,2)(?,?)(2,3)(?,?)(2,3)(3,4)(?,?)(3,4)
E ((..)|(.)){1} aaaa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.)){2} aaaa (0,4)(2,4)(2,4)(?,?)
#E ((..)|(.)){3} aaaa (0,4)(3,4)(?,?)(3,4)
E ((..)|(.)){3} aaaa (0,4)(3,4)(0,2)(3,4) RE2/Go
E ((..)|(.))* aaaa (0,4)(2,4)(2,4)(?,?)
E ((..)|(.)) aaaaa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.))((..)|(.)) aaaaa (0,4)(0,2)(0,2)(?,?)(2,4)(2,4)(?,?)
E ((..)|(.))((..)|(.))((..)|(.)) aaaaa (0,5)(0,2)(0,2)(?,?)(2,4)(2,4)(?,?)(4,5)(?,?)(4,5)
E ((..)|(.)){1} aaaaa (0,2)(0,2)(0,2)(?,?)
E ((..)|(.)){2} aaaaa (0,4)(2,4)(2,4)(?,?)
#E ((..)|(.)){3} aaaaa (0,5)(4,5)(?,?)(4,5) E ((..)|(.)){3} aaaaa (0,5)(4,5)(2,4)(4,5) RE2/Go #E ((..)|(.))* aaaaa (0,5)(4,5)(?,?)(4,5) E ((..)|(.))* aaaaa (0,5)(4,5)(2,4)(4,5) RE2/Go E ((..)|(.)) aaaaaa (0,2)(0,2)(0,2)(?,?) E ((..)|(.))((..)|(.)) aaaaaa (0,4)(0,2)(0,2)(?,?)(2,4)(2,4)(?,?) E ((..)|(.))((..)|(.))((..)|(.)) aaaaaa (0,6)(0,2)(0,2)(?,?)(2,4)(2,4)(?,?)(4,6)(4,6)(?,?) E ((..)|(.)){1} aaaaaa (0,2)(0,2)(0,2)(?,?) E ((..)|(.)){2} aaaaaa (0,4)(2,4)(2,4)(?,?) E ((..)|(.)){3} aaaaaa (0,6)(4,6)(4,6)(?,?) E ((..)|(.))* aaaaaa (0,6)(4,6)(4,6)(?,?) NOTE additional repetition tests graciously provided by Chris Kuklewicz www.haskell.org 2009-02-02 # These test a bug in OS X / FreeBSD / NetBSD, and libtree. # Linux/GLIBC gets the {8,} and {8,8} wrong. :HA#100:E X(.?){0,}Y X1234567Y (0,9)(7,8) :HA#101:E X(.?){1,}Y X1234567Y (0,9)(7,8) :HA#102:E X(.?){2,}Y X1234567Y (0,9)(7,8) :HA#103:E X(.?){3,}Y X1234567Y (0,9)(7,8) :HA#104:E X(.?){4,}Y X1234567Y (0,9)(7,8) :HA#105:E X(.?){5,}Y X1234567Y (0,9)(7,8) :HA#106:E X(.?){6,}Y X1234567Y (0,9)(7,8) :HA#107:E X(.?){7,}Y X1234567Y (0,9)(7,8) :HA#108:E X(.?){8,}Y X1234567Y (0,9)(8,8) #:HA#110:E X(.?){0,8}Y X1234567Y (0,9)(7,8) :HA#110:E X(.?){0,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#111:E X(.?){1,8}Y X1234567Y (0,9)(7,8) :HA#111:E X(.?){1,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#112:E X(.?){2,8}Y X1234567Y (0,9)(7,8) :HA#112:E X(.?){2,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#113:E X(.?){3,8}Y X1234567Y (0,9)(7,8) :HA#113:E X(.?){3,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#114:E X(.?){4,8}Y X1234567Y (0,9)(7,8) :HA#114:E X(.?){4,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#115:E X(.?){5,8}Y X1234567Y (0,9)(7,8) :HA#115:E X(.?){5,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#116:E X(.?){6,8}Y X1234567Y (0,9)(7,8) :HA#116:E X(.?){6,8}Y X1234567Y (0,9)(8,8) RE2/Go #:HA#117:E X(.?){7,8}Y X1234567Y (0,9)(7,8) :HA#117:E X(.?){7,8}Y X1234567Y (0,9)(8,8) RE2/Go :HA#118:E X(.?){8,8}Y X1234567Y (0,9)(8,8) # These test a fixed bug in my regex-tdfa that did not keep the expanded # form properly grouped, so right association did the wrong thing with # these ambiguous patterns (crafted just to test my code when I became # suspicious of my implementation). The first subexpression should use # "ab" then "a" then "bcd". # OS X / FreeBSD / NetBSD badly fail many of these, with impossible # results like (0,6)(4,5)(6,6). #:HA#260:E (a|ab|c|bcd){0,}(d*) ababcd (0,6)(3,6)(6,6) :HA#260:E (a|ab|c|bcd){0,}(d*) ababcd (0,1)(0,1)(1,1) Rust #:HA#261:E (a|ab|c|bcd){1,}(d*) ababcd (0,6)(3,6)(6,6) :HA#261:E (a|ab|c|bcd){1,}(d*) ababcd (0,1)(0,1)(1,1) Rust :HA#262:E (a|ab|c|bcd){2,}(d*) ababcd (0,6)(3,6)(6,6) :HA#263:E (a|ab|c|bcd){3,}(d*) ababcd (0,6)(3,6)(6,6) :HA#264:E (a|ab|c|bcd){4,}(d*) ababcd NOMATCH #:HA#265:E (a|ab|c|bcd){0,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#265:E (a|ab|c|bcd){0,10}(d*) ababcd (0,1)(0,1)(1,1) Rust #:HA#266:E (a|ab|c|bcd){1,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#266:E (a|ab|c|bcd){1,10}(d*) ababcd (0,1)(0,1)(1,1) Rust :HA#267:E (a|ab|c|bcd){2,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#268:E (a|ab|c|bcd){3,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#269:E (a|ab|c|bcd){4,10}(d*) ababcd NOMATCH #:HA#270:E (a|ab|c|bcd)*(d*) ababcd (0,6)(3,6)(6,6) :HA#270:E (a|ab|c|bcd)*(d*) ababcd (0,1)(0,1)(1,1) Rust #:HA#271:E (a|ab|c|bcd)+(d*) ababcd (0,6)(3,6)(6,6) :HA#271:E (a|ab|c|bcd)+(d*) ababcd (0,1)(0,1)(1,1) Rust # The above worked on Linux/GLIBC but the following often fail. 
# They also trip up OS X / FreeBSD / NetBSD: #:HA#280:E (ab|a|c|bcd){0,}(d*) ababcd (0,6)(3,6)(6,6) :HA#280:E (ab|a|c|bcd){0,}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#281:E (ab|a|c|bcd){1,}(d*) ababcd (0,6)(3,6)(6,6) :HA#281:E (ab|a|c|bcd){1,}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#282:E (ab|a|c|bcd){2,}(d*) ababcd (0,6)(3,6)(6,6) :HA#282:E (ab|a|c|bcd){2,}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#283:E (ab|a|c|bcd){3,}(d*) ababcd (0,6)(3,6)(6,6) :HA#283:E (ab|a|c|bcd){3,}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go :HA#284:E (ab|a|c|bcd){4,}(d*) ababcd NOMATCH #:HA#285:E (ab|a|c|bcd){0,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#285:E (ab|a|c|bcd){0,10}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#286:E (ab|a|c|bcd){1,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#286:E (ab|a|c|bcd){1,10}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#287:E (ab|a|c|bcd){2,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#287:E (ab|a|c|bcd){2,10}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#288:E (ab|a|c|bcd){3,10}(d*) ababcd (0,6)(3,6)(6,6) :HA#288:E (ab|a|c|bcd){3,10}(d*) ababcd (0,6)(4,5)(5,6) RE2/Go :HA#289:E (ab|a|c|bcd){4,10}(d*) ababcd NOMATCH #:HA#290:E (ab|a|c|bcd)*(d*) ababcd (0,6)(3,6)(6,6) :HA#290:E (ab|a|c|bcd)*(d*) ababcd (0,6)(4,5)(5,6) RE2/Go #:HA#291:E (ab|a|c|bcd)+(d*) ababcd (0,6)(3,6)(6,6) :HA#291:E (ab|a|c|bcd)+(d*) ababcd (0,6)(4,5)(5,6) RE2/Go
regex-1.12.2/testdata/fowler/nullsubexpr.toml
# !!! DO NOT EDIT !!! # Automatically generated by 'regex-cli generate fowler'. # Numbers in the test names correspond to the line number of the test from # the original dat file.
[[test]] name = "nullsubexpr3" regex = '''(a*)*''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr4" regex = '''(a*)*''' haystack = '''x''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr5" regex = '''(a*)*''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr6" regex = '''(a*)*''' haystack = '''aaaaaax''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr7" regex = '''(a*)+''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr8" regex = '''(a*)+''' haystack = '''x''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr9" regex = '''(a*)+''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr10" regex = '''(a*)+''' haystack = '''aaaaaax''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr11" regex = '''(a+)*''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr12" regex = '''(a+)*''' haystack = '''x''' matches = [[[0, 0], []]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr13" regex = '''(a+)*''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr14" regex = '''(a+)*''' haystack = '''aaaaaax''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr15" regex = '''(a+)+''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr16" regex = '''(a+)+''' haystack = '''x''' matches = [] match-limit = 1 [[test]] name = "nullsubexpr17" regex = '''(a+)+''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr18" regex = '''(a+)+''' haystack = '''aaaaaax''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr20" regex = '''([a]*)*''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr21" regex = '''([a]*)*''' haystack = '''x''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr22" regex = '''([a]*)*''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr23" regex = '''([a]*)*''' haystack = '''aaaaaax''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr24" regex = '''([a]*)+''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr25" regex = '''([a]*)+''' haystack = '''x''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr26" regex = '''([a]*)+''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr27" regex = '''([a]*)+''' haystack = '''aaaaaax''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr28" regex = '''([^b]*)*''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr29" regex = '''([^b]*)*''' haystack = '''b''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr30" regex = '''([^b]*)*''' haystack = 
'''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr31" regex = '''([^b]*)*''' haystack = '''aaaaaab''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr32" regex = '''([ab]*)*''' haystack = '''a''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr33" regex = '''([ab]*)*''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr34" regex = '''([ab]*)*''' haystack = '''ababab''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr35" regex = '''([ab]*)*''' haystack = '''bababa''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr36" regex = '''([ab]*)*''' haystack = '''b''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr37" regex = '''([ab]*)*''' haystack = '''bbbbbb''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr38" regex = '''([ab]*)*''' haystack = '''aaaabcde''' matches = [[[0, 5], [0, 5]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr39" regex = '''([^a]*)*''' haystack = '''b''' matches = [[[0, 1], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr40" regex = '''([^a]*)*''' haystack = '''bbbbbb''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr41" regex = '''([^a]*)*''' haystack = '''aaaaaa''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr42" regex = '''([^ab]*)*''' haystack = '''ccccxx''' matches = [[[0, 6], [0, 6]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr43" regex = '''([^ab]*)*''' haystack = '''ababab''' matches = [[[0, 0], [0, 0]]] match-limit = 1 anchored = true # Test added by Rust regex project. 
[[test]] name = "nullsubexpr46" regex = '''((z)+|a)*''' haystack = '''zabcde''' matches = [[[0, 2], [1, 2], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr64" regex = '''(a*)*(x)''' haystack = '''x''' matches = [[[0, 1], [0, 0], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr65" regex = '''(a*)*(x)''' haystack = '''ax''' matches = [[[0, 2], [0, 1], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr66" regex = '''(a*)*(x)''' haystack = '''axa''' matches = [[[0, 2], [0, 1], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr68" regex = '''(a*)+(x)''' haystack = '''x''' matches = [[[0, 1], [0, 0], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr69" regex = '''(a*)+(x)''' haystack = '''ax''' matches = [[[0, 2], [0, 1], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr70" regex = '''(a*)+(x)''' haystack = '''axa''' matches = [[[0, 2], [0, 1], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr72" regex = '''(a*){2}(x)''' haystack = '''x''' matches = [[[0, 1], [0, 0], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr73" regex = '''(a*){2}(x)''' haystack = '''ax''' matches = [[[0, 2], [1, 1], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "nullsubexpr74" regex = '''(a*){2}(x)''' haystack = '''axa''' matches = [[[0, 2], [1, 1], [1, 2]]] match-limit = 1 anchored = true �����������������������regex-1.12.2/testdata/fowler/repetition.toml��������������������������������������������������������0000644�0000000�0000000�00000035724�10461020230�0017222�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# !!! DO NOT EDIT !!! # Automatically generated by 'regex-cli generate fowler'. # Numbers in the test names correspond to the line number of the test from # the original dat file. 
[[test]] name = "repetition10" regex = '''((..)|(.))''' haystack = '''''' matches = [] match-limit = 1 [[test]] name = "repetition11" regex = '''((..)|(.))((..)|(.))''' haystack = '''''' matches = [] match-limit = 1 [[test]] name = "repetition12" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''''' matches = [] match-limit = 1 [[test]] name = "repetition14" regex = '''((..)|(.)){1}''' haystack = '''''' matches = [] match-limit = 1 [[test]] name = "repetition15" regex = '''((..)|(.)){2}''' haystack = '''''' matches = [] match-limit = 1 [[test]] name = "repetition16" regex = '''((..)|(.)){3}''' haystack = '''''' matches = [] match-limit = 1 [[test]] name = "repetition18" regex = '''((..)|(.))*''' haystack = '''''' matches = [[[0, 0], [], [], []]] match-limit = 1 anchored = true [[test]] name = "repetition20" regex = '''((..)|(.))''' haystack = '''a''' matches = [[[0, 1], [0, 1], [], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "repetition21" regex = '''((..)|(.))((..)|(.))''' haystack = '''a''' matches = [] match-limit = 1 [[test]] name = "repetition22" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''a''' matches = [] match-limit = 1 [[test]] name = "repetition24" regex = '''((..)|(.)){1}''' haystack = '''a''' matches = [[[0, 1], [0, 1], [], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "repetition25" regex = '''((..)|(.)){2}''' haystack = '''a''' matches = [] match-limit = 1 [[test]] name = "repetition26" regex = '''((..)|(.)){3}''' haystack = '''a''' matches = [] match-limit = 1 [[test]] name = "repetition28" regex = '''((..)|(.))*''' haystack = '''a''' matches = [[[0, 1], [0, 1], [], [0, 1]]] match-limit = 1 anchored = true [[test]] name = "repetition30" regex = '''((..)|(.))''' haystack = '''aa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition31" regex = '''((..)|(.))((..)|(.))''' haystack = '''aa''' matches = [[[0, 2], [0, 1], [], [0, 1], [1, 2], [], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "repetition32" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''aa''' matches = [] match-limit = 1 [[test]] name = "repetition34" regex = '''((..)|(.)){1}''' haystack = '''aa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition35" regex = '''((..)|(.)){2}''' haystack = '''aa''' matches = [[[0, 2], [1, 2], [], [1, 2]]] match-limit = 1 anchored = true [[test]] name = "repetition36" regex = '''((..)|(.)){3}''' haystack = '''aa''' matches = [] match-limit = 1 [[test]] name = "repetition38" regex = '''((..)|(.))*''' haystack = '''aa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition40" regex = '''((..)|(.))''' haystack = '''aaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition41" regex = '''((..)|(.))((..)|(.))''' haystack = '''aaa''' matches = [[[0, 3], [0, 2], [0, 2], [], [2, 3], [], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "repetition42" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''aaa''' matches = [[[0, 3], [0, 1], [], [0, 1], [1, 2], [], [1, 2], [2, 3], [], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "repetition44" regex = '''((..)|(.)){1}''' haystack = '''aaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true # Test added by RE2/Go project. 
[[test]] name = "repetition46" regex = '''((..)|(.)){2}''' haystack = '''aaa''' matches = [[[0, 3], [2, 3], [0, 2], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "repetition47" regex = '''((..)|(.)){3}''' haystack = '''aaa''' matches = [[[0, 3], [2, 3], [], [2, 3]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition50" regex = '''((..)|(.))*''' haystack = '''aaa''' matches = [[[0, 3], [2, 3], [0, 2], [2, 3]]] match-limit = 1 anchored = true [[test]] name = "repetition52" regex = '''((..)|(.))''' haystack = '''aaaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition53" regex = '''((..)|(.))((..)|(.))''' haystack = '''aaaa''' matches = [[[0, 4], [0, 2], [0, 2], [], [2, 4], [2, 4], []]] match-limit = 1 anchored = true [[test]] name = "repetition54" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''aaaa''' matches = [[[0, 4], [0, 2], [0, 2], [], [2, 3], [], [2, 3], [3, 4], [], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "repetition56" regex = '''((..)|(.)){1}''' haystack = '''aaaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition57" regex = '''((..)|(.)){2}''' haystack = '''aaaa''' matches = [[[0, 4], [2, 4], [2, 4], []]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition59" regex = '''((..)|(.)){3}''' haystack = '''aaaa''' matches = [[[0, 4], [3, 4], [0, 2], [3, 4]]] match-limit = 1 anchored = true [[test]] name = "repetition61" regex = '''((..)|(.))*''' haystack = '''aaaa''' matches = [[[0, 4], [2, 4], [2, 4], []]] match-limit = 1 anchored = true [[test]] name = "repetition63" regex = '''((..)|(.))''' haystack = '''aaaaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition64" regex = '''((..)|(.))((..)|(.))''' haystack = '''aaaaa''' matches = [[[0, 4], [0, 2], [0, 2], [], [2, 4], [2, 4], []]] match-limit = 1 anchored = true [[test]] name = "repetition65" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''aaaaa''' matches = [[[0, 5], [0, 2], [0, 2], [], [2, 4], [2, 4], [], [4, 5], [], [4, 5]]] match-limit = 1 anchored = true [[test]] name = "repetition67" regex = '''((..)|(.)){1}''' haystack = '''aaaaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition68" regex = '''((..)|(.)){2}''' haystack = '''aaaaa''' matches = [[[0, 4], [2, 4], [2, 4], []]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition70" regex = '''((..)|(.)){3}''' haystack = '''aaaaa''' matches = [[[0, 5], [4, 5], [2, 4], [4, 5]]] match-limit = 1 anchored = true # Test added by RE2/Go project. 
[[test]] name = "repetition73" regex = '''((..)|(.))*''' haystack = '''aaaaa''' matches = [[[0, 5], [4, 5], [2, 4], [4, 5]]] match-limit = 1 anchored = true [[test]] name = "repetition75" regex = '''((..)|(.))''' haystack = '''aaaaaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition76" regex = '''((..)|(.))((..)|(.))''' haystack = '''aaaaaa''' matches = [[[0, 4], [0, 2], [0, 2], [], [2, 4], [2, 4], []]] match-limit = 1 anchored = true [[test]] name = "repetition77" regex = '''((..)|(.))((..)|(.))((..)|(.))''' haystack = '''aaaaaa''' matches = [[[0, 6], [0, 2], [0, 2], [], [2, 4], [2, 4], [], [4, 6], [4, 6], []]] match-limit = 1 anchored = true [[test]] name = "repetition79" regex = '''((..)|(.)){1}''' haystack = '''aaaaaa''' matches = [[[0, 2], [0, 2], [0, 2], []]] match-limit = 1 anchored = true [[test]] name = "repetition80" regex = '''((..)|(.)){2}''' haystack = '''aaaaaa''' matches = [[[0, 4], [2, 4], [2, 4], []]] match-limit = 1 anchored = true [[test]] name = "repetition81" regex = '''((..)|(.)){3}''' haystack = '''aaaaaa''' matches = [[[0, 6], [4, 6], [4, 6], []]] match-limit = 1 anchored = true [[test]] name = "repetition83" regex = '''((..)|(.))*''' haystack = '''aaaaaa''' matches = [[[0, 6], [4, 6], [4, 6], []]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive90" regex = '''X(.?){0,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive91" regex = '''X(.?){1,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive92" regex = '''X(.?){2,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive93" regex = '''X(.?){3,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive94" regex = '''X(.?){4,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive95" regex = '''X(.?){5,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive96" regex = '''X(.?){6,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive97" regex = '''X(.?){7,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [7, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive98" regex = '''X(.?){8,}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive100" regex = '''X(.?){0,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive102" regex = '''X(.?){1,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive104" regex = '''X(.?){2,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive106" regex = '''X(.?){3,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. 
[[test]] name = "repetition-expensive108" regex = '''X(.?){4,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive110" regex = '''X(.?){5,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive112" regex = '''X(.?){6,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive114" regex = '''X(.?){7,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive115" regex = '''X(.?){8,8}Y''' haystack = '''X1234567Y''' matches = [[[0, 9], [8, 8]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "repetition-expensive127" regex = '''(a|ab|c|bcd){0,}(d*)''' haystack = '''ababcd''' matches = [[[0, 1], [0, 1], [1, 1]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "repetition-expensive129" regex = '''(a|ab|c|bcd){1,}(d*)''' haystack = '''ababcd''' matches = [[[0, 1], [0, 1], [1, 1]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive130" regex = '''(a|ab|c|bcd){2,}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [3, 6], [6, 6]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive131" regex = '''(a|ab|c|bcd){3,}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [3, 6], [6, 6]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive132" regex = '''(a|ab|c|bcd){4,}(d*)''' haystack = '''ababcd''' matches = [] match-limit = 1 # Test added by Rust regex project. [[test]] name = "repetition-expensive134" regex = '''(a|ab|c|bcd){0,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 1], [0, 1], [1, 1]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "repetition-expensive136" regex = '''(a|ab|c|bcd){1,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 1], [0, 1], [1, 1]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive137" regex = '''(a|ab|c|bcd){2,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [3, 6], [6, 6]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive138" regex = '''(a|ab|c|bcd){3,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [3, 6], [6, 6]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive139" regex = '''(a|ab|c|bcd){4,10}(d*)''' haystack = '''ababcd''' matches = [] match-limit = 1 # Test added by Rust regex project. [[test]] name = "repetition-expensive141" regex = '''(a|ab|c|bcd)*(d*)''' haystack = '''ababcd''' matches = [[[0, 1], [0, 1], [1, 1]]] match-limit = 1 anchored = true # Test added by Rust regex project. [[test]] name = "repetition-expensive143" regex = '''(a|ab|c|bcd)+(d*)''' haystack = '''ababcd''' matches = [[[0, 1], [0, 1], [1, 1]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive149" regex = '''(ab|a|c|bcd){0,}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive151" regex = '''(ab|a|c|bcd){1,}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. 
[[test]] name = "repetition-expensive153" regex = '''(ab|a|c|bcd){2,}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive155" regex = '''(ab|a|c|bcd){3,}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive156" regex = '''(ab|a|c|bcd){4,}(d*)''' haystack = '''ababcd''' matches = [] match-limit = 1 # Test added by RE2/Go project. [[test]] name = "repetition-expensive158" regex = '''(ab|a|c|bcd){0,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive160" regex = '''(ab|a|c|bcd){1,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive162" regex = '''(ab|a|c|bcd){2,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive164" regex = '''(ab|a|c|bcd){3,10}(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true [[test]] name = "repetition-expensive165" regex = '''(ab|a|c|bcd){4,10}(d*)''' haystack = '''ababcd''' matches = [] match-limit = 1 # Test added by RE2/Go project. [[test]] name = "repetition-expensive167" regex = '''(ab|a|c|bcd)*(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true # Test added by RE2/Go project. [[test]] name = "repetition-expensive169" regex = '''(ab|a|c|bcd)+(d*)''' haystack = '''ababcd''' matches = [[[0, 6], [4, 5], [5, 6]]] match-limit = 1 anchored = true ��������������������������������������������regex-1.12.2/testdata/iter.toml���������������������������������������������������������������������0000644�0000000�0000000�00000005344�10461020230�0014500�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������[[test]] name = "1" regex = "a" haystack = "aaa" matches = [[0, 1], [1, 2], [2, 3]] [[test]] name = "2" regex = "a" haystack = "aba" matches = [[0, 1], [2, 3]] [[test]] name = "empty1" regex = '' haystack = '' matches = [[0, 0]] [[test]] name = "empty2" regex = '' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty3" regex = '(?:)' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty4" regex = '(?:)*' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty5" regex = '(?:)+' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty6" regex = '(?:)?' 
haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty7" regex = '(?:)(?:)' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty8" regex = '(?:)+|z' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty9" regex = 'z|(?:)+' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty10" regex = '(?:)+|b' haystack = 'abc' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] [[test]] name = "empty11" regex = 'b|(?:)+' haystack = 'abc' matches = [[0, 0], [1, 2], [3, 3]] [[test]] name = "start1" regex = "^a" haystack = "a" matches = [[0, 1]] [[test]] name = "start2" regex = "^a" haystack = "aa" matches = [[0, 1]] [[test]] name = "anchored1" regex = "a" haystack = "a" matches = [[0, 1]] anchored = true # This test is pretty subtle. It demonstrates the crucial difference between # '^a' and 'a' compiled in 'anchored' mode. The former regex exclusively # matches at the start of a haystack and nowhere else. The latter regex has # no such restriction, but its automaton is constructed such that it lacks a # `.*?` prefix. So it can actually produce matches at multiple locations. # The anchored3 test drives this point home. [[test]] name = "anchored2" regex = "a" haystack = "aa" matches = [[0, 1], [1, 2]] anchored = true # Unlike anchored2, this test stops matching anything after it sees `b` # since it lacks a `.*?` prefix. Since it is looking for 'a' but sees 'b', it # determines that there are no remaining matches. [[test]] name = "anchored3" regex = "a" haystack = "aaba" matches = [[0, 1], [1, 2]] anchored = true
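# The anchored2/anchored3 behavior above can be reproduced directly with the
# meta regex engine. A minimal sketch, assuming the regex-automata 0.4.x
# APIs (`Input::anchored` and `Anchored::Yes` are the relevant knobs):
#
#     use regex_automata::{meta::Regex, Anchored, Input};
#
#     let re = Regex::new("a").unwrap();
#     let input = Input::new("aaba").anchored(Anchored::Yes);
#     let spans: Vec<(usize, usize)> =
#         re.find_iter(input).map(|m| (m.start(), m.end())).collect();
#     // Iteration stops at the 'b': each subsequent search is anchored at
#     // the position where the previous match ended, and an anchored search
#     // for 'a' fails there.
#     assert_eq!(spans, vec![(0, 1), (1, 2)]);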
haystack = "abczzabc" matches = [[0, 3], [4, 4], [5, 8]] ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/leftmost-all.toml�������������������������������������������������������������0000644�0000000�0000000�00000000632�10461020230�0016133�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������[[test]] name = "alt" regex = 'foo|foobar' haystack = "foobar" matches = [[0, 6]] match-kind = "all" search-kind = "leftmost" [[test]] name = "multi" regex = ['foo', 'foobar'] haystack = "foobar" matches = [ { id = 1, span = [0, 6] }, ] match-kind = "all" search-kind = "leftmost" [[test]] name = "dotall" regex = '(?s:.)' haystack = "foobar" matches = [[5, 6]] match-kind = "all" search-kind = "leftmost" ������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/line-terminator.toml����������������������������������������������������������0000644�0000000�0000000�00000005474�10461020230�0016652�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# This tests that we can switch the line terminator to the NUL byte. [[test]] name = "nul" regex = '(?m)^[a-z]+$' haystack = '\x00abc\x00' matches = [[1, 4]] unescape = true line-terminator = '\x00' # This tests that '.' will not match the configured line terminator, but will # match \n. [[test]] name = "dot-changes-with-line-terminator" regex = '.' haystack = '\x00\n' matches = [[1, 2]] unescape = true line-terminator = '\x00' # This tests that when we switch the line terminator, \n is no longer # recognized as the terminator. [[test]] name = "not-line-feed" regex = '(?m)^[a-z]+$' haystack = '\nabc\n' matches = [] unescape = true line-terminator = '\x00' # This tests that we can set the line terminator to a non-ASCII byte and have # it behave as expected. [[test]] name = "non-ascii" regex = '(?m)^[a-z]+$' haystack = '\xFFabc\xFF' matches = [[1, 4]] unescape = true line-terminator = '\xFF' utf8 = false # This tests a tricky case where the line terminator is set to \r. This ensures # that the StartLF look-behind assertion is tracked when computing the start # state. [[test]] name = "carriage" regex = '(?m)^[a-z]+' haystack = 'ABC\rabc' matches = [[4, 7]] bounds = [4, 7] unescape = true line-terminator = '\r' # This tests that we can set the line terminator to a byte corresponding to a # word character, and things work as expected. [[test]] name = "word-byte" regex = '(?m)^[a-z]+$' haystack = 'ZabcZ' matches = [[1, 4]] unescape = true line-terminator = 'Z' # This tests that we can set the line terminator to a byte corresponding to a # non-word character, and things work as expected. 
[[test]] name = "non-word-byte" regex = '(?m)^[a-z]+$' haystack = '%abc%' matches = [[1, 4]] unescape = true line-terminator = '%' # This combines "set line terminator to a word byte" with a word boundary # assertion, which should result in no match even though ^/$ matches. [[test]] name = "word-boundary" regex = '(?m)^\b[a-z]+\b$' haystack = 'ZabcZ' matches = [] unescape = true line-terminator = 'Z' # Like 'word-boundary', but does an anchored search at the point where ^ # matches, but where \b should not. [[test]] name = "word-boundary-at" regex = '(?m)^\b[a-z]+\b$' haystack = 'ZabcZ' matches = [] bounds = [1, 4] anchored = true unescape = true line-terminator = 'Z' # Like 'word-boundary-at', but flips the word boundary to a negation. This # in particular tests a tricky case in DFA engines, where they must consider # explicitly that a starting configuration from a custom line terminator may # also required setting the "is from word byte" flag on a state. Otherwise, # it's treated as "not from a word byte," which would result in \B not matching # here when it should. [[test]] name = "not-word-boundary-at" regex = '(?m)^\B[a-z]+\B$' haystack = 'ZabcZ' matches = [[1, 4]] bounds = [1, 4] anchored = true unescape = true line-terminator = 'Z' ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/misc.toml���������������������������������������������������������������������0000644�0000000�0000000�00000002723�10461020230�0014466�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������[[test]] name = "ascii-literal" regex = "a" haystack = "a" matches = [[0, 1]] [[test]] name = "ascii-literal-not" regex = "a" haystack = "z" matches = [] [[test]] name = "ascii-literal-anchored" regex = "a" haystack = "a" matches = [[0, 1]] anchored = true [[test]] name = "ascii-literal-anchored-not" regex = "a" haystack = "z" matches = [] anchored = true [[test]] name = "anchor-start-end-line" regex = '(?m)^bar$' haystack = "foo\nbar\nbaz" matches = [[4, 7]] [[test]] name = "prefix-literal-match" regex = '^abc' haystack = "abc" matches = [[0, 3]] [[test]] name = "prefix-literal-match-ascii" regex = '^abc' haystack = "abc" matches = [[0, 3]] unicode = false utf8 = false [[test]] name = "prefix-literal-no-match" regex = '^abc' haystack = "zabc" matches = [] [[test]] name = "one-literal-edge" regex = 'abc' haystack = "xxxxxab" matches = [] [[test]] name = "terminates" regex = 'a$' haystack = "a" matches = [[0, 1]] [[test]] name = "suffix-100" regex = '.*abcd' haystack = "abcd" matches = [[0, 4]] [[test]] name = "suffix-200" regex = '.*(?:abcd)+' haystack = "abcd" matches = [[0, 4]] [[test]] name = "suffix-300" regex = '.*(?:abcd)+' haystack = "abcdabcd" matches = [[0, 8]] [[test]] name = "suffix-400" regex = '.*(?:abcd)+' haystack = "abcdxabcd" matches = [[0, 9]] [[test]] name = "suffix-500" regex = '.*x(?:abcd)+' haystack = "abcdxabcd" matches = [[0, 9]] [[test]] name = "suffix-600" regex = '[^abcd]*x(?:abcd)+' haystack = "abcdxabcd" matches = [[4, 9]] 
regex-1.12.2/testdata/multiline.toml
[[test]] name = "basic1" regex = '(?m)^[a-z]+$' haystack = "abc\ndef\nxyz" matches = [[0, 3], [4, 7], [8, 11]] [[test]] name = "basic1-crlf" regex = '(?Rm)^[a-z]+$' haystack = "abc\ndef\nxyz" matches = [[0, 3], [4, 7], [8, 11]] [[test]] name = "basic1-crlf-cr" regex = '(?Rm)^[a-z]+$' haystack = "abc\rdef\rxyz" matches = [[0, 3], [4, 7], [8, 11]] [[test]] name = "basic2" regex = '(?m)^$' haystack = "abc\ndef\nxyz" matches = [] [[test]] name = "basic2-crlf" regex = '(?Rm)^$' haystack = "abc\ndef\nxyz" matches = [] [[test]] name = "basic2-crlf-cr" regex = '(?Rm)^$' haystack = "abc\rdef\rxyz" matches = [] [[test]] name = "basic3" regex = '(?m)^' haystack = "abc\ndef\nxyz" matches = [[0, 0], [4, 4], [8, 8]] [[test]] name = "basic3-crlf" regex = '(?Rm)^' haystack = "abc\ndef\nxyz" matches = [[0, 0], [4, 4], [8, 8]] [[test]] name = "basic3-crlf-cr" regex = '(?Rm)^' haystack = "abc\rdef\rxyz" matches = [[0, 0], [4, 4], [8, 8]] [[test]] name = "basic4" regex = '(?m)$' haystack = "abc\ndef\nxyz" matches = [[3, 3], [7, 7], [11, 11]] [[test]] name = "basic4-crlf" regex = '(?Rm)$' haystack = "abc\ndef\nxyz" matches = [[3, 3], [7, 7], [11, 11]] [[test]] name = "basic4-crlf-cr" regex = '(?Rm)$' haystack = "abc\rdef\rxyz" matches = [[3, 3], [7, 7], [11, 11]] [[test]] name = "basic5" regex = '(?m)^[a-z]' haystack = "abc\ndef\nxyz" matches = [[0, 1], [4, 5], [8, 9]] [[test]] name = "basic5-crlf" regex = '(?Rm)^[a-z]' haystack = "abc\ndef\nxyz" matches = [[0, 1], [4, 5], [8, 9]] [[test]] name = "basic5-crlf-cr" regex = '(?Rm)^[a-z]' haystack = "abc\rdef\rxyz" matches = [[0, 1], [4, 5], [8, 9]] [[test]] name = "basic6" regex = '(?m)[a-z]^' haystack = "abc\ndef\nxyz" matches = [] [[test]] name = "basic6-crlf" regex = '(?Rm)[a-z]^' haystack = "abc\ndef\nxyz" matches = [] [[test]] name = "basic6-crlf-cr" regex = '(?Rm)[a-z]^' haystack = "abc\rdef\rxyz" matches = [] [[test]] name = "basic7" regex = '(?m)[a-z]$' haystack = "abc\ndef\nxyz" matches = [[2, 3], [6, 7], [10, 11]] [[test]] name = "basic7-crlf" regex = '(?Rm)[a-z]$' haystack = "abc\ndef\nxyz" matches = [[2, 3], [6, 7], [10, 11]] [[test]] name = "basic7-crlf-cr" regex = '(?Rm)[a-z]$' haystack = "abc\rdef\rxyz" matches = [[2, 3], [6, 7], [10, 11]] [[test]] name = "basic8" regex = '(?m)$[a-z]' haystack = "abc\ndef\nxyz" matches = [] [[test]] name = "basic8-crlf" regex = '(?Rm)$[a-z]' haystack = "abc\ndef\nxyz" matches = [] [[test]] name = "basic8-crlf-cr" regex = '(?Rm)$[a-z]' haystack = "abc\rdef\rxyz" matches = [] [[test]] name = "basic9" regex = '(?m)^$' haystack = "" matches = [[0, 0]] [[test]] name = "basic9-crlf" regex = '(?Rm)^$' haystack = "" matches = [[0, 0]] [[test]] name = "repeat1" regex = '(?m)(?:^$)*' haystack = "a\nb\nc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "repeat1-crlf" regex = '(?Rm)(?:^$)*' haystack = "a\nb\nc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "repeat1-crlf-cr" regex = '(?Rm)(?:^$)*' haystack = "a\rb\rc" matches =
[[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "repeat1-no-multi" regex = '(?:^$)*' haystack = "a\nb\nc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "repeat1-no-multi-crlf" regex = '(?R)(?:^$)*' haystack = "a\nb\nc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "repeat1-no-multi-crlf-cr" regex = '(?R)(?:^$)*' haystack = "a\rb\rc" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] [[test]] name = "repeat2" regex = '(?m)(?:^|a)+' haystack = "a\naaa\n" matches = [[0, 0], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat2-crlf" regex = '(?Rm)(?:^|a)+' haystack = "a\naaa\n" matches = [[0, 0], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat2-crlf-cr" regex = '(?Rm)(?:^|a)+' haystack = "a\raaa\r" matches = [[0, 0], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat2-no-multi" regex = '(?:^|a)+' haystack = "a\naaa\n" matches = [[0, 0], [2, 5]] [[test]] name = "repeat2-no-multi-crlf" regex = '(?R)(?:^|a)+' haystack = "a\naaa\n" matches = [[0, 0], [2, 5]] [[test]] name = "repeat2-no-multi-crlf-cr" regex = '(?R)(?:^|a)+' haystack = "a\raaa\r" matches = [[0, 0], [2, 5]] [[test]] name = "repeat3" regex = '(?m)(?:^|a)*' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat3-crlf" regex = '(?Rm)(?:^|a)*' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat3-crlf-cr" regex = '(?Rm)(?:^|a)*' haystack = "a\raaa\r" matches = [[0, 0], [1, 1], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat3-no-multi" regex = '(?:^|a)*' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 5], [6, 6]] [[test]] name = "repeat3-no-multi-crlf" regex = '(?R)(?:^|a)*' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 5], [6, 6]] [[test]] name = "repeat3-no-multi-crlf-cr" regex = '(?R)(?:^|a)*' haystack = "a\raaa\r" matches = [[0, 0], [1, 1], [2, 5], [6, 6]] [[test]] name = "repeat4" regex = '(?m)(?:^|a+)' haystack = "a\naaa\n" matches = [[0, 0], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat4-crlf" regex = '(?Rm)(?:^|a+)' haystack = "a\naaa\n" matches = [[0, 0], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat4-crlf-cr" regex = '(?Rm)(?:^|a+)' haystack = "a\raaa\r" matches = [[0, 0], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat4-no-multi" regex = '(?:^|a+)' haystack = "a\naaa\n" matches = [[0, 0], [2, 5]] [[test]] name = "repeat4-no-multi-crlf" regex = '(?R)(?:^|a+)' haystack = "a\naaa\n" matches = [[0, 0], [2, 5]] [[test]] name = "repeat4-no-multi-crlf-cr" regex = '(?R)(?:^|a+)' haystack = "a\raaa\r" matches = [[0, 0], [2, 5]] [[test]] name = "repeat5" regex = '(?m)(?:^|a*)' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat5-crlf" regex = '(?Rm)(?:^|a*)' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat5-crlf-cr" regex = '(?Rm)(?:^|a*)' haystack = "a\raaa\r" matches = [[0, 0], [1, 1], [2, 2], [3, 5], [6, 6]] [[test]] name = "repeat5-no-multi" regex = '(?:^|a*)' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 5], [6, 6]] [[test]] name = "repeat5-no-multi-crlf" regex = '(?R)(?:^|a*)' haystack = "a\naaa\n" matches = [[0, 0], [1, 1], [2, 5], [6, 6]] [[test]] name = "repeat5-no-multi-crlf-cr" regex = '(?R)(?:^|a*)' haystack = "a\raaa\r" matches = [[0, 0], [1, 1], [2, 5], [6, 6]] [[test]] name = "repeat6" regex = '(?m)(?:^[a-z])+' haystack = "abc\ndef\nxyz" matches = [[0, 1], [4, 5], [8, 9]] [[test]] name = "repeat6-crlf" regex = '(?Rm)(?:^[a-z])+' haystack = 
"abc\ndef\nxyz" matches = [[0, 1], [4, 5], [8, 9]] [[test]] name = "repeat6-crlf-cr" regex = '(?Rm)(?:^[a-z])+' haystack = "abc\rdef\rxyz" matches = [[0, 1], [4, 5], [8, 9]] [[test]] name = "repeat6-no-multi" regex = '(?:^[a-z])+' haystack = "abc\ndef\nxyz" matches = [[0, 1]] [[test]] name = "repeat6-no-multi-crlf" regex = '(?R)(?:^[a-z])+' haystack = "abc\ndef\nxyz" matches = [[0, 1]] [[test]] name = "repeat6-no-multi-crlf-cr" regex = '(?R)(?:^[a-z])+' haystack = "abc\rdef\rxyz" matches = [[0, 1]] [[test]] name = "repeat7" regex = '(?m)(?:^[a-z]{3}\n?)+' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat7-crlf" regex = '(?Rm)(?:^[a-z]{3}\n?)+' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat7-crlf-cr" regex = '(?Rm)(?:^[a-z]{3}\r?)+' haystack = "abc\rdef\rxyz" matches = [[0, 11]] [[test]] name = "repeat7-no-multi" regex = '(?:^[a-z]{3}\n?)+' haystack = "abc\ndef\nxyz" matches = [[0, 4]] [[test]] name = "repeat7-no-multi-crlf" regex = '(?R)(?:^[a-z]{3}\n?)+' haystack = "abc\ndef\nxyz" matches = [[0, 4]] [[test]] name = "repeat7-no-multi-crlf-cr" regex = '(?R)(?:^[a-z]{3}\r?)+' haystack = "abc\rdef\rxyz" matches = [[0, 4]] [[test]] name = "repeat8" regex = '(?m)(?:^[a-z]{3}\n?)*' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat8-crlf" regex = '(?Rm)(?:^[a-z]{3}\n?)*' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat8-crlf-cr" regex = '(?Rm)(?:^[a-z]{3}\r?)*' haystack = "abc\rdef\rxyz" matches = [[0, 11]] [[test]] name = "repeat8-no-multi" regex = '(?:^[a-z]{3}\n?)*' haystack = "abc\ndef\nxyz" matches = [[0, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11]] [[test]] name = "repeat8-no-multi-crlf" regex = '(?R)(?:^[a-z]{3}\n?)*' haystack = "abc\ndef\nxyz" matches = [[0, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11]] [[test]] name = "repeat8-no-multi-crlf-cr" regex = '(?R)(?:^[a-z]{3}\r?)*' haystack = "abc\rdef\rxyz" matches = [[0, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9], [10, 10], [11, 11]] [[test]] name = "repeat9" regex = '(?m)(?:\n?[a-z]{3}$)+' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat9-crlf" regex = '(?Rm)(?:\n?[a-z]{3}$)+' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat9-crlf-cr" regex = '(?Rm)(?:\r?[a-z]{3}$)+' haystack = "abc\rdef\rxyz" matches = [[0, 11]] [[test]] name = "repeat9-no-multi" regex = '(?:\n?[a-z]{3}$)+' haystack = "abc\ndef\nxyz" matches = [[7, 11]] [[test]] name = "repeat9-no-multi-crlf" regex = '(?R)(?:\n?[a-z]{3}$)+' haystack = "abc\ndef\nxyz" matches = [[7, 11]] [[test]] name = "repeat9-no-multi-crlf-cr" regex = '(?R)(?:\r?[a-z]{3}$)+' haystack = "abc\rdef\rxyz" matches = [[7, 11]] [[test]] name = "repeat10" regex = '(?m)(?:\n?[a-z]{3}$)*' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat10-crlf" regex = '(?Rm)(?:\n?[a-z]{3}$)*' haystack = "abc\ndef\nxyz" matches = [[0, 11]] [[test]] name = "repeat10-crlf-cr" regex = '(?Rm)(?:\r?[a-z]{3}$)*' haystack = "abc\rdef\rxyz" matches = [[0, 11]] [[test]] name = "repeat10-no-multi" regex = '(?:\n?[a-z]{3}$)*' haystack = "abc\ndef\nxyz" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 11]] [[test]] name = "repeat10-no-multi-crlf" regex = '(?R)(?:\n?[a-z]{3}$)*' haystack = "abc\ndef\nxyz" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 11]] [[test]] name = "repeat10-no-multi-crlf-cr" regex = '(?R)(?:\r?[a-z]{3}$)*' haystack = "abc\rdef\rxyz" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 
5], [6, 6], [7, 11]] [[test]] name = "repeat11" regex = '(?m)^*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat11-crlf" regex = '(?Rm)^*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat11-crlf-cr" regex = '(?Rm)^*' haystack = "\raa\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat11-no-multi" regex = '^*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat11-no-multi-crlf" regex = '(?R)^*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat11-no-multi-crlf-cr" regex = '(?R)^*' haystack = "\raa\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat12" regex = '(?m)^+' haystack = "\naa\n" matches = [[0, 0], [1, 1], [4, 4]] [[test]] name = "repeat12-crlf" regex = '(?Rm)^+' haystack = "\naa\n" matches = [[0, 0], [1, 1], [4, 4]] [[test]] name = "repeat12-crlf-cr" regex = '(?Rm)^+' haystack = "\raa\r" matches = [[0, 0], [1, 1], [4, 4]] [[test]] name = "repeat12-no-multi" regex = '^+' haystack = "\naa\n" matches = [[0, 0]] [[test]] name = "repeat12-no-multi-crlf" regex = '(?R)^+' haystack = "\naa\n" matches = [[0, 0]] [[test]] name = "repeat12-no-multi-crlf-cr" regex = '(?R)^+' haystack = "\raa\r" matches = [[0, 0]] [[test]] name = "repeat13" regex = '(?m)$*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat13-crlf" regex = '(?Rm)$*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat13-crlf-cr" regex = '(?Rm)$*' haystack = "\raa\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat13-no-multi" regex = '$*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat13-no-multi-crlf" regex = '(?R)$*' haystack = "\naa\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat13-no-multi-crlf-cr" regex = '(?R)$*' haystack = "\raa\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] [[test]] name = "repeat14" regex = '(?m)$+' haystack = "\naa\n" matches = [[0, 0], [3, 3], [4, 4]] [[test]] name = "repeat14-crlf" regex = '(?Rm)$+' haystack = "\naa\n" matches = [[0, 0], [3, 3], [4, 4]] [[test]] name = "repeat14-crlf-cr" regex = '(?Rm)$+' haystack = "\raa\r" matches = [[0, 0], [3, 3], [4, 4]] [[test]] name = "repeat14-no-multi" regex = '$+' haystack = "\naa\n" matches = [[4, 4]] [[test]] name = "repeat14-no-multi-crlf" regex = '(?R)$+' haystack = "\naa\n" matches = [[4, 4]] [[test]] name = "repeat14-no-multi-crlf-cr" regex = '(?R)$+' haystack = "\raa\r" matches = [[4, 4]] [[test]] name = "repeat15" regex = '(?m)(?:$\n)+' haystack = "\n\naaa\n\n" matches = [[0, 2], [5, 7]] [[test]] name = "repeat15-crlf" regex = '(?Rm)(?:$\n)+' haystack = "\n\naaa\n\n" matches = [[0, 2], [5, 7]] [[test]] name = "repeat15-crlf-cr" regex = '(?Rm)(?:$\r)+' haystack = "\r\raaa\r\r" matches = [[0, 2], [5, 7]] [[test]] name = "repeat15-no-multi" regex = '(?:$\n)+' haystack = "\n\naaa\n\n" matches = [] [[test]] name = "repeat15-no-multi-crlf" regex = '(?R)(?:$\n)+' haystack = "\n\naaa\n\n" matches = [] [[test]] name = "repeat15-no-multi-crlf-cr" regex = '(?R)(?:$\r)+' haystack = "\r\raaa\r\r" matches = [] [[test]] name = "repeat16" regex = '(?m)(?:$\n)*' haystack = "\n\naaa\n\n" matches = [[0, 2], [3, 3], [4, 4], [5, 7]] [[test]] name = "repeat16-crlf" regex = '(?Rm)(?:$\n)*' haystack = "\n\naaa\n\n" matches = [[0, 2], [3, 3], [4, 4], 
[5, 7]] [[test]] name = "repeat16-crlf-cr" regex = '(?Rm)(?:$\r)*' haystack = "\r\raaa\r\r" matches = [[0, 2], [3, 3], [4, 4], [5, 7]] [[test]] name = "repeat16-no-multi" regex = '(?:$\n)*' haystack = "\n\naaa\n\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7]] [[test]] name = "repeat16-no-multi-crlf" regex = '(?R)(?:$\n)*' haystack = "\n\naaa\n\n" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7]] [[test]] name = "repeat16-no-multi-crlf-cr" regex = '(?R)(?:$\r)*' haystack = "\r\raaa\r\r" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7]] [[test]] name = "repeat17" regex = '(?m)(?:$\n^)+' haystack = "\n\naaa\n\n" matches = [[0, 2], [5, 7]] [[test]] name = "repeat17-crlf" regex = '(?Rm)(?:$\n^)+' haystack = "\n\naaa\n\n" matches = [[0, 2], [5, 7]] [[test]] name = "repeat17-crlf-cr" regex = '(?Rm)(?:$\r^)+' haystack = "\r\raaa\r\r" matches = [[0, 2], [5, 7]] [[test]] name = "repeat17-no-multi" regex = '(?:$\n^)+' haystack = "\n\naaa\n\n" matches = [] [[test]] name = "repeat17-no-multi-crlf" regex = '(?R)(?:$\n^)+' haystack = "\n\naaa\n\n" matches = [] [[test]] name = "repeat17-no-multi-crlf-cr" regex = '(?R)(?:$\r^)+' haystack = "\r\raaa\r\r" matches = [] [[test]] name = "repeat18" regex = '(?m)(?:^|$)+' haystack = "\n\naaa\n\n" matches = [[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]] [[test]] name = "repeat18-crlf" regex = '(?Rm)(?:^|$)+' haystack = "\n\naaa\n\n" matches = [[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]] [[test]] name = "repeat18-crlf-cr" regex = '(?Rm)(?:^|$)+' haystack = "\r\raaa\r\r" matches = [[0, 0], [1, 1], [2, 2], [5, 5], [6, 6], [7, 7]] [[test]] name = "repeat18-no-multi" regex = '(?:^|$)+' haystack = "\n\naaa\n\n" matches = [[0, 0], [7, 7]] [[test]] name = "repeat18-no-multi-crlf" regex = '(?R)(?:^|$)+' haystack = "\n\naaa\n\n" matches = [[0, 0], [7, 7]] [[test]] name = "repeat18-no-multi-crlf-cr" regex = '(?R)(?:^|$)+' haystack = "\r\raaa\r\r" matches = [[0, 0], [7, 7]] [[test]] name = "match-line-100" regex = '(?m)^.+$' haystack = "aa\naaaaaaaaaaaaaaaaaaa\n" matches = [[0, 2], [3, 22]] [[test]] name = "match-line-100-crlf" regex = '(?Rm)^.+$' haystack = "aa\naaaaaaaaaaaaaaaaaaa\n" matches = [[0, 2], [3, 22]] [[test]] name = "match-line-100-crlf-cr" regex = '(?Rm)^.+$' haystack = "aa\raaaaaaaaaaaaaaaaaaa\r" matches = [[0, 2], [3, 22]] [[test]] name = "match-line-200" regex = '(?m)^.+$' haystack = "aa\naaaaaaaaaaaaaaaaaaa\n" matches = [[0, 2], [3, 22]] unicode = false utf8 = false [[test]] name = "match-line-200-crlf" regex = '(?Rm)^.+$' haystack = "aa\naaaaaaaaaaaaaaaaaaa\n" matches = [[0, 2], [3, 22]] unicode = false utf8 = false [[test]] name = "match-line-200-crlf-cr" regex = '(?Rm)^.+$' haystack = "aa\raaaaaaaaaaaaaaaaaaa\r" matches = [[0, 2], [3, 22]] unicode = false utf8 = false
regex-1.12.2/testdata/no-unicode.toml
[[test]] name = "invalid-utf8-literal1" regex = '\xFF' haystack = '\xFF' matches = [[0, 1]] unicode = false utf8 = false unescape = true [[test]] name = "mixed" regex = '(?:.+)(?-u)(?:.+)' haystack = '\xCE\x93\xCE\x94\xFF' matches = [[0, 5]] utf8 = false unescape = true [[test]] name = "case1" regex = "a" haystack = "A" matches = [[0, 1]] case-insensitive = true unicode = false [[test]] name = "case2" regex = "[a-z]+" haystack = "AaAaA" matches = [[0, 5]] case-insensitive = true unicode = false [[test]] name = "case3" regex = "[a-z]+" haystack = "aA\u212AaA" matches = [[0, 7]] case-insensitive = true [[test]] name = "case4" regex = "[a-z]+" haystack = "aA\u212AaA" matches = [[0, 2], [5, 7]] case-insensitive = true unicode = false [[test]] name = "negate1" regex = "[^a]" haystack = "δ" matches = [[0, 2]] [[test]] name = "negate2" regex = "[^a]" haystack = "δ" matches = [[0, 1], [1, 2]] unicode = false utf8 = false [[test]] name = "dotstar-prefix1" regex = "a" haystack = '\xFFa' matches = [[1, 2]] unicode = false utf8 = false unescape = true [[test]] name = "dotstar-prefix2" regex = "a" haystack = '\xFFa' matches = [[1, 2]] utf8 = false unescape = true [[test]] name = "null-bytes1" regex = '[^\x00]+\x00' haystack = 'foo\x00' matches = [[0, 4]] unicode = false utf8 = false unescape = true [[test]] name = "word-ascii" regex = '\w+' haystack = "aδ" matches = [[0, 1]] unicode = false [[test]] name = "word-unicode" regex = '\w+' haystack = "aδ" matches = [[0, 3]] [[test]] name = "decimal-ascii" regex = '\d+' haystack = "1२३9" matches = [[0, 1], [7, 8]] unicode = false [[test]] name = "decimal-unicode" regex = '\d+' haystack = "1२३9" matches = [[0, 8]] [[test]] name = "space-ascii" regex = '\s+' haystack = " \u1680" matches = [[0, 1]] unicode = false [[test]] name = "space-unicode" regex = '\s+' haystack = " \u1680" matches = [[0, 4]] [[test]] # See: https://github.com/rust-lang/regex/issues/484 name = "iter1-bytes" regex = '' haystack = "☃" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] utf8 = false [[test]] # See: https://github.com/rust-lang/regex/issues/484 name = "iter1-utf8" regex = '' haystack = "☃" matches = [[0, 0], [3, 3]] [[test]] # See: https://github.com/rust-lang/regex/issues/484 # Note that iter2-utf8 doesn't make sense here, since the input isn't UTF-8. name = "iter2-bytes" regex = '' haystack = 'b\xFFr' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] unescape = true utf8 = false # These test that unanchored prefixes can munch through invalid UTF-8 even when # utf8 is enabled. # # This test actually reflects an interesting simplification in how the Thompson # NFA is constructed. It used to be that the NFA could be built with an # unanchored prefix that either matched any byte or _only_ matched valid UTF-8. # But the latter turns out to be pretty precarious when it comes to prefilters, # because if you search a haystack that contains invalid UTF-8 but have an # unanchored prefix that requires UTF-8, then prefilters are no longer a valid # optimization because you actually have to check that everything is valid # UTF-8. # # Originally, I had thought that we needed a valid UTF-8 unanchored prefix in # order to guarantee that we only match at valid UTF-8 boundaries. But this # isn't actually true!
There are really only two things to consider here: # # 1) Will a regex match split an encoded codepoint? No. Because by construction, # we ensure that a MATCH state can only be reached by following valid UTF-8 (assuming # all of the UTF-8 modes are enabled). # # 2) Will a regex match arbitrary bytes that aren't valid UTF-8? Again, no, # assuming all of the UTF-8 modes are enabled. [[test]] name = "unanchored-invalid-utf8-match-100" regex = '[a-z]' haystack = '\xFFa\xFF' matches = [[1, 2]] unescape = true utf8 = false # This test shows that we can still prevent a match from occurring by requiring # that valid UTF-8 match by inserting our own unanchored prefix. Thus, if the # behavior of not munching through invalid UTF-8 anywhere is needed, then it # can be achieved thusly. [[test]] name = "unanchored-invalid-utf8-nomatch" regex = '^(?s:.)*?[a-z]' haystack = '\xFFa\xFF' matches = [] unescape = true utf8 = false # This is a tricky test that makes sure we don't accidentally do a kind of # unanchored search when we've requested that a regex engine not report # empty matches that split a codepoint. This test caught a regression during # development where the code for skipping over bad empty matches would do so # even if the search should have been anchored. This is ultimately what led to # making 'anchored' an 'Input' option, so that it was always clear what kind # of search was being performed. (Before that, whether a search was anchored # or not was a config knob on the regex engine.) This did wind up making DFAs # a little more complex to configure (with their 'StartKind' knob), but it # generally smoothed out everything else. # # Great example of a test whose failure motivated a sweeping API refactoring. [[test]] name = "anchored-iter-empty-utf8" regex = '' haystack = 'a☃z' matches = [[0, 0], [1, 1]] unescape = false utf8 = true anchored = true
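# The haystacks above are most naturally expressed with the `regex::bytes`
# API, which searches `&[u8]` and therefore tolerates invalid UTF-8. A
# minimal sketch of the unanchored-invalid-utf8-match-100 case, plus the
# `^(?s:.)*?` trick for refusing to skip over invalid UTF-8:
#
#     use regex::bytes::Regex;
#
#     // The implicit unanchored prefix happily skips over \xFF...
#     let re = Regex::new(r"[a-z]").unwrap();
#     let m = re.find(b"\xFFa\xFF").unwrap();
#     assert_eq!((m.start(), m.end()), (1, 2));
#
#     // ...but an explicit "any codepoint" prefix prevents the match,
#     // since \xFF can never appear in valid UTF-8 and `(?s:.)` only
#     // matches whole codepoints.
#     let re = Regex::new(r"^(?s:.)*?[a-z]").unwrap();
#     assert!(re.find(b"\xFFa\xFF").is_none());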
[[test]] name = "ungreedy-dotstar-matches-everything-100" regex = [".*?"] haystack = "zzz" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [0, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [1, 2] }, { id = 0, span = [0, 2] }, { id = 0, span = [3, 3] }, { id = 0, span = [2, 3] }, { id = 0, span = [1, 3] }, { id = 0, span = [0, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "greedy-dotstar-matches-everything-100" regex = [".*"] haystack = "zzz" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [0, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [1, 2] }, { id = 0, span = [0, 2] }, { id = 0, span = [3, 3] }, { id = 0, span = [2, 3] }, { id = 0, span = [1, 3] }, { id = 0, span = [0, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "repetition-plus-leftmost-first-100" regex = 'a+' haystack = "aaa" matches = [[0, 1], [1, 2], [0, 2], [2, 3], [1, 3], [0, 3]] match-kind = "leftmost-first" search-kind = "overlapping" [[test]] name = "repetition-plus-leftmost-first-110" regex = 'โ˜ƒ+' haystack = "โ˜ƒโ˜ƒโ˜ƒ" matches = [[0, 3], [3, 6], [0, 6], [6, 9], [3, 9], [0, 9]] match-kind = "leftmost-first" search-kind = "overlapping" [[test]] name = "repetition-plus-all-100" regex = 'a+' haystack = "aaa" matches = [[0, 1], [1, 2], [0, 2], [2, 3], [1, 3], [0, 3]] match-kind = "all" search-kind = "overlapping" [[test]] name = "repetition-plus-all-110" regex = 'โ˜ƒ+' haystack = "โ˜ƒโ˜ƒโ˜ƒ" matches = [[0, 3], [3, 6], [0, 6], [6, 9], [3, 9], [0, 9]] match-kind = "all" search-kind = "overlapping" [[test]] name = "repetition-plus-leftmost-first-200" regex = '(abc)+' haystack = "zzabcabczzabc" matches = [ [[2, 5], [2, 5]], [[5, 8], [5, 8]], [[2, 8], [5, 8]], ] match-kind = "leftmost-first" search-kind = "overlapping" [[test]] name = "repetition-plus-all-200" regex = '(abc)+' haystack = "zzabcabczzabc" matches = [ [[2, 5], [2, 5]], [[5, 8], [5, 8]], [[2, 8], [5, 8]], [[10, 13], [10, 13]], ] match-kind = "all" search-kind = "overlapping" [[test]] name = "repetition-star-leftmost-first-100" regex = 'a*' haystack = "aaa" matches = [ [0, 0], [1, 1], [0, 1], [2, 2], [1, 2], [0, 2], [3, 3], [2, 3], [1, 3], [0, 3], ] match-kind = "leftmost-first" search-kind = "overlapping" [[test]] name = "repetition-star-all-100" regex = 'a*' haystack = "aaa" matches = [ [0, 0], [1, 1], [0, 1], [2, 2], [1, 2], [0, 2], [3, 3], [2, 3], [1, 3], [0, 3], ] match-kind = "all" search-kind = "overlapping" [[test]] name = "repetition-star-leftmost-first-200" regex = '(abc)*' haystack = "zzabcabczzabc" matches = [ [[0, 0], []], ] match-kind = "leftmost-first" search-kind = "overlapping" [[test]] name = "repetition-star-all-200" regex = '(abc)*' haystack = "zzabcabczzabc" matches = [ [[0, 0], []], [[1, 1], []], [[2, 2], []], [[3, 3], []], [[4, 4], []], [[5, 5], []], [[2, 5], [2, 5]], [[6, 6], []], [[7, 7], []], [[8, 8], []], [[5, 8], [5, 8]], [[2, 8], [5, 8]], [[9, 9], []], [[10, 10], []], [[11, 11], []], [[12, 12], []], [[13, 13], []], [[10, 13], [10, 13]], ] match-kind = "all" search-kind = "overlapping" [[test]] name = "start-end-rep-leftmost-first" regex = '(^$)*' haystack = "abc" matches = [ [[0, 0], []], ] match-kind = "leftmost-first" search-kind = "overlapping" [[test]] name = "start-end-rep-all" regex = '(^$)*' haystack = "abc" matches = [ [[0, 0], []], [[1, 1], []], [[2, 2], []], [[3, 3], []], ] match-kind = "all" search-kind = "overlapping" [[test]] name = "alt-leftmost-first-100" regex = 'abc|a' haystack = "zzabcazzaabc" 
[[test]] name = "alt-leftmost-first-100" regex = 'abc|a' haystack = "zzabcazzaabc" matches = [[2, 3], [2, 5]] match-kind = "leftmost-first" search-kind = "overlapping"
[[test]] name = "alt-all-100" regex = 'abc|a' haystack = "zzabcazzaabc" matches = [[2, 3], [2, 5], [5, 6], [8, 9], [9, 10], [9, 12]] match-kind = "all" search-kind = "overlapping"
[[test]] name = "empty-000" regex = "" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] match-kind = "all" search-kind = "overlapping"
[[test]] name = "empty-alt-000" regex = "|b" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [1, 2], [3, 3]] match-kind = "all" search-kind = "overlapping"
[[test]] name = "empty-alt-010" regex = "b|" haystack = "abc" matches = [[0, 0], [1, 1], [2, 2], [1, 2], [3, 3]] match-kind = "all" search-kind = "overlapping"

# See: https://github.com/rust-lang/regex/issues/484
[[test]] name = "iter1-bytes" regex = '' haystack = "☃" matches = [[0, 0], [1, 1], [2, 2], [3, 3]] utf8 = false match-kind = "all" search-kind = "overlapping"

# See: https://github.com/rust-lang/regex/issues/484
[[test]] name = "iter1-utf8" regex = '' haystack = "☃" matches = [[0, 0], [3, 3]] match-kind = "all" search-kind = "overlapping"

[[test]] name = "iter1-incomplete-utf8" regex = '' haystack = '\xE2\x98' # incomplete snowman
matches = [[0, 0], [1, 1], [2, 2]] match-kind = "all" search-kind = "overlapping" unescape = true utf8 = false

[[test]] name = "scratch" regex = ['sam', 'samwise'] haystack = "samwise" matches = [ { id = 0, span = [0, 3] }, ] match-kind = "leftmost-first" search-kind = "overlapping"

regex-1.12.2/testdata/regex-lite.toml

# These tests are specifically written to test the regex-lite crate. While it
# largely has the same semantics as the regex crate, there are some
# differences around Unicode support and UTF-8.
#
# To be clear, regex-lite supports far fewer patterns because of its lack of
# Unicode support, nested character classes and character class set
# operations. What we're talking about here are the patterns that both crates
# support but where the semantics might differ.

# regex-lite uses ASCII definitions for Perl character classes.
[[test]] name = "perl-class-decimal" regex = '\d' haystack = '᠕' matches = [] unicode = true

# regex-lite uses ASCII definitions for Perl character classes.
[[test]] name = "perl-class-space" regex = '\s' haystack = "\u2000" matches = [] unicode = true

# regex-lite uses ASCII definitions for Perl character classes.
[[test]] name = "perl-class-word" regex = '\w' haystack = 'δ' matches = [] unicode = true
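# A sketch of the difference the tests above capture, assuming both crates
# are available as dependencies (the haystack is U+0969, DEVANAGARI DIGIT
# THREE, which is `\d` under Unicode rules but not under ASCII rules):
#
#     fn main() {
#         // The regex crate: \d is Unicode-aware by default.
#         assert!(regex::Regex::new(r"\d").unwrap().is_match("३"));
#         // regex-lite: \d is always ASCII-only, so there is no match.
#         assert!(!regex_lite::Regex::new(r"\d").unwrap().is_match("३"));
#     }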
[[test]] name = "word-boundary" regex = '\b' haystack = 'ฮด' matches = [] unicode = true # regex-lite uses the ASCII definition of word for negated word boundary # assertions. But note that it should still not split codepoints! [[test]] name = "word-boundary-negated" regex = '\B' haystack = 'ฮด' matches = [[0, 0], [2, 2]] unicode = true # While we're here, the empty regex---which matches at every # position---shouldn't split a codepoint either. [[test]] name = "empty-no-split-codepoint" regex = '' haystack = '๐Ÿ’ฉ' matches = [[0, 0], [4, 4]] unicode = true # A dot always matches a full codepoint. [[test]] name = "dot-always-matches-codepoint" regex = '.' haystack = '๐Ÿ’ฉ' matches = [[0, 4]] unicode = false # A negated character class also always matches a full codepoint. [[test]] name = "negated-class-always-matches-codepoint" regex = '[^a]' haystack = '๐Ÿ’ฉ' matches = [[0, 4]] unicode = false # regex-lite only supports ASCII-aware case insensitive matching. [[test]] name = "case-insensitive-is-ascii-only" regex = 's' haystack = 'ลฟ' matches = [] unicode = true case-insensitive = true # Negated word boundaries shouldn't split a codepoint, but they will match # between invalid UTF-8. # # This test is only valid for a 'bytes' API, but that doesn't (yet) exist in # regex-lite. This can't happen in the main API because &str can't contain # invalid UTF-8. # [[test]] # name = "word-boundary-invalid-utf8" # regex = '\B' # haystack = '\xFF\xFF\xFF\xFF' # unescape = true # matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] # unicode = true # utf8 = false ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/regression.toml���������������������������������������������������������������0000644�0000000�0000000�00000055362�10461020230�0015722�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# See: https://github.com/rust-lang/regex/issues/48 [[test]] name = "invalid-regex-no-crash-100" regex = '(*)' haystack = "" matches = [] compiles = false # See: https://github.com/rust-lang/regex/issues/48 [[test]] name = "invalid-regex-no-crash-200" regex = '(?:?)' haystack = "" matches = [] compiles = false # See: https://github.com/rust-lang/regex/issues/48 [[test]] name = "invalid-regex-no-crash-300" regex = '(?)' haystack = "" matches = [] compiles = false # See: https://github.com/rust-lang/regex/issues/48 [[test]] name = "invalid-regex-no-crash-400" regex = '*' haystack = "" matches = [] compiles = false # See: https://github.com/rust-lang/regex/issues/75 [[test]] name = "unsorted-binary-search-100" regex = '(?i-u)[a_]+' haystack = "A_" matches = [[0, 2]] # See: https://github.com/rust-lang/regex/issues/75 [[test]] name = "unsorted-binary-search-200" regex = '(?i-u)[A_]+' haystack = "a_" matches = [[0, 2]] # See: https://github.com/rust-lang/regex/issues/76 [[test]] name = 
"unicode-case-lower-nocase-flag" regex = '(?i)\p{Ll}+' haystack = "ฮ›ฮ˜ฮ“ฮ”ฮฑ" matches = [[0, 10]] # See: https://github.com/rust-lang/regex/issues/99 [[test]] name = "negated-char-class-100" regex = '(?i)[^x]' haystack = "x" matches = [] # See: https://github.com/rust-lang/regex/issues/99 [[test]] name = "negated-char-class-200" regex = '(?i)[^x]' haystack = "X" matches = [] # See: https://github.com/rust-lang/regex/issues/101 [[test]] name = "ascii-word-underscore" regex = '[[:word:]]' haystack = "_" matches = [[0, 1]] # See: https://github.com/rust-lang/regex/issues/129 [[test]] name = "captures-repeat" regex = '([a-f]){2}(?P<foo>[x-z])' haystack = "abx" matches = [ [[0, 3], [1, 2], [2, 3]], ] # See: https://github.com/rust-lang/regex/issues/153 [[test]] name = "alt-in-alt-100" regex = 'ab?|$' haystack = "az" matches = [[0, 1], [2, 2]] # See: https://github.com/rust-lang/regex/issues/153 [[test]] name = "alt-in-alt-200" regex = '^(?:.*?)(?:\n|\r\n?|$)' haystack = "ab\rcd" matches = [[0, 3]] # See: https://github.com/rust-lang/regex/issues/169 [[test]] name = "leftmost-first-prefix" regex = 'z*azb' haystack = "azb" matches = [[0, 3]] # See: https://github.com/rust-lang/regex/issues/191 [[test]] name = "many-alternates" regex = '1|2|3|4|5|6|7|8|9|10|int' haystack = "int" matches = [[0, 3]] # See: https://github.com/rust-lang/regex/issues/204 [[test]] name = "word-boundary-alone-100" regex = '\b' haystack = "Should this (work?)" matches = [[0, 0], [6, 6], [7, 7], [11, 11], [13, 13], [17, 17]] # See: https://github.com/rust-lang/regex/issues/204 [[test]] name = "word-boundary-alone-200" regex = '\b' haystack = "a b c" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] # See: https://github.com/rust-lang/regex/issues/264 [[test]] name = "word-boundary-ascii-no-capture" regex = '\B' haystack = "\U00028F3E" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] unicode = false utf8 = false # See: https://github.com/rust-lang/regex/issues/264 [[test]] name = "word-boundary-ascii-capture" regex = '(?:\B)' haystack = "\U00028F3E" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] unicode = false utf8 = false # See: https://github.com/rust-lang/regex/issues/268 [[test]] name = "partial-anchor" regex = '^a|b' haystack = "ba" matches = [[0, 1]] # See: https://github.com/rust-lang/regex/issues/271 [[test]] name = "endl-or-word-boundary" regex = '(?m:$)|(?-u:\b)' haystack = "\U0006084E" matches = [[4, 4]] # See: https://github.com/rust-lang/regex/issues/271 [[test]] name = "zero-or-end" regex = '(?i-u:\x00)|$' haystack = "\U000E682F" matches = [[4, 4]] # See: https://github.com/rust-lang/regex/issues/271 [[test]] name = "y-or-endl" regex = '(?i-u:y)|(?m:$)' haystack = "\U000B4331" matches = [[4, 4]] # See: https://github.com/rust-lang/regex/issues/271 [[test]] name = "word-boundary-start-x" regex = '(?u:\b)^(?-u:X)' haystack = "X" matches = [[0, 1]] # See: https://github.com/rust-lang/regex/issues/271 [[test]] name = "word-boundary-ascii-start-x" regex = '(?-u:\b)^(?-u:X)' haystack = "X" matches = [[0, 1]] # See: https://github.com/rust-lang/regex/issues/271 [[test]] name = "end-not-word-boundary" regex = '$\B' haystack = "\U0005C124\U000B576C" matches = [[8, 8]] unicode = false utf8 = false # See: https://github.com/rust-lang/regex/issues/280 [[test]] name = "partial-anchor-alternate-begin" regex = '^a|z' haystack = "yyyyya" matches = [] # See: https://github.com/rust-lang/regex/issues/280 [[test]] name = "partial-anchor-alternate-end" regex = 'a$|z' haystack = "ayyyyy" matches = [] # See: 
# See: https://github.com/rust-lang/regex/issues/289
[[test]] name = "lits-unambiguous-100" regex = '(?:ABC|CDA|BC)X' haystack = "CDAX" matches = [[0, 4]]

# See: https://github.com/rust-lang/regex/issues/291
[[test]] name = "lits-unambiguous-200" regex = '((IMG|CAM|MG|MB2)_|(DSCN|CIMG))(?P<n>[0-9]+)$' haystack = "CIMG2341" matches = [ [[0, 8], [0, 4], [], [0, 4], [4, 8]], ]

# See: https://github.com/rust-lang/regex/issues/303
#
# 2022-09-19: This has now been "properly" fixed in that empty character
# classes are fully supported as something that can never match. This test
# used to be marked as 'compiles = false', but now it works.
[[test]] name = "negated-full-byte-range" regex = '[^\x00-\xFF]' haystack = "" matches = [] compiles = true unicode = false utf8 = false

# See: https://github.com/rust-lang/regex/issues/321
[[test]] name = "strange-anchor-non-complete-prefix" regex = 'a^{2}' haystack = "" matches = []

# See: https://github.com/rust-lang/regex/issues/321
[[test]] name = "strange-anchor-non-complete-suffix" regex = '${2}a' haystack = "" matches = []

# See: https://github.com/rust-lang/regex/issues/334
# See: https://github.com/rust-lang/regex/issues/557
[[test]] name = "captures-after-dfa-premature-end-100" regex = 'a(b*(X|$))?' haystack = "abcbX" matches = [ [[0, 1], [], []], ]

# See: https://github.com/rust-lang/regex/issues/334
# See: https://github.com/rust-lang/regex/issues/557
[[test]] name = "captures-after-dfa-premature-end-200" regex = 'a(bc*(X|$))?' haystack = "abcbX" matches = [ [[0, 1], [], []], ]

# See: https://github.com/rust-lang/regex/issues/334
# See: https://github.com/rust-lang/regex/issues/557
[[test]] name = "captures-after-dfa-premature-end-300" regex = '(aa$)?' haystack = "aaz" matches = [ [[0, 0], []], [[1, 1], []], [[2, 2], []], [[3, 3], []], ]

# Plucked from "Why aren't regular expressions a lingua franca? an empirical
# study on the re-use and portability of regular expressions", The ACM Joint
# European Software Engineering Conference and Symposium on the Foundations
# of Software Engineering (ESEC/FSE), 2019.
#
# Link: https://dl.acm.org/doi/pdf/10.1145/3338906.3338909
[[test]] name = "captures-after-dfa-premature-end-400" regex = '(a)\d*\.?\d+\b' haystack = "a0.0c" matches = [ [[0, 2], [0, 1]], ]

# See: https://github.com/rust-lang/regex/issues/437
[[test]] name = "literal-panic" regex = 'typename type\-parameter\-[0-9]+\-[0-9]+::.+' haystack = "test" matches = []

# See: https://github.com/rust-lang/regex/issues/527
[[test]] name = "empty-flag-expr" regex = '(?:(?:(?x)))' haystack = "" matches = [[0, 0]]

# See: https://github.com/rust-lang/regex/issues/533
#[[tests]]
#name = "blank-matches-nothing-between-space-and-tab"
#regex = '[[:blank:]]'
#input = '\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F'
#match = false
#unescape = true

# See: https://github.com/rust-lang/regex/issues/533
#[[tests]]
#name = "blank-matches-nothing-between-space-and-tab-inverted"
#regex = '^[[:^blank:]]+$'
#input = '\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F'
#match = true
#unescape = true

# See: https://github.com/rust-lang/regex/issues/555
[[test]] name = "invalid-repetition" regex = '(?m){1,1}' haystack = "" matches = [] compiles = false

# See: https://github.com/rust-lang/regex/issues/640
[[test]] name = "flags-are-unset" regex = '(?:(?i)foo)|Bar' haystack = "foo Foo bar Bar" matches = [[0, 3], [4, 7], [12, 15]]

# Note that 'Ј' is not 'j', but cyrillic Je
# https://en.wikipedia.org/wiki/Je_(Cyrillic)
#
# See: https://github.com/rust-lang/regex/issues/659
[[test]] name = "empty-group-with-unicode" regex = '(?:)Ј01' haystack = 'zЈ01' matches = [[1, 5]]

# See: https://github.com/rust-lang/regex/issues/579
[[test]] name = "word-boundary-weird" regex = '\b..\b' haystack = "I have 12, he has 2!" matches = [[0, 2], [7, 9], [9, 11], [11, 13], [17, 19]]

# See: https://github.com/rust-lang/regex/issues/579
[[test]] name = "word-boundary-weird-ascii" regex = '\b..\b' haystack = "I have 12, he has 2!" matches = [[0, 2], [7, 9], [9, 11], [11, 13], [17, 19]] unicode = false utf8 = false

# See: https://github.com/rust-lang/regex/issues/579
[[test]] name = "word-boundary-weird-minimal-ascii" regex = '\b..\b' haystack = "az,,b" matches = [[0, 2], [2, 4]] unicode = false utf8 = false

# See: https://github.com/BurntSushi/ripgrep/issues/1203
[[test]] name = "reverse-suffix-100" regex = '[0-4][0-4][0-4]000' haystack = "153.230000" matches = [[4, 10]]

# See: https://github.com/BurntSushi/ripgrep/issues/1203
[[test]] name = "reverse-suffix-200" regex = '[0-9][0-9][0-9]000' haystack = "153.230000\n" matches = [[4, 10]]

# This is a tricky case for the reverse suffix optimization, because it
# finds the 'foobar' match but the reverse scan must fail to find a match by
# correctly dealing with the word boundary following the 'foobar' literal
# when computing the start state.
#
# This test exists because I tried to break the following assumption that
# is currently in the code: that if a suffix is found and the reverse scan
# succeeds, then it's guaranteed that there is an overall match. Namely, the
# 'is_match' routine does *not* do another forward scan in this case because
# of this assumption.
[[test]] name = "reverse-suffix-300" regex = '\w+foobar\b' haystack = "xyzfoobarZ" matches = [] unicode = false utf8 = false # See: https://github.com/BurntSushi/ripgrep/issues/1247 [[test]] name = "stops" regex = '\bs(?:[ab])' haystack = 's\xE4' matches = [] unescape = true utf8 = false # See: https://github.com/BurntSushi/ripgrep/issues/1247 [[test]] name = "stops-ascii" regex = '(?-u:\b)s(?:[ab])' haystack = 's\xE4' matches = [] unescape = true utf8 = false # See: https://github.com/rust-lang/regex/issues/850 [[test]] name = "adjacent-line-boundary-100" regex = '(?m)^(?:[^ ]+?)$' haystack = "line1\nline2" matches = [[0, 5], [6, 11]] # Continued. [[test]] name = "adjacent-line-boundary-200" regex = '(?m)^(?:[^ ]+?)$' haystack = "A\nB" matches = [[0, 1], [2, 3]] # There is no issue for this bug. [[test]] name = "anchored-prefix-100" regex = '^a[[:^space:]]' haystack = "a " matches = [] # There is no issue for this bug. [[test]] name = "anchored-prefix-200" regex = '^a[[:^space:]]' haystack = "foo boo a" matches = [] # There is no issue for this bug. [[test]] name = "anchored-prefix-300" regex = '^-[a-z]' haystack = "r-f" matches = [] # Tests that a possible Aho-Corasick optimization works correctly. It only # kicks in when we have a lot of literals. By "works correctly," we mean that # leftmost-first match semantics are properly respected. That is, samwise # should match, not sam. # # There is no issue for this bug. [[test]] name = "aho-corasick-100" regex = 'samwise|sam|a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z' haystack = "samwise" matches = [[0, 7]] # See: https://github.com/rust-lang/regex/issues/921 [[test]] name = "interior-anchor-capture" regex = '(a$)b$' haystack = 'ab' matches = [] # I found this bug in the course of adding some of the regexes that Ruff uses # to rebar. It turns out that the lazy DFA was finding a match that was being # rejected by the one-pass DFA. Yikes. I then minimized the regex and haystack. 
# See: https://github.com/rust-lang/regex/issues/921
[[test]] name = "interior-anchor-capture" regex = '(a$)b$' haystack = 'ab' matches = []

# I found this bug in the course of adding some of the regexes that Ruff uses
# to rebar. It turns out that the lazy DFA was finding a match that was being
# rejected by the one-pass DFA. Yikes. I then minimized the regex and
# haystack.
#
# Source: https://github.com/charliermarsh/ruff/blob/a919041ddaa64cdf6f216f90dd0480dab69fd3ba/crates/ruff/src/rules/pycodestyle/rules/whitespace_around_keywords.rs#L52
[[test]] name = "ruff-whitespace-around-keywords" regex = '^(a|ab)$' haystack = "ab" anchored = true unicode = false utf8 = true matches = [[[0, 2], [0, 2]]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-0" regex = '(?:(?-u:\b)|(?u:h))+' haystack = "h" unicode = true utf8 = false matches = [[0, 0], [1, 1]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-1" regex = '(?u:\B)' haystack = "鋸" unicode = true utf8 = false matches = []

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-2" regex = '(?:(?u:\b)|(?s-u:.))+' haystack = "oB" unicode = true utf8 = false matches = [[0, 0], [1, 2]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-3" regex = '(?:(?-u:\B)|(?su:.))+' haystack = "\U000FEF80" unicode = true utf8 = false matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-3-utf8" regex = '(?:(?-u:\B)|(?su:.))+' haystack = "\U000FEF80" unicode = true utf8 = true matches = [[0, 0], [4, 4]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-4" regex = '(?m:$)(?m:^)(?su:.)' haystack = "\n‣" unicode = true utf8 = false matches = [[0, 1]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-5" regex = '(?m:$)^(?m:^)' haystack = "\n" unicode = true utf8 = false matches = [[0, 0]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-6" regex = '(?P<kp>(?iu:do)(?m:$))*' haystack = "dodo" unicode = true utf8 = false matches = [ [[0, 0], []], [[1, 1], []], [[2, 4], [2, 4]], ]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-7" regex = '(?u:\B)' haystack = "ไก" unicode = true utf8 = false matches = []

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-8" regex = '(?:(?-u:\b)|(?u:[\u{0}-W]))+' haystack = "0" unicode = true utf8 = false matches = [[0, 0], [1, 1]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-9" regex = '((?m:$)(?-u:\B)(?s-u:.)(?-u:\B)$)' haystack = "\n\n" unicode = true utf8 = false matches = [ [[1, 2], [1, 2]], ]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-10" regex = '(?m:$)(?m:$)^(?su:.)' haystack = "\n\u0081¨\u200a" unicode = true utf8 = false matches = [[0, 1]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-11" regex = '(?-u:\B)(?m:^)' haystack = "0\n" unicode = true utf8 = false matches = [[2, 2]]

# From: https://github.com/rust-lang/regex/issues/429
[[test]] name = "i429-12" regex = '(?:(?u:\b)|(?-u:.))+' haystack = "0" unicode = true utf8 = false matches = [[0, 0], [1, 1]]

# From: https://github.com/rust-lang/regex/issues/969
[[test]] name = "i969" regex = 'c.*d\z' haystack = "ababcd" bounds = [4, 6] search-kind = "earliest" matches = [[4, 6]]

# I found this during the regex-automata migration. This is the fowler basic
# 154 test, but without anchored = true and without a match limit.
#
# This test caught a subtle bug in the hybrid reverse DFA search, where it
# would skip over the termination condition if it entered a start state.
# This was a double bug.
# Firstly, the reverse DFA shouldn't have had start states specialized in the
# first place, and thus it shouldn't have been possible to detect that the
# DFA had entered a start state. The second bug was that the start state
# handling was incorrect by jumping over the termination condition.
[[test]] name = "fowler-basic154-unanchored" regex = '''a([bc]*)c*''' haystack = '''abc''' matches = [[[0, 3], [1, 3]]]

# From: https://github.com/rust-lang/regex/issues/981
#
# This was never really a problem in the new architecture because the
# regex-automata engines are far more principled about how they deal with
# look-around. (This was one of the many reasons I wanted to re-work the
# original regex crate engines.)
[[test]] name = "word-boundary-interact-poorly-with-literal-optimizations" regex = '(?i:(?:\b|_)win(?:32|64|dows)?(?:\b|_))' haystack = 'ubi-Darwin-x86_64.tar.gz' matches = []

# This was found during fuzz testing of regex. It provoked a panic in the
# meta engine as a result of the reverse suffix optimization. Namely, it hit
# a case where a suffix match was found, a corresponding reverse match was
# found, but the forward search turned up no match. The forward search should
# always match if the suffix and reverse search match.
#
# This in turn uncovered an inconsistency between the PikeVM and the DFA
# (lazy and fully compiled) engines. It was caused by a mishandling of the
# collection of NFA state IDs in the generic determinization code (which is
# why both types of DFA were impacted). Namely, when a fail state was
# encountered (that's the `[^\s\S]` in the pattern below), then it would just
# stop collecting states. But that's not correct since a later state could
# lead to a match.
[[test]] name = "impossible-branch" regex = '.*[^\s\S]A|B' haystack = "B" matches = [[0, 1]]

# This was found during fuzz testing in regex-lite. The regex crate never
# suffered from this bug, but it causes regex-lite to incorrectly compile
# captures.
[[test]] name = "captures-wrong-order" regex = '(a){0}(a)' haystack = 'a' matches = [[[0, 1], [], [0, 1]]]

# This tests a bug in how quit states are handled in the DFA. At some point
# during development, the DFAs were tweaked slightly such that if they hit
# a quit state (which means, they hit a byte that the caller configured
# should stop the search), then it might not return an error necessarily.
# Namely, if a match had already been found, then it would be returned
# instead of an error.
#
# But this is actually wrong! Why? Because even though a match had been
# found, it wouldn't be fully correct to return it once a quit state has been
# seen because you can't determine whether the match offset returned is the
# correct greedy/leftmost-first match. Since you can't complete the search as
# requested by the caller, the DFA should just stop and return an error.
#
# Interestingly, this does seem to produce an unavoidable difference between
# 'try_is_match().unwrap()' and 'try_find().unwrap().is_some()' for the DFAs.
# The former will stop immediately once a match is known to occur and return
# 'Ok(true)', whereas the latter could find the match but quit with an
# 'Err(..)' first.
#
# Thankfully, I believe this inconsistency between 'is_match()' and 'find()'
# cannot be observed in the higher level meta regex API because it
# specifically will try another engine that won't fail in the case of a DFA
# failing.
#
# This regression happened in the regex crate rewrite, but before anything
# got released.
[[test]] name = "negated-unicode-word-boundary-dfa-fail" regex = '\B.*' haystack = "!\u02D7" matches = [[0, 3]] # This failure was found in the *old* regex crate (prior to regex 1.9), but # I didn't investigate why. My best guess is that it's a literal optimization # bug. It didn't occur in the rewrite. [[test]] name = "missed-match" regex = 'e..+e.ee>' haystack = 'Zeee.eZZZZZZZZeee>eeeeeee>' matches = [[1, 26]] # This test came from the 'ignore' crate and tripped a bug in how accelerated # DFA states were handled in an overlapping search. [[test]] name = "regex-to-glob" regex = ['(?-u)^path1/[^/]*$'] haystack = "path1/foo" matches = [[0, 9]] utf8 = false match-kind = "all" search-kind = "overlapping" # See: https://github.com/rust-lang/regex/issues/1060 [[test]] name = "reverse-inner-plus-shorter-than-expected" regex = '(?:(\d+)[:.])?(\d{1,2})[:.](\d{2})' haystack = '102:12:39' matches = [[[0, 9], [0, 3], [4, 6], [7, 9]]] # Like reverse-inner-plus-shorter-than-expected, but using a far simpler regex # to demonstrate the extent of the rot. Sigh. # # See: https://github.com/rust-lang/regex/issues/1060 [[test]] name = "reverse-inner-short" regex = '(?:([0-9][0-9][0-9]):)?([0-9][0-9]):([0-9][0-9])' haystack = '102:12:39' matches = [[[0, 9], [0, 3], [4, 6], [7, 9]]] # This regression test was found via the RegexSet APIs. It triggered a # particular code path where a regex was compiled with 'All' match semantics # (to support overlapping search), but got funneled down into a standard # leftmost search when calling 'is_match'. This is fine on its own, but the # leftmost search will use a prefilter and that's where this went awry. # # Namely, since 'All' semantics were used, the aho-corasick prefilter was # incorrectly compiled with 'Standard' semantics. This was wrong because # 'Standard' immediately attempts to report a match at every position, even if # that would mean reporting a match past the leftmost match before reporting # the leftmost match. This breaks the prefilter contract of never having false # negatives and leads overall to the engine not finding a match. # # See: https://github.com/rust-lang/regex/issues/1070 [[test]] name = "prefilter-with-aho-corasick-standard-semantics" regex = '(?m)^ *v [0-9]' haystack = 'v 0' matches = [ { id = 0, spans = [[0, 3]] }, ] match-kind = "all" search-kind = "overlapping" unicode = true utf8 = true # This tests that the PikeVM and the meta regex agree on a particular regex. # This test previously failed when the ad hoc engines inside the meta engine # did not handle quit states correctly. Namely, the Unicode word boundary here # combined with a non-ASCII codepoint provokes the quit state. The ad hoc # engines were previously returning a match even after entering the quit state # if a match had been previously detected, but this is incorrect. The reason # is that if a quit state is found, then the search must give up *immediately* # because it prevents the search from finding the "proper" leftmost-first # match. If it instead returns a match that has been found, it risks reporting # an improper match, as it did in this case. # # See: https://github.com/rust-lang/regex/issues/1046 [[test]] name = "non-prefix-literal-quit-state" regex = '.+\b\n' haystack = "ฮฒ77\n" matches = [[0, 5]] # This is a regression test for some errant HIR interval set operations that # were made in the regex-syntax 0.8.0 release and then reverted in 0.8.1. The # issue here is that the HIR produced from the regex had out-of-order ranges. 
#
# See: https://github.com/rust-lang/regex/issues/1103
# Ref: https://github.com/rust-lang/regex/pull/1051
# Ref: https://github.com/rust-lang/regex/pull/1102
[[test]] name = "hir-optimization-out-of-order-class" regex = '^[[:alnum:]./-]+$' haystack = "a-b" matches = [[0, 3]]

# This is a regression test for an improper reverse suffix optimization. This
# occurred when I "broadened" the applicability of the optimization to
# include multiple possible literal suffixes instead of only sticking to a
# non-empty longest common suffix. It turns out that, at least given how the
# reverse suffix optimization works, we need to stick to the longest common
# suffix for now.
#
# See: https://github.com/rust-lang/regex/issues/1110
# See also: https://github.com/astral-sh/ruff/pull/7980
[[test]] name = 'improper-reverse-suffix-optimization' regex = '(\\N\{[^}]+})|([{}])' haystack = 'hiya \N{snowman} bye' matches = [[[5, 16], [5, 16], []]]

regex-1.12.2/testdata/set.toml

# Basic multi-regex tests.
[[test]] name = "basic10" regex = ["a", "a"] haystack = "a" matches = [ { id = 0, span = [0, 1] }, { id = 1, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic10-leftmost-first" regex = ["a", "a"] haystack = "a" matches = [ { id = 0, span = [0, 1] }, ] match-kind = "leftmost-first" search-kind = "leftmost"
[[test]] name = "basic20" regex = ["a", "a"] haystack = "ba" matches = [ { id = 0, span = [1, 2] }, { id = 1, span = [1, 2] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic30" regex = ["a", "b"] haystack = "a" matches = [ { id = 0, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic40" regex = ["a", "b"] haystack = "b" matches = [ { id = 1, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic50" regex = ["a|b", "b|a"] haystack = "b" matches = [ { id = 0, span = [0, 1] }, { id = 1, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic60" regex = ["foo", "oo"] haystack = "foo" matches = [ { id = 0, span = [0, 3] }, { id = 1, span = [1, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic60-leftmost-first" regex = ["foo", "oo"] haystack = "foo" matches = [ { id = 0, span = [0, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost"
[[test]] name = "basic61" regex = ["oo", "foo"] haystack = "foo" matches = [ { id = 1, span = [0, 3] }, { id = 0, span = [1, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic61-leftmost-first" regex = ["oo", "foo"] haystack = "foo" matches = [ { id = 1, span = [0, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost"
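# In the regex crate, the "which patterns matched" behavior that these
# multi-regex tests exercise surfaces through RegexSet. A minimal sketch,
# mirroring the 'basic81' test below:
#
#     use regex::RegexSet;
#
#     fn main() {
#         let set = RegexSet::new([r"^foo", r"bar$"]).unwrap();
#         let matches = set.matches("foo bar");
#         // Both patterns match somewhere in the haystack.
#         assert!(matches.matched(0) && matches.matched(1));
#         assert_eq!(matches.iter().collect::<Vec<_>>(), vec![0, 1]);
#     }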
[[test]] name = "basic70" regex = ["abcd", "bcd", "cd", "d"] haystack = "abcd" matches = [ { id = 0, span = [0, 4] }, { id = 1, span = [1, 4] }, { id = 2, span = [2, 4] }, { id = 3, span = [3, 4] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic71" regex = ["bcd", "cd", "d", "abcd"] haystack = "abcd" matches = [ { id = 3, span = [0, 4] }, ] match-kind = "leftmost-first" search-kind = "leftmost"
[[test]] name = "basic80" regex = ["^foo", "bar$"] haystack = "foo" matches = [ { id = 0, span = [0, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic81" regex = ["^foo", "bar$"] haystack = "foo bar" matches = [ { id = 0, span = [0, 3] }, { id = 1, span = [4, 7] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic82" regex = ["^foo", "bar$"] haystack = "bar" matches = [ { id = 1, span = [0, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic90" regex = ["[a-z]+$", "foo"] haystack = "01234 foo" matches = [ { id = 0, span = [8, 9] }, { id = 0, span = [7, 9] }, { id = 0, span = [6, 9] }, { id = 1, span = [6, 9] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic91" regex = ["[a-z]+$", "foo"] haystack = "foo 01234" matches = [ { id = 1, span = [0, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic100" regex = [".*?", "a"] haystack = "zzza" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [0, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [1, 2] }, { id = 0, span = [0, 2] }, { id = 0, span = [3, 3] }, { id = 0, span = [2, 3] }, { id = 0, span = [1, 3] }, { id = 0, span = [0, 3] }, { id = 0, span = [4, 4] }, { id = 0, span = [3, 4] }, { id = 0, span = [2, 4] }, { id = 0, span = [1, 4] }, { id = 0, span = [0, 4] }, { id = 1, span = [3, 4] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic101" regex = [".*", "a"] haystack = "zzza" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [0, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [1, 2] }, { id = 0, span = [0, 2] }, { id = 0, span = [3, 3] }, { id = 0, span = [2, 3] }, { id = 0, span = [1, 3] }, { id = 0, span = [0, 3] }, { id = 0, span = [4, 4] }, { id = 0, span = [3, 4] }, { id = 0, span = [2, 4] }, { id = 0, span = [1, 4] }, { id = 0, span = [0, 4] }, { id = 1, span = [3, 4] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic102" regex = [".*", "a"] haystack = "zzz" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [0, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [1, 2] }, { id = 0, span = [0, 2] }, { id = 0, span = [3, 3] }, { id = 0, span = [2, 3] }, { id = 0, span = [1, 3] }, { id = 0, span = [0, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic110" regex = ['\ba\b'] haystack = "hello a bye" matches = [ { id = 0, span = [6, 7] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic111" regex = ['\ba\b', '\be\b'] haystack = "hello a bye e" matches = [ { id = 0, span = [6, 7] }, { id = 1, span = [12, 13] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic120" regex = ["a"] haystack = "a" matches = [ { id = 0, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic121" regex = [".*a"] haystack = "a" matches = [ { id = 0, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "basic122" regex = [".*a", "β"] haystack = "β" matches = [ { id = 1, span = [0, 2] }, ] match-kind = "all" search-kind = "overlapping"
name = "basic130" regex = ["ab", "b"] haystack = "ba" matches = [ { id = 1, span = [0, 1] }, ] match-kind = "all" search-kind = "overlapping" # These test cases where one of the regexes matches the empty string. [[test]] name = "empty10" regex = ["", "a"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 1, span = [0, 1] }, { id = 0, span = [1, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty10-leftmost-first" regex = ["", "a"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost" [[test]] name = "empty11" regex = ["a", ""] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 0, span = [0, 1] }, { id = 1, span = [1, 1] }, { id = 1, span = [2, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty11-leftmost-first" regex = ["a", ""] haystack = "abc" matches = [ { id = 0, span = [0, 1] }, { id = 1, span = [2, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost" [[test]] name = "empty20" regex = ["", "b"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 1, span = [1, 2] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty20-leftmost-first" regex = ["", "b"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost" [[test]] name = "empty21" regex = ["b", ""] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 1, span = [1, 1] }, { id = 0, span = [1, 2] }, { id = 1, span = [2, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty21-leftmost-first" regex = ["b", ""] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 0, span = [1, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost" [[test]] name = "empty22" regex = ["(?:)", "b"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 1, span = [1, 2] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty23" regex = ["b", "(?:)"] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 1, span = [1, 1] }, { id = 0, span = [1, 2] }, { id = 1, span = [2, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty30" regex = ["", "z"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty30-leftmost-first" regex = ["", "z"] haystack = "abc" matches = [ { id = 0, span = [0, 0] }, { id = 0, span = [1, 1] }, { id = 0, span = [2, 2] }, { id = 0, span = [3, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost" [[test]] name = "empty31" regex = ["z", ""] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 1, span = [1, 1] }, { id = 1, span = [2, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "all" search-kind = "overlapping" [[test]] name = "empty31-leftmost-first" regex = ["z", ""] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 
[[test]] name = "empty31-leftmost-first" regex = ["z", ""] haystack = "abc" matches = [ { id = 1, span = [0, 0] }, { id = 1, span = [1, 1] }, { id = 1, span = [2, 2] }, { id = 1, span = [3, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost"
[[test]] name = "empty40" regex = ["c(?:)", "b"] haystack = "abc" matches = [ { id = 1, span = [1, 2] }, { id = 0, span = [2, 3] }, ] match-kind = "all" search-kind = "overlapping"
[[test]] name = "empty40-leftmost-first" regex = ["c(?:)", "b"] haystack = "abc" matches = [ { id = 1, span = [1, 2] }, { id = 0, span = [2, 3] }, ] match-kind = "leftmost-first" search-kind = "leftmost"

# These test cases where there are no matches.
[[test]] name = "nomatch10" regex = ["a", "a"] haystack = "b" matches = [] match-kind = "all" search-kind = "overlapping"
[[test]] name = "nomatch20" regex = ["^foo", "bar$"] haystack = "bar foo" matches = [] match-kind = "all" search-kind = "overlapping"
[[test]] name = "nomatch30" regex = [] haystack = "a" matches = [] match-kind = "all" search-kind = "overlapping"
[[test]] name = "nomatch40" regex = ["^rooted$", '\.log$'] haystack = "notrooted" matches = [] match-kind = "all" search-kind = "overlapping"

# These test multi-regex searches with capture groups.
#
# NOTE: I wrote these tests in the course of developing a first class API for
# overlapping capturing group matches, but ultimately removed that API
# because the semantics for overlapping matches aren't totally clear.
# However, I've left the tests because I believe the semantics for these
# patterns are clear and because we can still test our "which patterns
# matched" APIs with them.
[[test]] name = "caps-010" regex = ['^(\w+) (\w+)$', '^(\S+) (\S+)$'] haystack = "Bruce Springsteen" matches = [ { id = 0, spans = [[0, 17], [0, 5], [6, 17]] }, { id = 1, spans = [[0, 17], [0, 5], [6, 17]] }, ] match-kind = "all" search-kind = "overlapping" unicode = false utf8 = false
[[test]] name = "caps-020" regex = ['^(\w+) (\w+)$', '^[A-Z](\S+) [A-Z](\S+)$'] haystack = "Bruce Springsteen" matches = [ { id = 0, spans = [[0, 17], [0, 5], [6, 17]] }, { id = 1, spans = [[0, 17], [1, 5], [7, 17]] }, ] match-kind = "all" search-kind = "overlapping" unicode = false utf8 = false
[[test]] name = "caps-030" regex = ['^(\w+) (\w+)$', '^([A-Z])(\S+) ([A-Z])(\S+)$'] haystack = "Bruce Springsteen" matches = [ { id = 0, spans = [[0, 17], [0, 5], [6, 17]] }, { id = 1, spans = [[0, 17], [0, 1], [1, 5], [6, 7], [7, 17]] }, ] match-kind = "all" search-kind = "overlapping" unicode = false utf8 = false
[[test]] name = "caps-110" regex = ['(\w+) (\w+)', '(\S+) (\S+)'] haystack = "Bruce Springsteen" matches = [ { id = 0, spans = [[0, 17], [0, 5], [6, 17]] }, ] match-kind = "leftmost-first" search-kind = "leftmost" unicode = false utf8 = false
[[test]] name = "caps-120" regex = ['(\w+) (\w+)', '(\S+) (\S+)'] haystack = "&ruce $pringsteen" matches = [ { id = 1, spans = [[0, 17], [0, 5], [6, 17]] }, ] match-kind = "leftmost-first" search-kind = "leftmost" unicode = false utf8 = false
[[test]] name = "caps-121" regex = ['(\w+) (\w+)', '(\S+) (\S+)'] haystack = "&ruce $pringsteen Foo Bar" matches = [ { id = 1, spans = [[0, 17], [0, 5], [6, 17]] }, { id = 0, spans = [[18, 25], [18, 21], [22, 25]] }, ] match-kind = "leftmost-first" search-kind = "leftmost" unicode = false utf8 = false
regex-1.12.2/testdata/substring.toml

# These tests check that regex engines perform as expected when the search is
# instructed to only search a substring of a haystack instead of the entire
# haystack. This tends to exercise interesting edge cases that are otherwise
# difficult to provoke. (But not necessarily impossible. Regex search
# iterators, for example, make use of the "search just a substring" APIs by
# changing the starting position of a search to the end position of the
# previous match.)
[[test]] name = "unicode-word-start" regex = '\b[0-9]+\b' haystack = "β123" bounds = { start = 2, end = 5 } matches = []
[[test]] name = "unicode-word-end" regex = '\b[0-9]+\b' haystack = "123β" bounds = { start = 0, end = 3 } matches = []
[[test]] name = "ascii-word-start" regex = '\b[0-9]+\b' haystack = "β123" bounds = { start = 2, end = 5 } matches = [[2, 5]] unicode = false
[[test]] name = "ascii-word-end" regex = '\b[0-9]+\b' haystack = "123β" bounds = { start = 0, end = 3 } matches = [[0, 3]] unicode = false
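# One way to observe this from the regex crate is Regex::find_at, which
# starts the search at a given offset while keeping the surrounding context
# visible to look-around. A sketch mirroring the unicode-word-start and
# ascii-word-start tests above:
#
#     use regex::Regex;
#
#     fn main() {
#         let hay = "β123";
#         let re = Regex::new(r"\b[0-9]+\b").unwrap();
#         // Unicode \b sees the word character 'β' just before offset 2,
#         // so no match can start there.
#         assert!(re.find_at(hay, 2).is_none());
#         // Slicing instead discards that context and produces a match.
#         assert!(re.find(&hay[2..]).is_some());
#     }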
[[test]] name = "class1" regex = '[โ˜ƒโ… ]+' haystack = "โ˜ƒ" matches = [[0, 3]] [[test]] name = "class2" regex = '\pN' haystack = "โ… " matches = [[0, 3]] [[test]] name = "class3" regex = '\pN+' haystack = "โ… 1โ…ก2" matches = [[0, 8]] [[test]] name = "class4" regex = '\PN+' haystack = "abโ… " matches = [[0, 2]] [[test]] name = "class5" regex = '[\PN]+' haystack = "abโ… " matches = [[0, 2]] [[test]] name = "class6" regex = '[^\PN]+' haystack = "abโ… " matches = [[2, 5]] [[test]] name = "class7" regex = '\p{Lu}+' haystack = "ฮ›ฮ˜ฮ“ฮ”ฮฑ" matches = [[0, 8]] [[test]] name = "class8" regex = '\p{Lu}+' haystack = "ฮ›ฮ˜ฮ“ฮ”ฮฑ" matches = [[0, 10]] case-insensitive = true [[test]] name = "class9" regex = '\pL+' haystack = "ฮ›ฮ˜ฮ“ฮ”ฮฑ" matches = [[0, 10]] [[test]] name = "class10" regex = '\p{Ll}+' haystack = "ฮ›ฮ˜ฮ“ฮ”ฮฑ" matches = [[8, 10]] # Unicode aware "Perl" character classes. [[test]] name = "perl1" regex = '\w+' haystack = "dฮดd" matches = [[0, 4]] [[test]] name = "perl2" regex = '\w+' haystack = "โฅก" matches = [] [[test]] name = "perl3" regex = '\W+' haystack = "โฅก" matches = [[0, 3]] [[test]] name = "perl4" regex = '\d+' haystack = "1เฅจเฅฉ9" matches = [[0, 8]] [[test]] name = "perl5" regex = '\d+' haystack = "โ…ก" matches = [] [[test]] name = "perl6" regex = '\D+' haystack = "โ…ก" matches = [[0, 3]] [[test]] name = "perl7" regex = '\s+' haystack = "แš€" matches = [[0, 3]] [[test]] name = "perl8" regex = '\s+' haystack = "โ˜ƒ" matches = [] [[test]] name = "perl9" regex = '\S+' haystack = "โ˜ƒ" matches = [[0, 3]] # Specific tests for Unicode general category classes. [[test]] name = "class-gencat1" regex = '\p{Cased_Letter}' haystack = "๏ผก" matches = [[0, 3]] [[test]] name = "class-gencat2" regex = '\p{Close_Punctuation}' haystack = "โฏ" matches = [[0, 3]] [[test]] name = "class-gencat3" regex = '\p{Connector_Punctuation}' haystack = "โ€" matches = [[0, 3]] [[test]] name = "class-gencat4" regex = '\p{Control}' haystack = "\u009F" matches = [[0, 2]] [[test]] name = "class-gencat5" regex = '\p{Currency_Symbol}' haystack = "๏ฟก" matches = [[0, 3]] [[test]] name = "class-gencat6" regex = '\p{Dash_Punctuation}' haystack = "ใ€ฐ" matches = [[0, 3]] [[test]] name = "class-gencat7" regex = '\p{Decimal_Number}' haystack = "๐‘“™" matches = [[0, 4]] [[test]] name = "class-gencat8" regex = '\p{Enclosing_Mark}' haystack = "\uA672" matches = [[0, 3]] [[test]] name = "class-gencat9" regex = '\p{Final_Punctuation}' haystack = "โธก" matches = [[0, 3]] [[test]] name = "class-gencat10" regex = '\p{Format}' haystack = "\U000E007F" matches = [[0, 4]] [[test]] name = "class-gencat11" regex = '\p{Initial_Punctuation}' haystack = "โธœ" matches = [[0, 3]] [[test]] name = "class-gencat12" regex = '\p{Letter}' haystack = "ฮˆ" matches = [[0, 2]] [[test]] name = "class-gencat13" regex = '\p{Letter_Number}' haystack = "โ†‚" matches = [[0, 3]] [[test]] name = "class-gencat14" regex = '\p{Line_Separator}' haystack = "\u2028" matches = [[0, 3]] [[test]] name = "class-gencat15" regex = '\p{Lowercase_Letter}' haystack = "ฯ›" matches = [[0, 2]] [[test]] name = "class-gencat16" regex = '\p{Mark}' haystack = "\U000E01EF" matches = [[0, 4]] [[test]] name = "class-gencat17" regex = '\p{Math}' haystack = "โ‹ฟ" matches = [[0, 3]] [[test]] name = "class-gencat18" regex = '\p{Modifier_Letter}' haystack = "๐–ญƒ" matches = [[0, 4]] [[test]] name = "class-gencat19" regex = '\p{Modifier_Symbol}' haystack = "๐Ÿฟ" matches = [[0, 4]] [[test]] name = "class-gencat20" regex = '\p{Nonspacing_Mark}' haystack = "\U0001E94A" matches = [[0, 
4]] [[test]] name = "class-gencat21" regex = '\p{Number}' haystack = "โ“ฟ" matches = [[0, 3]] [[test]] name = "class-gencat22" regex = '\p{Open_Punctuation}' haystack = "๏ฝŸ" matches = [[0, 3]] [[test]] name = "class-gencat23" regex = '\p{Other}' haystack = "\u0BC9" matches = [[0, 3]] [[test]] name = "class-gencat24" regex = '\p{Other_Letter}' haystack = "๊“ท" matches = [[0, 3]] [[test]] name = "class-gencat25" regex = '\p{Other_Number}' haystack = "ใ‰" matches = [[0, 3]] [[test]] name = "class-gencat26" regex = '\p{Other_Punctuation}' haystack = "๐žฅž" matches = [[0, 4]] [[test]] name = "class-gencat27" regex = '\p{Other_Symbol}' haystack = "โ…Œ" matches = [[0, 3]] [[test]] name = "class-gencat28" regex = '\p{Paragraph_Separator}' haystack = "\u2029" matches = [[0, 3]] [[test]] name = "class-gencat29" regex = '\p{Private_Use}' haystack = "\U0010FFFD" matches = [[0, 4]] [[test]] name = "class-gencat30" regex = '\p{Punctuation}' haystack = "๐‘" matches = [[0, 4]] [[test]] name = "class-gencat31" regex = '\p{Separator}' haystack = "\u3000" matches = [[0, 3]] [[test]] name = "class-gencat32" regex = '\p{Space_Separator}' haystack = "\u205F" matches = [[0, 3]] [[test]] name = "class-gencat33" regex = '\p{Spacing_Mark}' haystack = "\U00016F7E" matches = [[0, 4]] [[test]] name = "class-gencat34" regex = '\p{Symbol}' haystack = "โฏˆ" matches = [[0, 3]] [[test]] name = "class-gencat35" regex = '\p{Titlecase_Letter}' haystack = "แฟผ" matches = [[0, 3]] [[test]] name = "class-gencat36" regex = '\p{Unassigned}' haystack = "\U0010FFFF" matches = [[0, 4]] [[test]] name = "class-gencat37" regex = '\p{Uppercase_Letter}' haystack = "๊Š" matches = [[0, 3]] # Tests for Unicode emoji properties. [[test]] name = "class-emoji1" regex = '\p{Emoji}' haystack = "\u23E9" matches = [[0, 3]] [[test]] name = "class-emoji2" regex = '\p{emoji}' haystack = "\U0001F21A" matches = [[0, 4]] [[test]] name = "class-emoji3" regex = '\p{extendedpictographic}' haystack = "\U0001FA6E" matches = [[0, 4]] [[test]] name = "class-emoji4" regex = '\p{extendedpictographic}' haystack = "\U0001FFFD" matches = [[0, 4]] # Tests for Unicode grapheme cluster properties. [[test]] name = "class-gcb1" regex = '\p{grapheme_cluster_break=prepend}' haystack = "\U00011D46" matches = [[0, 4]] [[test]] name = "class-gcb2" regex = '\p{gcb=regional_indicator}' haystack = "\U0001F1E6" matches = [[0, 4]] [[test]] name = "class-gcb3" regex = '\p{gcb=ri}' haystack = "\U0001F1E7" matches = [[0, 4]] [[test]] name = "class-gcb4" regex = '\p{regionalindicator}' haystack = "\U0001F1FF" matches = [[0, 4]] [[test]] name = "class-gcb5" regex = '\p{gcb=lvt}' haystack = "\uC989" matches = [[0, 3]] [[test]] name = "class-gcb6" regex = '\p{gcb=zwj}' haystack = "\u200D" matches = [[0, 3]] # Tests for Unicode word boundary properties. [[test]] name = "class-word-break1" regex = '\p{word_break=Hebrew_Letter}' haystack = "\uFB46" matches = [[0, 3]] [[test]] name = "class-word-break2" regex = '\p{wb=hebrewletter}' haystack = "\uFB46" matches = [[0, 3]] [[test]] name = "class-word-break3" regex = '\p{wb=ExtendNumLet}' haystack = "\uFF3F" matches = [[0, 3]] [[test]] name = "class-word-break4" regex = '\p{wb=WSegSpace}' haystack = "\u3000" matches = [[0, 3]] [[test]] name = "class-word-break5" regex = '\p{wb=numeric}' haystack = "\U0001E950" matches = [[0, 4]] # Tests for Unicode sentence boundary properties. 
[[test]] name = "class-sentence-break1" regex = '\p{sentence_break=Lower}' haystack = "\u0469" matches = [[0, 2]] [[test]] name = "class-sentence-break2" regex = '\p{sb=lower}' haystack = "\u0469" matches = [[0, 2]] [[test]] name = "class-sentence-break3" regex = '\p{sb=Close}' haystack = "\uFF60" matches = [[0, 3]] [[test]] name = "class-sentence-break4" regex = '\p{sb=Close}' haystack = "\U0001F677" matches = [[0, 4]] [[test]] name = "class-sentence-break5" regex = '\p{sb=SContinue}' haystack = "\uFF64" matches = [[0, 3]] ����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/utf8.toml���������������������������������������������������������������������0000644�0000000�0000000�00000026244�10461020230�0014425�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# These test the UTF-8 modes expose by regex-automata. Namely, when utf8 is # true, then we promise that the haystack is valid UTF-8. (Otherwise behavior # is unspecified.) This also corresponds to building the regex engine with the # following two guarantees: # # 1) For any non-empty match reported, its span is guaranteed to correspond to # valid UTF-8. # 2) All empty or zero-width matches reported must never split a UTF-8 # encoded codepoint. If the haystack has invalid UTF-8, then this results in # unspecified behavior. # # The (2) is in particular what we focus our testing on since (1) is generally # guaranteed by regex-syntax's AST-to-HIR translator and is well tested there. # The thing with (2) is that it can't be described in the HIR, so the regex # engines have to handle that case. Thus, we test it here. # # Note that it is possible to build a regex that has property (1) but not # (2), and vice versa. This is done by building the HIR with 'utf8=true' but # building the Thompson NFA with 'utf8=false'. We don't test that here because # the harness doesn't expose a way to enable or disable UTF-8 mode with that # granularity. Instead, those combinations are lightly tested via doc examples. # That's not to say that (1) without (2) is uncommon. Indeed, ripgrep uses it # because it cannot guarantee that its haystack is valid UTF-8. # This tests that an empty regex doesn't split a codepoint. [[test]] name = "empty-utf8yes" regex = '' haystack = 'โ˜ƒ' matches = [[0, 0], [3, 3]] unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-overlapping" regex = '' haystack = 'โ˜ƒ' matches = [[0, 0], [3, 3]] unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # This tests that an empty regex DOES split a codepoint when utf=false. [[test]] name = "empty-utf8no" regex = '' haystack = 'โ˜ƒ' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] unicode = true utf8 = false # Tests the overlapping case of the above. 
[[test]] name = "empty-utf8no-overlapping" regex = '' haystack = 'โ˜ƒ' matches = [[0, 0], [1, 1], [2, 2], [3, 3]] unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" # This tests that an empty regex doesn't split a codepoint, even if we give # it bounds entirely within the codepoint. # # This is one of the trickier cases and is what motivated the current UTF-8 # mode design. In particular, at one point, this test failed the 'is_match' # variant of the test but not 'find'. This is because the 'is_match' code path # is specifically optimized for "was a match found" rather than "where is the # match." In the former case, you don't really care about the empty-vs-non-empty # matches, and thus, the codepoint splitting filtering logic wasn't getting # applied. (In multiple ways across multiple regex engines.) In this way, you # can wind up with a situation where 'is_match' says "yes," but 'find' says, # "I didn't find anything." Which is... not great. # # I could have decided to say that providing boundaries that themselves split # a codepoint would have unspecified behavior. But I couldn't quite convince # myself that such boundaries were the only way to get an inconsistency between # 'is_match' and 'find'. # # Note that I also tried to come up with a test like this that fails without # using `bounds`. Specifically, a test where 'is_match' and 'find' disagree. # But I couldn't do it, and I'm tempted to conclude it is impossible. The # fundamental problem is that you need to simultaneously produce an empty match # that splits a codepoint while *not* matching before or after the codepoint. [[test]] name = "empty-utf8yes-bounds" regex = '' haystack = '๐›ƒ' bounds = [1, 3] matches = [] unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-bounds-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 3] matches = [] unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # This tests that an empty regex splits a codepoint when the bounds are # entirely within the codepoint. [[test]] name = "empty-utf8no-bounds" regex = '' haystack = '๐›ƒ' bounds = [1, 3] matches = [[1, 1], [2, 2], [3, 3]] unicode = true utf8 = false # Tests the overlapping case of the above. [[test]] name = "empty-utf8no-bounds-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 3] matches = [[1, 1], [2, 2], [3, 3]] unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" # In this test, we anchor the search. Since the start position is also a UTF-8 # boundary, we get a match. [[test]] name = "empty-utf8yes-anchored" regex = '' haystack = '๐›ƒ' matches = [[0, 0]] anchored = true unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-anchored-overlapping" regex = '' haystack = '๐›ƒ' matches = [[0, 0]] anchored = true unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # Same as above, except with UTF-8 mode disabled. It almost doesn't change the # result, except for the fact that since this is an anchored search and we # always find all matches, the test harness will keep reporting matches until # none are found. Because it's anchored, matches will be reported so long as # they are directly adjacent. Since with UTF-8 mode the next anchored search # after the match at [0, 0] fails, iteration stops (and doesn't find the last # match at [4, 4]). 
[[test]] name = "empty-utf8no-anchored" regex = '' haystack = '๐›ƒ' matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] anchored = true unicode = true utf8 = false # Tests the overlapping case of the above. # # Note that overlapping anchored searches are a little weird, and it's not # totally clear what their semantics ought to be. For now, we just test the # current behavior of our test shim that implements overlapping search. (This # is one of the reasons why we don't really expose regex-level overlapping # searches.) [[test]] name = "empty-utf8no-anchored-overlapping" regex = '' haystack = '๐›ƒ' matches = [[0, 0]] anchored = true unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" # In this test, we anchor the search, but also set bounds. The bounds start the # search in the middle of a codepoint, so there should never be a match. [[test]] name = "empty-utf8yes-anchored-bounds" regex = '' haystack = '๐›ƒ' matches = [] bounds = [1, 3] anchored = true unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-anchored-bounds-overlapping" regex = '' haystack = '๐›ƒ' matches = [] bounds = [1, 3] anchored = true unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # Same as above, except with UTF-8 mode disabled. Without UTF-8 mode enabled, # matching within a codepoint is allowed. And remember, as in the anchored test # above with UTF-8 mode disabled, iteration will report all adjacent matches. # The matches at [0, 0] and [4, 4] are not included because of the bounds of # the search. [[test]] name = "empty-utf8no-anchored-bounds" regex = '' haystack = '๐›ƒ' bounds = [1, 3] matches = [[1, 1], [2, 2], [3, 3]] anchored = true unicode = true utf8 = false # Tests the overlapping case of the above. # # Note that overlapping anchored searches are a little weird, and it's not # totally clear what their semantics ought to be. For now, we just test the # current behavior of our test shim that implements overlapping search. (This # is one of the reasons why we don't really expose regex-level overlapping # searches.) [[test]] name = "empty-utf8no-anchored-bounds-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 3] matches = [[1, 1]] anchored = true unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" # This tests that we find the match at the end of the string when the bounds # exclude the first match. [[test]] name = "empty-utf8yes-startbound" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [[4, 4]] unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-startbound-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [[4, 4]] unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # Same as above, except since UTF-8 mode is disabled, we also find the matches # inbetween that split the codepoint. [[test]] name = "empty-utf8no-startbound" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [[1, 1], [2, 2], [3, 3], [4, 4]] unicode = true utf8 = false # Tests the overlapping case of the above. [[test]] name = "empty-utf8no-startbound-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [[1, 1], [2, 2], [3, 3], [4, 4]] unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" # This tests that we don't find any matches in an anchored search, even when # the bounds include a match (at the end). 
[[test]] name = "empty-utf8yes-anchored-startbound" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [] anchored = true unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-anchored-startbound-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [] anchored = true unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # Same as above, except since UTF-8 mode is disabled, we also find the matches # inbetween that split the codepoint. Even though this is an anchored search, # since the matches are adjacent, we find all of them. [[test]] name = "empty-utf8no-anchored-startbound" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [[1, 1], [2, 2], [3, 3], [4, 4]] anchored = true unicode = true utf8 = false # Tests the overlapping case of the above. # # Note that overlapping anchored searches are a little weird, and it's not # totally clear what their semantics ought to be. For now, we just test the # current behavior of our test shim that implements overlapping search. (This # is one of the reasons why we don't really expose regex-level overlapping # searches.) [[test]] name = "empty-utf8no-anchored-startbound-overlapping" regex = '' haystack = '๐›ƒ' bounds = [1, 4] matches = [[1, 1]] anchored = true unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" # This tests that we find the match at the end of the haystack in UTF-8 mode # when our bounds only include the empty string at the end of the haystack. [[test]] name = "empty-utf8yes-anchored-endbound" regex = '' haystack = '๐›ƒ' bounds = [4, 4] matches = [[4, 4]] anchored = true unicode = true utf8 = true # Tests the overlapping case of the above. [[test]] name = "empty-utf8yes-anchored-endbound-overlapping" regex = '' haystack = '๐›ƒ' bounds = [4, 4] matches = [[4, 4]] anchored = true unicode = true utf8 = true match-kind = "all" search-kind = "overlapping" # Same as above, but with UTF-8 mode disabled. Results remain the same since # the only possible match does not split a codepoint. [[test]] name = "empty-utf8no-anchored-endbound" regex = '' haystack = '๐›ƒ' bounds = [4, 4] matches = [[4, 4]] anchored = true unicode = true utf8 = false # Tests the overlapping case of the above. [[test]] name = "empty-utf8no-anchored-endbound-overlapping" regex = '' haystack = '๐›ƒ' bounds = [4, 4] matches = [[4, 4]] anchored = true unicode = true utf8 = false match-kind = "all" search-kind = "overlapping" ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������regex-1.12.2/testdata/word-boundary-special.toml����������������������������������������������������0000644�0000000�0000000�00000030021�10461020230�0017735�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# These tests are for the "special" word boundary assertions. That is, # \b{start}, \b{end}, \b{start-half}, \b{end-half}. 
# Tests for (?-u:\b{start})
[[test]] name = "word-start-ascii-010" regex = '\b{start}' haystack = "a" matches = [[0, 0]] unicode = false [[test]] name = "word-start-ascii-020" regex = '\b{start}' haystack = "a " matches = [[0, 0]] unicode = false [[test]] name = "word-start-ascii-030" regex = '\b{start}' haystack = " a " matches = [[1, 1]] unicode = false [[test]] name = "word-start-ascii-040" regex = '\b{start}' haystack = "" matches = [] unicode = false [[test]] name = "word-start-ascii-050" regex = '\b{start}' haystack = "ab" matches = [[0, 0]] unicode = false [[test]] name = "word-start-ascii-060" regex = '\b{start}' haystack = "𝛃" matches = [] unicode = false [[test]] name = "word-start-ascii-060-bounds" regex = '\b{start}' haystack = "𝛃" bounds = [2, 3] matches = [] unicode = false [[test]] name = "word-start-ascii-070" regex = '\b{start}' haystack = " 𝛃 " matches = [] unicode = false [[test]] name = "word-start-ascii-080" regex = '\b{start}' haystack = "𝛃๐†€" matches = [] unicode = false [[test]] name = "word-start-ascii-090" regex = '\b{start}' haystack = "𝛃b" matches = [[4, 4]] unicode = false [[test]] name = "word-start-ascii-110" regex = '\b{start}' haystack = "b𝛃" matches = [[0, 0]] unicode = false
# Tests for (?-u:\b{end})
[[test]] name = "word-end-ascii-010" regex = '\b{end}' haystack = "a" matches = [[1, 1]] unicode = false [[test]] name = "word-end-ascii-020" regex = '\b{end}' haystack = "a " matches = [[1, 1]] unicode = false [[test]] name = "word-end-ascii-030" regex = '\b{end}' haystack = " a " matches = [[2, 2]] unicode = false [[test]] name = "word-end-ascii-040" regex = '\b{end}' haystack = "" matches = [] unicode = false [[test]] name = "word-end-ascii-050" regex = '\b{end}' haystack = "ab" matches = [[2, 2]] unicode = false [[test]] name = "word-end-ascii-060" regex = '\b{end}' haystack = "𝛃" matches = [] unicode = false [[test]] name = "word-end-ascii-060-bounds" regex = '\b{end}' haystack = "𝛃" bounds = [2, 3] matches = [] unicode = false [[test]] name = "word-end-ascii-070" regex = '\b{end}' haystack = " 𝛃 " matches = [] unicode = false [[test]] name = "word-end-ascii-080" regex = '\b{end}' haystack = "𝛃๐†€" matches = [] unicode = false [[test]] name = "word-end-ascii-090" regex = '\b{end}' haystack = "𝛃b" matches = [[5, 5]] unicode = false [[test]] name = "word-end-ascii-110" regex = '\b{end}' haystack = "b𝛃" matches = [[1, 1]] unicode = false
# Tests for \b{start}
[[test]] name = "word-start-unicode-010" regex = '\b{start}' haystack = "a" matches = [[0, 0]] unicode = true [[test]] name = "word-start-unicode-020" regex = '\b{start}' haystack = "a " matches = [[0, 0]] unicode = true [[test]] name = "word-start-unicode-030" regex = '\b{start}' haystack = " a " matches = [[1, 1]] unicode = true [[test]] name = "word-start-unicode-040" regex = '\b{start}' haystack = "" matches = [] unicode = true [[test]] name = "word-start-unicode-050" regex = '\b{start}' haystack = "ab" matches = [[0, 0]] unicode = true [[test]] name = "word-start-unicode-060" regex = '\b{start}' haystack = "𝛃" matches = [[0, 0]] unicode = true [[test]] name = "word-start-unicode-060-bounds" regex = '\b{start}' haystack = "𝛃" bounds = [2, 3] matches = [] unicode = true
[[test]] name = "word-start-unicode-070" regex = '\b{start}' haystack = " 𝛃 " matches = [[1, 1]] unicode = true [[test]] name = "word-start-unicode-080" regex = '\b{start}' haystack = "𝛃๐†€" matches = [[0, 0]] unicode = true [[test]] name = "word-start-unicode-090" regex = '\b{start}' haystack = "𝛃b" matches = [[0, 0]] unicode = true [[test]] name = "word-start-unicode-110" regex = '\b{start}' haystack = "b𝛃" matches = [[0, 0]] unicode = true
# Tests for \b{end}
[[test]] name = "word-end-unicode-010" regex = '\b{end}' haystack = "a" matches = [[1, 1]] unicode = true [[test]] name = "word-end-unicode-020" regex = '\b{end}' haystack = "a " matches = [[1, 1]] unicode = true [[test]] name = "word-end-unicode-030" regex = '\b{end}' haystack = " a " matches = [[2, 2]] unicode = true [[test]] name = "word-end-unicode-040" regex = '\b{end}' haystack = "" matches = [] unicode = true [[test]] name = "word-end-unicode-050" regex = '\b{end}' haystack = "ab" matches = [[2, 2]] unicode = true [[test]] name = "word-end-unicode-060" regex = '\b{end}' haystack = "𝛃" matches = [[4, 4]] unicode = true [[test]] name = "word-end-unicode-060-bounds" regex = '\b{end}' haystack = "𝛃" bounds = [2, 3] matches = [] unicode = true [[test]] name = "word-end-unicode-070" regex = '\b{end}' haystack = " 𝛃 " matches = [[5, 5]] unicode = true [[test]] name = "word-end-unicode-080" regex = '\b{end}' haystack = "𝛃๐†€" matches = [[4, 4]] unicode = true [[test]] name = "word-end-unicode-090" regex = '\b{end}' haystack = "𝛃b" matches = [[5, 5]] unicode = true [[test]] name = "word-end-unicode-110" regex = '\b{end}' haystack = "b𝛃" matches = [[5, 5]] unicode = true
# Tests for (?-u:\b{start-half})
[[test]] name = "word-start-half-ascii-010" regex = '\b{start-half}' haystack = "a" matches = [[0, 0]] unicode = false [[test]] name = "word-start-half-ascii-020" regex = '\b{start-half}' haystack = "a " matches = [[0, 0], [2, 2]] unicode = false [[test]] name = "word-start-half-ascii-030" regex = '\b{start-half}' haystack = " a " matches = [[0, 0], [1, 1], [3, 3]] unicode = false [[test]] name = "word-start-half-ascii-040" regex = '\b{start-half}' haystack = "" matches = [[0, 0]] unicode = false [[test]] name = "word-start-half-ascii-050" regex = '\b{start-half}' haystack = "ab" matches = [[0, 0]] unicode = false [[test]] name = "word-start-half-ascii-060" regex = '\b{start-half}' haystack = "𝛃" matches = [[0, 0], [4, 4]] unicode = false [[test]] name = "word-start-half-ascii-060-noutf8" regex = '\b{start-half}' haystack = "𝛃" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]] unicode = false utf8 = false [[test]] name = "word-start-half-ascii-060-bounds" regex = '\b{start-half}' haystack = "𝛃" bounds = [2, 3] matches = [] unicode = false [[test]] name = "word-start-half-ascii-070" regex = '\b{start-half}' haystack = " 𝛃 " matches = [[0, 0], [1, 1], [5, 5], [6, 6]] unicode = false [[test]] name = "word-start-half-ascii-080" regex = '\b{start-half}' haystack = "𝛃๐†€" matches = [[0, 0], [4, 4], [8, 8]] unicode = false [[test]] name = "word-start-half-ascii-090" regex = '\b{start-half}' haystack = "𝛃b" matches = [[0, 0], [4, 4]] unicode = false [[test]] name = "word-start-half-ascii-110" regex = '\b{start-half}' haystack = "b𝛃" matches = [[0, 0], [5, 5]] unicode = false
# Tests for (?-u:\b{end-half})
[[test]] name = "word-end-half-ascii-010" regex = '\b{end-half}' haystack = "a" matches = [[1, 1]] unicode = false [[test]] name = "word-end-half-ascii-020" regex = '\b{end-half}' haystack = "a " matches = [[1, 1], [2, 2]] unicode = false
false [[test]] name = "word-end-half-ascii-030" regex = '\b{end-half}' haystack = " a " matches = [[0, 0], [2, 2], [3, 3]] unicode = false [[test]] name = "word-end-half-ascii-040" regex = '\b{end-half}' haystack = "" matches = [[0, 0]] unicode = false [[test]] name = "word-end-half-ascii-050" regex = '\b{end-half}' haystack = "ab" matches = [[2, 2]] unicode = false [[test]] name = "word-end-half-ascii-060" regex = '\b{end-half}' haystack = "๐›ƒ" matches = [[0, 0], [4, 4]] unicode = false [[test]] name = "word-end-half-ascii-060-bounds" regex = '\b{end-half}' haystack = "๐›ƒ" bounds = [2, 3] matches = [] unicode = false [[test]] name = "word-end-half-ascii-070" regex = '\b{end-half}' haystack = " ๐›ƒ " matches = [[0, 0], [1, 1], [5, 5], [6, 6]] unicode = false [[test]] name = "word-end-half-ascii-080" regex = '\b{end-half}' haystack = "๐›ƒ๐†€" matches = [[0, 0], [4, 4], [8, 8]] unicode = false [[test]] name = "word-end-half-ascii-090" regex = '\b{end-half}' haystack = "๐›ƒb" matches = [[0, 0], [5, 5]] unicode = false [[test]] name = "word-end-half-ascii-110" regex = '\b{end-half}' haystack = "b๐›ƒ" matches = [[1, 1], [5, 5]] unicode = false # Tests for \b{start-half} [[test]] name = "word-start-half-unicode-010" regex = '\b{start-half}' haystack = "a" matches = [[0, 0]] unicode = true [[test]] name = "word-start-half-unicode-020" regex = '\b{start-half}' haystack = "a " matches = [[0, 0], [2, 2]] unicode = true [[test]] name = "word-start-half-unicode-030" regex = '\b{start-half}' haystack = " a " matches = [[0, 0], [1, 1], [3, 3]] unicode = true [[test]] name = "word-start-half-unicode-040" regex = '\b{start-half}' haystack = "" matches = [[0, 0]] unicode = true [[test]] name = "word-start-half-unicode-050" regex = '\b{start-half}' haystack = "ab" matches = [[0, 0]] unicode = true [[test]] name = "word-start-half-unicode-060" regex = '\b{start-half}' haystack = "๐›ƒ" matches = [[0, 0]] unicode = true [[test]] name = "word-start-half-unicode-060-bounds" regex = '\b{start-half}' haystack = "๐›ƒ" bounds = [2, 3] matches = [] unicode = true [[test]] name = "word-start-half-unicode-070" regex = '\b{start-half}' haystack = " ๐›ƒ " matches = [[0, 0], [1, 1], [6, 6]] unicode = true [[test]] name = "word-start-half-unicode-080" regex = '\b{start-half}' haystack = "๐›ƒ๐†€" matches = [[0, 0], [8, 8]] unicode = true [[test]] name = "word-start-half-unicode-090" regex = '\b{start-half}' haystack = "๐›ƒb" matches = [[0, 0]] unicode = true [[test]] name = "word-start-half-unicode-110" regex = '\b{start-half}' haystack = "b๐›ƒ" matches = [[0, 0]] unicode = true # Tests for \b{end-half} [[test]] name = "word-end-half-unicode-010" regex = '\b{end-half}' haystack = "a" matches = [[1, 1]] unicode = true [[test]] name = "word-end-half-unicode-020" regex = '\b{end-half}' haystack = "a " matches = [[1, 1], [2, 2]] unicode = true [[test]] name = "word-end-half-unicode-030" regex = '\b{end-half}' haystack = " a " matches = [[0, 0], [2, 2], [3, 3]] unicode = true [[test]] name = "word-end-half-unicode-040" regex = '\b{end-half}' haystack = "" matches = [[0, 0]] unicode = true [[test]] name = "word-end-half-unicode-050" regex = '\b{end-half}' haystack = "ab" matches = [[2, 2]] unicode = true [[test]] name = "word-end-half-unicode-060" regex = '\b{end-half}' haystack = "๐›ƒ" matches = [[4, 4]] unicode = true [[test]] name = "word-end-half-unicode-060-bounds" regex = '\b{end-half}' haystack = "๐›ƒ" bounds = [2, 3] matches = [] unicode = true [[test]] name = "word-end-half-unicode-070" regex = 
[[test]] name = "word-end-half-unicode-070" regex = '\b{end-half}' haystack = " 𝛃 " matches = [[0, 0], [5, 5], [6, 6]] unicode = true [[test]] name = "word-end-half-unicode-080" regex = '\b{end-half}' haystack = "𝛃๐†€" matches = [[4, 4], [8, 8]] unicode = true [[test]] name = "word-end-half-unicode-090" regex = '\b{end-half}' haystack = "𝛃b" matches = [[5, 5]] unicode = true [[test]] name = "word-end-half-unicode-110" regex = '\b{end-half}' haystack = "b𝛃" matches = [[5, 5]] unicode = true
# Specialty tests.
# Since \r is special cased in the start state computation (to deal with CRLF # mode), this test ensures that the correct start state is computed when the # pattern starts with a half word boundary assertion.
[[test]] name = "word-start-half-ascii-carriage" regex = '\b{start-half}[a-z]+' haystack = 'ABC\rabc' matches = [[4, 7]] bounds = [4, 7] unescape = true
# Since \n is also special cased in the start state computation, this test # ensures that the correct start state is computed when the pattern starts with # a half word boundary assertion.
[[test]] name = "word-start-half-ascii-linefeed" regex = '\b{start-half}[a-z]+' haystack = 'ABC\nabc' matches = [[4, 7]] bounds = [4, 7] unescape = true
# Like the carriage return test above, but with a custom line terminator.
[[test]] name = "word-start-half-ascii-customlineterm" regex = '\b{start-half}[a-z]+' haystack = 'ABC!abc' matches = [[4, 7]] bounds = [4, 7] unescape = true line-terminator = '!'
regex-1.12.2/testdata/word-boundary.toml
# Some of these are cribbed from RE2's test suite.
# These test \b. Below are tests for \B.
[[test]] name = "wb1" regex = '\b' haystack = "" matches = [] unicode = false [[test]] name = "wb2" regex = '\b' haystack = "a" matches = [[0, 0], [1, 1]] unicode = false [[test]] name = "wb3" regex = '\b' haystack = "ab" matches = [[0, 0], [2, 2]] unicode = false [[test]] name = "wb4" regex = '^\b' haystack = "ab" matches = [[0, 0]] unicode = false [[test]] name = "wb5" regex = '\b$' haystack = "ab" matches = [[2, 2]] unicode = false [[test]] name = "wb6" regex = '^\b$' haystack = "ab" matches = [] unicode = false [[test]] name = "wb7" regex = '\bbar\b' haystack = "nobar bar foo bar" matches = [[6, 9], [14, 17]] unicode = false [[test]] name = "wb8" regex = 'a\b' haystack = "faoa x" matches = [[3, 4]] unicode = false [[test]] name = "wb9" regex = '\bbar' haystack = "bar x" matches = [[0, 3]] unicode = false [[test]] name = "wb10" regex = '\bbar' haystack = "foo\nbar x" matches = [[4, 7]] unicode = false [[test]] name = "wb11" regex = 'bar\b' haystack = "foobar" matches = [[3, 6]] unicode = false [[test]] name = "wb12" regex = 'bar\b' haystack = "foobar\nxxx" matches = [[3, 6]] unicode = false [[test]] name = "wb13" regex = '(?:foo|bar|[A-Z])\b' haystack = "foo" matches = [[0, 3]] unicode = false [[test]] name = "wb14" regex = '(?:foo|bar|[A-Z])\b' haystack = "foo\n" matches = [[0, 3]] unicode = false [[test]] name = "wb15" regex = '\b(?:foo|bar|[A-Z])' haystack = "foo" matches = [[0, 3]] unicode = false [[test]] name = "wb16" regex = '\b(?:foo|bar|[A-Z])\b' haystack = "X" matches = [[0, 1]] unicode = false [[test]] name = "wb17" regex = '\b(?:foo|bar|[A-Z])\b' haystack = "XY" matches = [] unicode = false [[test]] name = "wb18" regex = '\b(?:foo|bar|[A-Z])\b' haystack = "bar" matches = [[0, 3]] unicode = false [[test]] name = "wb19" regex = '\b(?:foo|bar|[A-Z])\b' haystack = "foo" matches = [[0, 3]] unicode = false [[test]] name = "wb20" regex = '\b(?:foo|bar|[A-Z])\b' haystack = "foo\n" matches = [[0, 3]] unicode = false [[test]] name = "wb21" regex = '\b(?:foo|bar|[A-Z])\b' haystack = "ffoo bbar N x" matches = [[10, 11]] unicode = false [[test]] name = "wb22" regex = '\b(?:fo|foo)\b' haystack = "fo" matches = [[0, 2]] unicode = false [[test]] name = "wb23" regex = '\b(?:fo|foo)\b' haystack = "foo" matches = [[0, 3]] unicode = false [[test]] name = "wb24" regex = '\b\b' haystack = "" matches = [] unicode = false [[test]] name = "wb25" regex = '\b\b' haystack = "a" matches = [[0, 0], [1, 1]] unicode = false [[test]] name = "wb26" regex = '\b$' haystack = "" matches = [] unicode = false [[test]] name = "wb27" regex = '\b$' haystack = "x" matches = [[1, 1]] unicode = false [[test]] name = "wb28" regex = '\b$' haystack = "y x" matches = [[3, 3]] unicode = false [[test]] name = "wb29" regex = '(?-u:\b).$' haystack = "x" matches = [[0, 1]] [[test]] name = "wb30" regex = '^\b(?:fo|foo)\b' haystack = "fo" matches = [[0, 2]] unicode = false [[test]] name = "wb31" regex = '^\b(?:fo|foo)\b' haystack = "foo" matches = [[0, 3]] unicode = false [[test]] name = "wb32" regex = '^\b$' haystack = "" matches = [] unicode = false [[test]] name = "wb33" regex = '^\b$' haystack = "x" matches = [] unicode = false [[test]] name = "wb34" regex = '^(?-u:\b).$' haystack = "x" matches = [[0, 1]] [[test]] name = "wb35" regex = '^(?-u:\b).(?-u:\b)$' haystack = "x" matches = [[0, 1]] [[test]] name = "wb36" regex = '^^^^^\b$$$$$' haystack = "" matches = [] unicode = false [[test]] name = "wb37" regex = '^^^^^(?-u:\b).$$$$$' haystack = "x" matches = [[0, 1]] [[test]] name = "wb38" regex = '^^^^^\b$$$$$' haystack = "x" 
matches = [] unicode = false [[test]] name = "wb39" regex = '^^^^^(?-u:\b\b\b).(?-u:\b\b\b)$$$$$' haystack = "x" matches = [[0, 1]] [[test]] name = "wb40" regex = '(?-u:\b).+(?-u:\b)' haystack = "$$abc$$" matches = [[2, 5]] [[test]] name = "wb41" regex = '\b' haystack = "a b c" matches = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]] unicode = false [[test]] name = "wb42" regex = '\bfoo\b' haystack = "zzz foo zzz" matches = [[4, 7]] unicode = false [[test]] name = "wb43" regex = '\b^' haystack = "ab" matches = [[0, 0]] unicode = false [[test]] name = "wb44" regex = '$\b' haystack = "ab" matches = [[2, 2]] unicode = false # Tests for \B. Note that \B is not allowed if UTF-8 mode is enabled, so we # have to disable it for most of these tests. This is because \B can match at # non-UTF-8 boundaries. [[test]] name = "nb1" regex = '\Bfoo\B' haystack = "n foo xfoox that" matches = [[7, 10]] unicode = false utf8 = false [[test]] name = "nb2" regex = 'a\B' haystack = "faoa x" matches = [[1, 2]] unicode = false utf8 = false [[test]] name = "nb3" regex = '\Bbar' haystack = "bar x" matches = [] unicode = false utf8 = false [[test]] name = "nb4" regex = '\Bbar' haystack = "foo\nbar x" matches = [] unicode = false utf8 = false [[test]] name = "nb5" regex = 'bar\B' haystack = "foobar" matches = [] unicode = false utf8 = false [[test]] name = "nb6" regex = 'bar\B' haystack = "foobar\nxxx" matches = [] unicode = false utf8 = false [[test]] name = "nb7" regex = '(?:foo|bar|[A-Z])\B' haystack = "foox" matches = [[0, 3]] unicode = false utf8 = false [[test]] name = "nb8" regex = '(?:foo|bar|[A-Z])\B' haystack = "foo\n" matches = [] unicode = false utf8 = false [[test]] name = "nb9" regex = '\B' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb10" regex = '\B' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb11" regex = '\B(?:foo|bar|[A-Z])' haystack = "foo" matches = [] unicode = false utf8 = false [[test]] name = "nb12" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "xXy" matches = [[1, 2]] unicode = false utf8 = false [[test]] name = "nb13" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "XY" matches = [] unicode = false utf8 = false [[test]] name = "nb14" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "XYZ" matches = [[1, 2]] unicode = false utf8 = false [[test]] name = "nb15" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "abara" matches = [[1, 4]] unicode = false utf8 = false [[test]] name = "nb16" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "xfoo_" matches = [[1, 4]] unicode = false utf8 = false [[test]] name = "nb17" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "xfoo\n" matches = [] unicode = false utf8 = false [[test]] name = "nb18" regex = '\B(?:foo|bar|[A-Z])\B' haystack = "foo bar vNX" matches = [[9, 10]] unicode = false utf8 = false [[test]] name = "nb19" regex = '\B(?:fo|foo)\B' haystack = "xfoo" matches = [[1, 3]] unicode = false utf8 = false [[test]] name = "nb20" regex = '\B(?:foo|fo)\B' haystack = "xfooo" matches = [[1, 4]] unicode = false utf8 = false [[test]] name = "nb21" regex = '\B\B' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb22" regex = '\B\B' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb23" regex = '\B$' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb24" regex = '\B$' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb25" regex = '\B$' haystack = "y x" matches = [] unicode = false utf8 = false [[test]] name = "nb26" 
regex = '\B.$' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb27" regex = '^\B(?:fo|foo)\B' haystack = "fo" matches = [] unicode = false utf8 = false [[test]] name = "nb28" regex = '^\B(?:fo|foo)\B' haystack = "fo" matches = [] unicode = false utf8 = false [[test]] name = "nb29" regex = '^\B' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb30" regex = '^\B' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb31" regex = '^\B\B' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb32" regex = '^\B\B' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb33" regex = '^\B$' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb34" regex = '^\B$' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb35" regex = '^\B.$' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb36" regex = '^\B.\B$' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb37" regex = '^^^^^\B$$$$$' haystack = "" matches = [[0, 0]] unicode = false utf8 = false [[test]] name = "nb38" regex = '^^^^^\B.$$$$$' haystack = "x" matches = [] unicode = false utf8 = false [[test]] name = "nb39" regex = '^^^^^\B$$$$$' haystack = "x" matches = [] unicode = false utf8 = false
# unicode1* and unicode2* work for both Unicode and ASCII because all matches # are reported as byte offsets, and « and » do not correspond to word # boundaries at either the character or byte level.
[[test]] name = "unicode1" regex = '\bx\b' haystack = "«x" matches = [[2, 3]] [[test]] name = "unicode1-only-ascii" regex = '\bx\b' haystack = "«x" matches = [[2, 3]] unicode = false [[test]] name = "unicode2" regex = '\bx\b' haystack = "x»" matches = [[0, 1]] [[test]] name = "unicode2-only-ascii" regex = '\bx\b' haystack = "x»" matches = [[0, 1]] unicode = false
# ASCII word boundaries are completely oblivious to Unicode characters, so # even though β is a character, an ASCII \b treats it as a word boundary # when it is adjacent to another ASCII character. (The ASCII \b only looks # at the leading byte of β.) For Unicode \b, the tests are precisely inverted.
[[test]] name = "unicode3" regex = '\bx\b' haystack = 'áxβ' matches = [] [[test]] name = "unicode3-only-ascii" regex = '\bx\b' haystack = 'áxβ' matches = [[2, 3]] unicode = false [[test]] name = "unicode4" regex = '\Bx\B' haystack = 'áxβ' matches = [[2, 3]] [[test]] name = "unicode4-only-ascii" regex = '\Bx\B' haystack = 'áxβ' matches = [] unicode = false utf8 = false
# The same as above, but with \b instead of \B as a sanity check.
[[test]] name = "unicode5" regex = '\b' haystack = "0\U0007EF5E" matches = [[0, 0], [1, 1]] [[test]] name = "unicode5-only-ascii" regex = '\b' haystack = "0\U0007EF5E" matches = [[0, 0], [1, 1]] unicode = false utf8 = false [[test]] name = "unicode5-noutf8" regex = '\b' haystack = '0\xFF\xFF\xFF\xFF' matches = [[0, 0], [1, 1]] unescape = true utf8 = false [[test]] name = "unicode5-noutf8-only-ascii" regex = '\b' haystack = '0\xFF\xFF\xFF\xFF' matches = [[0, 0], [1, 1]] unescape = true unicode = false utf8 = false
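#
# The unicode3/unicode4 contrast above can be reproduced with the top-level
# crate; a minimal sketch, not part of the original suite:
#
#     // Unicode \b (the default): 'x' is surrounded by word characters.
#     let re = regex::Regex::new(r"\bx\b").unwrap();
#     assert!(!re.is_match("áxβ"));
#     // ASCII \b: the bytes of 'á' and 'β' are treated as non-word bytes.
#     let re = regex::RegexBuilder::new(r"\bx\b")
#         .unicode(false)
#         .build()
#         .unwrap();
#     assert_eq!(re.find("áxβ").map(|m| (m.start(), m.end())), Some((2, 3)));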
[[test]] name = "unicode5-not" regex = '\B' haystack = "0\U0007EF5E" matches = [[5, 5]] [[test]] name = "unicode5-not-only-ascii" regex = '\B' haystack = "0\U0007EF5E" matches = [[2, 2], [3, 3], [4, 4], [5, 5]] unicode = false utf8 = false # This gets no matches since \B only matches in the presence of valid UTF-8 # when Unicode is enabled, even when UTF-8 mode is disabled. [[test]] name = "unicode5-not-noutf8" regex = '\B' haystack = '0\xFF\xFF\xFF\xFF' matches = [] unescape = true utf8 = false # But this DOES get matches since \B in ASCII mode only looks at individual # bytes. [[test]] name = "unicode5-not-noutf8-only-ascii" regex = '\B' haystack = '0\xFF\xFF\xFF\xFF' matches = [[2, 2], [3, 3], [4, 4], [5, 5]] unescape = true unicode = false utf8 = false # Some tests of no particular significance. [[test]] name = "unicode6" regex = '\b[0-9]+\b' haystack = "foo 123 bar 456 quux 789" matches = [[4, 7], [12, 15], [21, 24]] [[test]] name = "unicode7" regex = '\b[0-9]+\b' haystack = "foo 123 bar a456 quux 789" matches = [[4, 7], [22, 25]] [[test]] name = "unicode8" regex = '\b[0-9]+\b' haystack = "foo 123 bar 456a quux 789" matches = [[4, 7], [22, 25]] # A variant of the problem described here: # https://github.com/google/re2/blob/89567f5de5b23bb5ad0c26cbafc10bdc7389d1fa/re2/dfa.cc#L658-L667 [[test]] name = "alt-with-assertion-repetition" regex = '(?:\b|%)+' haystack = "z%" bounds = [1, 2] anchored = true matches = [[1, 1]] ���������������������������������������������������������������regex-1.12.2/tests/lib.rs���������������������������������������������������������������������������0000644�0000000�0000000�00000002415�10461020230�0013301�0����������������������������������������������������������������������������������������������������ustar �����������������������������������������������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������#![cfg_attr(feature = "pattern", feature(pattern))] mod fuzz; mod misc; mod regression; mod regression_fuzz; mod replace; #[cfg(feature = "pattern")] mod searcher; mod suite_bytes; mod suite_bytes_set; mod suite_string; mod suite_string_set; const BLACKLIST: &[&str] = &[ // Nothing to blacklist yet! ]; fn suite() -> anyhow::Result<regex_test::RegexTests> { let _ = env_logger::try_init(); let mut tests = regex_test::RegexTests::new(); macro_rules! 
macro_rules! load { ($name:expr) => {{ const DATA: &[u8] = include_bytes!(concat!("../testdata/", $name, ".toml")); tests.load_slice($name, DATA)?; }}; } load!("anchored"); load!("bytes"); load!("crazy"); load!("crlf"); load!("earliest"); load!("empty"); load!("expensive"); load!("flags"); load!("iter"); load!("leftmost-all"); load!("line-terminator"); load!("misc"); load!("multiline"); load!("no-unicode"); load!("overlapping"); load!("regression"); load!("set"); load!("substring"); load!("unicode"); load!("utf8"); load!("word-boundary"); load!("word-boundary-special"); load!("fowler/basic"); load!("fowler/nullsubexpr"); load!("fowler/repetition"); Ok(tests) }
regex-1.12.2/tests/misc.rs
use regex::Regex; macro_rules! regex { ($pattern:expr) => { regex::Regex::new($pattern).unwrap() }; } #[test] fn unclosed_group_error() { let err = Regex::new(r"(").unwrap_err(); let msg = err.to_string(); assert!(msg.contains("unclosed group"), "error message: {msg:?}"); } #[test] fn regex_string() { assert_eq!(r"[a-zA-Z0-9]+", regex!(r"[a-zA-Z0-9]+").as_str()); assert_eq!(r"[a-zA-Z0-9]+", &format!("{}", regex!(r"[a-zA-Z0-9]+"))); assert_eq!( r#"Regex("[a-zA-Z0-9]+")"#, &format!("{:?}", regex!(r"[a-zA-Z0-9]+")) ); } #[test] fn capture_names() { let re = regex!(r"(.)(?P<a>.)"); assert_eq!(3, re.captures_len()); assert_eq!((3, Some(3)), re.capture_names().size_hint()); assert_eq!( vec![None, None, Some("a")], re.capture_names().collect::<Vec<_>>() ); } #[test] fn capture_index() { let re = regex!(r"^(?P<name>.+)$"); let cap = re.captures("abc").unwrap(); assert_eq!(&cap[0], "abc"); assert_eq!(&cap[1], "abc"); assert_eq!(&cap["name"], "abc"); } #[test] #[should_panic] fn capture_index_panic_usize() { let re = regex!(r"^(?P<name>.+)$"); let cap = re.captures("abc").unwrap(); let _ = cap[2]; } #[test] #[should_panic] fn capture_index_panic_name() { let re = regex!(r"^(?P<name>.+)$"); let cap = re.captures("abc").unwrap(); let _ = cap["bad name"]; } #[test] fn capture_index_lifetime() { // This is a test of whether the types on `caps["..."]` are general // enough. If not, this will fail to typecheck.
fn inner(s: &str) -> usize { let re = regex!(r"(?P<number>[0-9]+)"); let caps = re.captures(s).unwrap(); caps["number"].len() } assert_eq!(3, inner("123")); } #[test] fn capture_misc() { let re = regex!(r"(.)(?P<a>a)?(.)(?P<b>.)"); let cap = re.captures("abc").unwrap(); assert_eq!(5, cap.len()); assert_eq!((0, 3), { let m = cap.get(0).unwrap(); (m.start(), m.end()) }); assert_eq!(None, cap.get(2)); assert_eq!((2, 3), { let m = cap.get(4).unwrap(); (m.start(), m.end()) }); assert_eq!("abc", cap.get(0).unwrap().as_str()); assert_eq!(None, cap.get(2)); assert_eq!("c", cap.get(4).unwrap().as_str()); assert_eq!(None, cap.name("a")); assert_eq!("c", cap.name("b").unwrap().as_str()); } #[test] fn sub_capture_matches() { let re = regex!(r"([a-z])(([a-z])|([0-9]))"); let cap = re.captures("a5").unwrap(); let subs: Vec<_> = cap.iter().collect(); assert_eq!(5, subs.len()); assert!(subs[0].is_some()); assert!(subs[1].is_some()); assert!(subs[2].is_some()); assert!(subs[3].is_none()); assert!(subs[4].is_some()); assert_eq!("a5", subs[0].unwrap().as_str()); assert_eq!("a", subs[1].unwrap().as_str()); assert_eq!("5", subs[2].unwrap().as_str()); assert_eq!("5", subs[4].unwrap().as_str()); } // Test that the DFA can handle pathological cases. (This should result in the // DFA's cache being flushed too frequently, which should cause it to quit and // fall back to the NFA algorithm.) #[test] fn dfa_handles_pathological_case() { fn ones_and_zeroes(count: usize) -> String { let mut s = String::new(); for i in 0..count { if i % 3 == 0 { s.push('1'); } else { s.push('0'); } } s } let re = regex!(r"[01]*1[01]{20}$"); let text = { let mut pieces = ones_and_zeroes(100_000); pieces.push('1'); pieces.push_str(&ones_and_zeroes(20)); pieces }; assert!(re.is_match(&text)); }
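// Why the pattern above is pathological: the lazy DFA must effectively track
// which of the last 21 positions might have begun the suffix `1[01]{20}$`,
// which is on the order of 2^21 distinct DFA states. A minimal sketch of the
// knob that bounds the cache this thrashes (the limit value is arbitrary and
// this sketch is not part of the original suite):
//
//     let re = regex::RegexBuilder::new(r"[01]*1[01]{20}$")
//         .dfa_size_limit(1 << 20) // cap the lazy DFA's cache at ~1MiB
//         .build()
//         .unwrap();
//     assert!(re.is_match(&"1".repeat(21)));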
regex-1.12.2/tests/regression.rs
use regex::Regex; macro_rules! regex { ($pattern:expr) => { regex::Regex::new($pattern).unwrap() }; } // See: https://github.com/rust-lang/regex/issues/48 #[test] fn invalid_regexes_no_crash() { assert!(Regex::new("(*)").is_err()); assert!(Regex::new("(?:?)").is_err()); assert!(Regex::new("(?)").is_err()); assert!(Regex::new("*").is_err()); } // See: https://github.com/rust-lang/regex/issues/98 #[test] fn regression_many_repeat_stack_overflow() { let re = regex!("^.{1,2500}"); assert_eq!( vec![0..1], re.find_iter("a").map(|m| m.range()).collect::<Vec<_>>() ); } // See: https://github.com/rust-lang/regex/issues/555 #[test] fn regression_invalid_repetition_expr() { assert!(Regex::new("(?m){1,1}").is_err()); } // See: https://github.com/rust-lang/regex/issues/527 #[test] fn regression_invalid_flags_expression() { assert!(Regex::new("(((?x)))").is_ok()); } // See: https://github.com/rust-lang/regex/issues/129 #[test] fn regression_captures_rep() { let re = regex!(r"([a-f]){2}(?P<foo>[x-z])"); let caps = re.captures("abx").unwrap(); assert_eq!(&caps["foo"], "x"); } // See: https://github.com/BurntSushi/ripgrep/issues/1247 #[cfg(feature = "unicode-perl")] #[test] fn regression_nfa_stops1() { let re = regex::bytes::Regex::new(r"\bs(?:[ab])").unwrap(); assert_eq!(0, re.find_iter(b"s\xE4").count()); } // See: https://github.com/rust-lang/regex/issues/981 #[cfg(feature = "unicode")] #[test] fn regression_bad_word_boundary() { let re = regex!(r#"(?i:(?:\b|_)win(?:32|64|dows)?(?:\b|_))"#); let hay = "ubi-Darwin-x86_64.tar.gz"; assert!(!re.is_match(hay)); let hay = "ubi-Windows-x86_64.zip"; assert!(re.is_match(hay)); } // See: https://github.com/rust-lang/regex/issues/982 #[cfg(feature = "unicode-perl")] #[test] fn regression_unicode_perl_not_enabled() { let pat = r"(\d+\s?(years|year|y))?\s?(\d+\s?(months|month|m))?\s?(\d+\s?(weeks|week|w))?\s?(\d+\s?(days|day|d))?\s?(\d+\s?(hours|hour|h))?"; assert!(Regex::new(pat).is_ok()); } // See: https://github.com/rust-lang/regex/issues/995 #[test] fn regression_big_regex_overflow() { let pat = r" {2147483516}{2147483416}{5}"; assert!(Regex::new(pat).is_err()); } // See: https://github.com/rust-lang/regex/issues/999 #[test] fn regression_complete_literals_suffix_incorrect() { let needles = vec![ "aA", "bA", "cA", "dA", "eA", "fA", "gA", "hA", "iA", "jA", "kA", "lA", "mA", "nA", "oA", "pA", "qA", "rA", "sA", "tA", "uA", "vA", "wA", "xA", "yA", "zA", ]; let pattern = needles.join("|"); let re = regex!(&pattern); let hay = "FUBAR"; assert_eq!(0, re.find_iter(hay).count()); }
regex-1.12.2/tests/regression_fuzz.rs
// These tests are only run for the "default" test target because some of them // can take quite a long time. Some of them take long enough that it's not // practical to run them in debug mode. :-/ use regex::Regex; macro_rules! regex { ($pattern:expr) => { regex::Regex::new($pattern).unwrap() }; } // See: https://oss-fuzz.com/testcase-detail/5673225499181056 // // Ignored by default since it takes too long in debug mode (almost a minute). #[test] #[ignore] fn fuzz1() { regex!(r"1}{55}{0}*{1}{55}{55}{5}*{1}{55}+{56}|;**"); } // See: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=26505 // See: https://github.com/rust-lang/regex/issues/722 #[test] #[cfg(feature = "unicode")] fn empty_any_errors_no_panic() { assert!(Regex::new(r"\P{any}").is_ok()); } // This tests that a very large regex errors during compilation instead of // using gratuitous amounts of memory. The specific problem is that the // compiler wasn't accounting for the memory used by Unicode character classes // correctly. // // See: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=33579 #[test] fn big_regex_fails_to_compile() { let pat = "[\u{0}\u{e}\u{2}\\w~~>[l\t\u{0}]p?<]{971158}"; assert!(Regex::new(pat).is_err()); } // This was caught while on master but before a release went out(!). // // See: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=58173 #[test] fn todo() { let pat = "(?:z|xx)@|xx"; assert!(Regex::new(pat).is_ok()); } // This was caused by the fuzzer, and then minimized by hand. // // This was caused by a bug in DFA determinization that mishandled NFA fail // states. #[test] fn fail_branch_prevents_match() { let pat = r".*[a&&b]A|B"; let hay = "B"; let re = Regex::new(pat).unwrap(); assert!(re.is_match(hay)); }
regex-1.12.2/tests/replace.rs
macro_rules!
replace( ($name:ident, $which:ident, $re:expr, $search:expr, $replace:expr, $result:expr) => ( #[test] fn $name() { let re = regex::Regex::new($re).unwrap(); assert_eq!(re.$which($search, $replace), $result); } ); ); replace!(first, replace, r"[0-9]", "age: 26", "Z", "age: Z6"); replace!(plus, replace, r"[0-9]+", "age: 26", "Z", "age: Z"); replace!(all, replace_all, r"[0-9]", "age: 26", "Z", "age: ZZ"); replace!(groups, replace, r"([^ ]+)[ ]+([^ ]+)", "w1 w2", "$2 $1", "w2 w1"); replace!( double_dollar, replace, r"([^ ]+)[ ]+([^ ]+)", "w1 w2", "$2 $$1", "w2 $1" ); // replace!(adjacent_index, replace, // r"([^aeiouy])ies$", "skies", "$1y", "sky"); replace!( named, replace_all, r"(?P<first>[^ ]+)[ ]+(?P<last>[^ ]+)(?P<space>[ ]*)", "w1 w2 w3 w4", "$last $first$space", "w2 w1 w4 w3" ); replace!( trim, replace_all, "^[ \t]+|[ \t]+$", " \t trim me\t \t", "", "trim me" ); replace!(number_hyphen, replace, r"(.)(.)", "ab", "$1-$2", "a-b"); // replace!(number_underscore, replace, r"(.)(.)", "ab", "$1_$2", "a_b"); replace!( simple_expand, replace_all, r"([a-z]) ([a-z])", "a b", "$2 $1", "b a" ); replace!( literal_dollar1, replace_all, r"([a-z]+) ([a-z]+)", "a b", "$$1", "$1" ); replace!( literal_dollar2, replace_all, r"([a-z]+) ([a-z]+)", "a b", "$2 $$c $1", "b $c a" ); replace!( no_expand1, replace, r"([^ ]+)[ ]+([^ ]+)", "w1 w2", regex::NoExpand("$2 $1"), "$2 $1" ); replace!( no_expand2, replace, r"([^ ]+)[ ]+([^ ]+)", "w1 w2", regex::NoExpand("$$1"), "$$1" ); replace!( closure_returning_reference, replace, r"([0-9]+)", "age: 26", |captures: &regex::Captures<'_>| { captures[1][0..1].to_owned() }, "age: 2" ); replace!( closure_returning_value, replace, r"[0-9]+", "age: 26", |_captures: &regex::Captures<'_>| "Z".to_owned(), "age: Z" ); // See https://github.com/rust-lang/regex/issues/314 replace!( match_at_start_replace_with_empty, replace_all, r"foo", "foobar", "", "bar" ); // See https://github.com/rust-lang/regex/issues/393 replace!(single_empty_match, replace, r"^", "bar", "foo", "foobar"); // See https://github.com/rust-lang/regex/issues/399 replace!( capture_longest_possible_name, replace_all, r"(.)", "b", "${1}a $1a", "ba " ); replace!( impl_string, replace, r"[0-9]", "age: 26", "Z".to_string(), "age: Z6" ); replace!( impl_string_ref, replace, r"[0-9]", "age: 26", &"Z".to_string(), "age: Z6" ); replace!( impl_cow_str_borrowed, replace, r"[0-9]", "age: 26", std::borrow::Cow::<'_, str>::Borrowed("Z"), "age: Z6" ); replace!( impl_cow_str_borrowed_ref, replace, r"[0-9]", "age: 26", &std::borrow::Cow::<'_, str>::Borrowed("Z"), "age: Z6" ); replace!( impl_cow_str_owned, replace, r"[0-9]", "age: 26", std::borrow::Cow::<'_, str>::Owned("Z".to_string()), "age: Z6" ); replace!( impl_cow_str_owned_ref, replace, r"[0-9]", "age: 26", &std::borrow::Cow::<'_, str>::Owned("Z".to_string()), "age: Z6" ); #[test] fn replacen_no_captures() { let re = regex::Regex::new(r"[0-9]").unwrap(); assert_eq!(re.replacen("age: 1234", 2, "Z"), "age: ZZ34"); } #[test] fn replacen_with_captures() { let re = regex::Regex::new(r"([0-9])").unwrap(); assert_eq!(re.replacen("age: 1234", 2, "${1}Z"), "age: 1Z2Z34"); }
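// A note on the `capture_longest_possible_name` case above: in a replacement
// string, `$1a` is parsed as a reference to the (nonexistent) group named
// `1a`, which expands to the empty string; braces disambiguate. A minimal
// sketch, not part of the original suite:
//
//     let re = regex::Regex::new(r"(.)").unwrap();
//     assert_eq!(re.replace_all("b", "${1}a $1a"), "ba ");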
regex-1.12.2/tests/searcher.rs
macro_rules! searcher { ($name:ident, $re:expr, $haystack:expr) => ( searcher!($name, $re, $haystack, vec vec![]); ); ($name:ident, $re:expr, $haystack:expr, $($steps:expr,)*) => ( searcher!($name, $re, $haystack, vec vec![$($steps),*]); ); ($name:ident, $re:expr, $haystack:expr, $($steps:expr),*) => ( searcher!($name, $re, $haystack, vec vec![$($steps),*]); ); ($name:ident, $re:expr, $haystack:expr, vec $expect_steps:expr) => ( #[test] #[allow(unused_imports)] fn $name() { use std::str::pattern::{Pattern, Searcher}; use std::str::pattern::SearchStep::{Match, Reject, Done}; let re = regex::Regex::new($re).unwrap(); let mut se = re.into_searcher($haystack); let mut got_steps = vec![]; loop { match se.next() { Done => break, step => { got_steps.push(step); } } } assert_eq!(got_steps, $expect_steps); } ); } searcher!(searcher_empty_regex_empty_haystack, r"", "", Match(0, 0)); searcher!( searcher_empty_regex, r"", "ab", Match(0, 0), Reject(0, 1), Match(1, 1), Reject(1, 2), Match(2, 2) ); searcher!(searcher_empty_haystack, r"\d", ""); searcher!(searcher_one_match, r"\d", "5", Match(0, 1)); searcher!(searcher_no_match, r"\d", "a", Reject(0, 1)); searcher!( searcher_two_adjacent_matches, r"\d", "56", Match(0, 1), Match(1, 2) ); searcher!( searcher_two_non_adjacent_matches, r"\d", "5a6", Match(0, 1), Reject(1, 2), Match(2, 3) ); searcher!(searcher_reject_first, r"\d", "a6", Reject(0, 1), Match(1, 2)); searcher!( searcher_one_zero_length_matches, r"\d*", "a1b2", Match(0, 0), // ^ Reject(0, 1), // a Match(1, 2), // a1 Reject(2, 3), // a1b Match(3, 4), // a1b2 ); searcher!( searcher_many_zero_length_matches, r"\d*", "a1bbb2", Match(0, 0), // ^ Reject(0, 1), // a Match(1, 2), // a1 Reject(2, 3), // a1b Match(3, 3), // a1bb Reject(3, 4), // a1bb Match(4, 4), // a1bbb Reject(4, 5), // a1bbb Match(5, 6), // a1bbb2 ); searcher!( searcher_unicode, r".+?", "Ⅰ1Ⅱ2", Match(0, 3), Match(3, 4), Match(4, 7), Match(7, 8) );
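// For reference, the `Pattern` integration exercised above means a `&Regex`
// can be passed to the standard `str` search methods on nightly (with the
// `pattern` crate feature enabled). A minimal sketch, not part of the
// original suite:
//
//     let re = regex::Regex::new(r"\d+").unwrap();
//     assert_eq!("a1b22".find(&re), Some(1));
//     assert_eq!("a1b22".matches(&re).count(), 2);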
regex-1.12.2/tests/suite_bytes.rs
use { anyhow::Result, regex::bytes::{Regex, RegexBuilder}, regex_test::{ CompiledRegex, Match, RegexTest, Span, TestResult, TestRunner, }, }; /// Tests the default configuration of the hybrid NFA/DFA. #[test] fn default() -> Result<()> { let mut runner = TestRunner::new()?; runner .expand(&["is_match", "find", "captures"], |test| test.compiles()) .blacklist_iter(super::BLACKLIST) .test_iter(crate::suite()?.iter(), compiler) .assert(); Ok(()) } fn run_test(re: &Regex, test: &RegexTest) -> TestResult { match test.additional_name() { "is_match" => TestResult::matched(re.is_match(test.haystack())), "find" => TestResult::matches( re.find_iter(test.haystack()) .take(test.match_limit().unwrap_or(std::usize::MAX)) .map(|m| Match { id: 0, span: Span { start: m.start(), end: m.end() }, }), ), "captures" => { let it = re .captures_iter(test.haystack()) .take(test.match_limit().unwrap_or(std::usize::MAX)) .map(|caps| testify_captures(&caps)); TestResult::captures(it) } name => TestResult::fail(&format!("unrecognized test name: {name}")), } } /// Converts the given regex test to a closure that searches with a /// `bytes::Regex`. If the test configuration is unsupported, then a /// `CompiledRegex` that skips the test is returned. fn compiler( test: &RegexTest, _patterns: &[String], ) -> anyhow::Result<CompiledRegex> { let skip = Ok(CompiledRegex::skip()); // We're only testing bytes::Regex here, which supports one pattern only. let pattern = match test.regexes().len() { 1 => &test.regexes()[0], _ => return skip, }; // We only test is_match, find_iter and captures_iter. All of those are // leftmost searches. if !matches!(test.search_kind(), regex_test::SearchKind::Leftmost) { return skip; } // The top-level single-pattern regex API always uses leftmost-first. if !matches!(test.match_kind(), regex_test::MatchKind::LeftmostFirst) { return skip; } // The top-level regex API always runs unanchored searches. ... But we can // handle tests that are anchored but have only one match. if test.anchored() && test.match_limit() != Some(1) { return skip; } // We don't support tests with explicit search bounds. We could probably // support this by using the 'find_at' (and such) APIs. let bounds = test.bounds(); if !(bounds.start == 0 && bounds.end == test.haystack().len()) { return skip; } // The bytes::Regex API specifically does not support enabling UTF-8 mode. // It could I suppose, but currently it does not. That is, it permits // matches to have offsets that split codepoints. if test.utf8() { return skip; } // If the test requires Unicode but the Unicode feature isn't enabled, // skip it. This is a little aggressive, but the test suite doesn't // have any easy way of communicating which Unicode features are needed. if test.unicode() && !cfg!(feature = "unicode") { return skip; } let re = RegexBuilder::new(pattern) .case_insensitive(test.case_insensitive()) .unicode(test.unicode()) .line_terminator(test.line_terminator()) .build()?; Ok(CompiledRegex::compiled(move |test| run_test(&re, test))) } /// Convert `Captures` into the test suite's capture values.
/// Convert `Captures` into the test suite's capture values.
fn testify_captures(
    caps: &regex::bytes::Captures<'_>,
) -> regex_test::Captures {
    let spans = caps.iter().map(|group| {
        group.map(|m| regex_test::Span { start: m.start(), end: m.end() })
    });
    // This unwrap is OK because we assume our 'caps' represents a match, and
    // a match always gives a non-zero number of groups with the first group
    // being non-None.
    regex_test::Captures::new(0, spans).unwrap()
}

regex-1.12.2/tests/suite_bytes_set.rs

use {
    anyhow::Result,
    regex::bytes::{RegexSet, RegexSetBuilder},
    regex_test::{CompiledRegex, RegexTest, TestResult, TestRunner},
};

/// Tests the default configuration of `bytes::RegexSet`.
#[test]
fn default() -> Result<()> {
    let mut runner = TestRunner::new()?;
    runner
        .expand(&["is_match", "which"], |test| test.compiles())
        .blacklist_iter(super::BLACKLIST)
        .test_iter(crate::suite()?.iter(), compiler)
        .assert();
    Ok(())
}

fn run_test(re: &RegexSet, test: &RegexTest) -> TestResult {
    match test.additional_name() {
        "is_match" => TestResult::matched(re.is_match(test.haystack())),
        "which" => TestResult::which(re.matches(test.haystack()).iter()),
        name => TestResult::fail(&format!("unrecognized test name: {name}")),
    }
}
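// A standalone illustration of the "which" semantics tested above: a
// `RegexSet` reports every pattern that matches somewhere in the haystack,
// rather than a single leftmost match. (Hypothetical test name; not part of
// the published suite.)
#[test]
fn regex_set_reports_all_matching_patterns() {
    let set =
        regex::bytes::RegexSet::new([r"\w+", r"\d+", r"[a-z]+"]).unwrap();
    let matched: Vec<usize> = set.matches(b"abc").iter().collect();
    // Patterns 0 and 2 match "abc"; pattern 1 requires a digit.
    assert_eq!(matched, vec![0, 2]);
}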
/// Converts the given regex test to a closure that searches with a
/// `bytes::RegexSet`. If the test configuration is unsupported, then a
/// `CompiledRegex` that skips the test is returned.
fn compiler(
    test: &RegexTest,
    _patterns: &[String],
) -> anyhow::Result<CompiledRegex> {
    let skip = Ok(CompiledRegex::skip());

    // The top-level RegexSet API only supports "overlapping" semantics.
    if !matches!(test.search_kind(), regex_test::SearchKind::Overlapping) {
        return skip;
    }
    // The top-level RegexSet API only supports "all" semantics.
    if !matches!(test.match_kind(), regex_test::MatchKind::All) {
        return skip;
    }
    // The top-level RegexSet API always runs unanchored searches.
    if test.anchored() {
        return skip;
    }
    // We don't support tests with explicit search bounds.
    let bounds = test.bounds();
    if !(bounds.start == 0 && bounds.end == test.haystack().len()) {
        return skip;
    }
    // The bytes::RegexSet API specifically does not support enabling UTF-8
    // mode. It could, I suppose, but currently it does not. That is, it
    // permits matches to have offsets that split codepoints.
    if test.utf8() {
        return skip;
    }
    // If the test requires Unicode but the Unicode feature isn't enabled,
    // skip it. This is a little aggressive, but the test suite doesn't
    // have any easy way of communicating which Unicode features are needed.
    if test.unicode() && !cfg!(feature = "unicode") {
        return skip;
    }
    let re = RegexSetBuilder::new(test.regexes())
        .case_insensitive(test.case_insensitive())
        .unicode(test.unicode())
        .line_terminator(test.line_terminator())
        .build()?;
    Ok(CompiledRegex::compiled(move |test| run_test(&re, test)))
}
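// The single-pattern `compiler` functions in this suite (for `bytes::Regex`
// above and `Regex` below) only accept leftmost-first tests. A standalone
// illustration of what leftmost-first means: the earliest alternative that
// matches at the leftmost position wins, even if a later alternative could
// match more text. (Hypothetical test name; not part of the published
// suite.)
#[test]
fn leftmost_first_prefers_earlier_alternatives() {
    let re = regex::Regex::new(r"sam|samwise").unwrap();
    // Both alternatives match at offset 0, so the first one listed wins.
    assert_eq!(re.find("samwise").unwrap().as_str(), "sam");
}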
regex-1.12.2/tests/suite_string.rs

use {
    anyhow::Result,
    regex::{Regex, RegexBuilder},
    regex_test::{
        CompiledRegex, Match, RegexTest, Span, TestResult, TestRunner,
    },
};

/// Tests the default configuration of `Regex`.
#[test]
fn default() -> Result<()> {
    let mut runner = TestRunner::new()?;
    runner
        .expand(&["is_match", "find", "captures"], |test| test.compiles())
        .blacklist_iter(super::BLACKLIST)
        .test_iter(crate::suite()?.iter(), compiler)
        .assert();
    Ok(())
}

fn run_test(re: &Regex, test: &RegexTest) -> TestResult {
    let hay = match std::str::from_utf8(test.haystack()) {
        Ok(hay) => hay,
        Err(err) => {
            return TestResult::fail(&format!(
                "haystack is not valid UTF-8: {err}"
            ));
        }
    };
    match test.additional_name() {
        "is_match" => TestResult::matched(re.is_match(hay)),
        "find" => TestResult::matches(
            re.find_iter(hay)
                .take(test.match_limit().unwrap_or(std::usize::MAX))
                .map(|m| Match {
                    id: 0,
                    span: Span { start: m.start(), end: m.end() },
                }),
        ),
        "captures" => {
            let it = re
                .captures_iter(hay)
                .take(test.match_limit().unwrap_or(std::usize::MAX))
                .map(|caps| testify_captures(&caps));
            TestResult::captures(it)
        }
        name => TestResult::fail(&format!("unrecognized test name: {name}")),
    }
}

/// Converts the given regex test to a closure that searches with a
/// `Regex`. If the test configuration is unsupported, then a
/// `CompiledRegex` that skips the test is returned.
fn compiler(
    test: &RegexTest,
    _patterns: &[String],
) -> anyhow::Result<CompiledRegex> {
    let skip = Ok(CompiledRegex::skip());

    // We're only testing Regex here, which supports one pattern only.
    let pattern = match test.regexes().len() {
        1 => &test.regexes()[0],
        _ => return skip,
    };

    // We only test is_match, find_iter and captures_iter. All of those are
    // leftmost searches.
    if !matches!(test.search_kind(), regex_test::SearchKind::Leftmost) {
        return skip;
    }
    // The top-level single-pattern regex API always uses leftmost-first.
    if !matches!(test.match_kind(), regex_test::MatchKind::LeftmostFirst) {
        return skip;
    }
    // The top-level regex API always runs unanchored searches... but we can
    // handle tests that are anchored but have only one match.
    if test.anchored() && test.match_limit() != Some(1) {
        return skip;
    }
    // We don't support tests with explicit search bounds. We could probably
    // support this by using the 'find_at' (and such) APIs.
    let bounds = test.bounds();
    if !(bounds.start == 0 && bounds.end == test.haystack().len()) {
        return skip;
    }
    // The Regex API specifically does not support disabling UTF-8 mode
    // because it can only search &str, which is always valid UTF-8.
    if !test.utf8() {
        return skip;
    }
    // If the test requires Unicode but the Unicode feature isn't enabled,
    // skip it. This is a little aggressive, but the test suite doesn't
    // have any easy way of communicating which Unicode features are needed.
    if test.unicode() && !cfg!(feature = "unicode") {
        return skip;
    }
    let re = RegexBuilder::new(pattern)
        .case_insensitive(test.case_insensitive())
        .unicode(test.unicode())
        .line_terminator(test.line_terminator())
        .build()?;
    Ok(CompiledRegex::compiled(move |test| run_test(&re, test)))
}

/// Convert `Captures` into the test suite's capture values.
fn testify_captures(caps: &regex::Captures<'_>) -> regex_test::Captures {
    let spans = caps.iter().map(|group| {
        group.map(|m| regex_test::Span { start: m.start(), end: m.end() })
    });
    // This unwrap is OK because we assume our 'caps' represents a match, and
    // a match always gives a non-zero number of groups with the first group
    // being non-None.
    regex_test::Captures::new(0, spans).unwrap()
}
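// The `compiler` above skips tests with explicit search bounds, noting that
// `find_at` could support a non-zero start bound. A standalone illustration
// of `find_at`: it begins the search at the given offset while still
// treating the preceding text as context for look-around assertions. Note
// that it handles only a start bound; there is no end bound. (Hypothetical
// test name; not part of the published suite.)
#[test]
fn find_at_respects_start_bound() {
    let re = regex::Regex::new(r"\d+").unwrap();
    let hay = "12 34";
    // A plain `find` returns the leftmost match.
    assert_eq!(re.find(hay).unwrap().as_str(), "12");
    // `find_at` starts searching at offset 2, so it finds "34" instead.
    assert_eq!(re.find_at(hay, 2).unwrap().as_str(), "34");
}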
regex-1.12.2/tests/suite_string_set.rs

use {
    anyhow::Result,
    regex::{RegexSet, RegexSetBuilder},
    regex_test::{CompiledRegex, RegexTest, TestResult, TestRunner},
};

/// Tests the default configuration of `RegexSet`.
#[test]
fn default() -> Result<()> {
    let mut runner = TestRunner::new()?;
    runner
        .expand(&["is_match", "which"], |test| test.compiles())
        .blacklist_iter(super::BLACKLIST)
        .test_iter(crate::suite()?.iter(), compiler)
        .assert();
    Ok(())
}

fn run_test(re: &RegexSet, test: &RegexTest) -> TestResult {
    let hay = match std::str::from_utf8(test.haystack()) {
        Ok(hay) => hay,
        Err(err) => {
            return TestResult::fail(&format!(
                "haystack is not valid UTF-8: {err}"
            ));
        }
    };
    match test.additional_name() {
        "is_match" => TestResult::matched(re.is_match(hay)),
        "which" => TestResult::which(re.matches(hay).iter()),
        name => TestResult::fail(&format!("unrecognized test name: {name}")),
    }
}

/// Converts the given regex test to a closure that searches with a
/// `RegexSet`. If the test configuration is unsupported, then a
/// `CompiledRegex` that skips the test is returned.
fn compiler(
    test: &RegexTest,
    _patterns: &[String],
) -> anyhow::Result<CompiledRegex> {
    let skip = Ok(CompiledRegex::skip());

    // The top-level RegexSet API only supports "overlapping" semantics.
    if !matches!(test.search_kind(), regex_test::SearchKind::Overlapping) {
        return skip;
    }
    // The top-level RegexSet API only supports "all" semantics.
    if !matches!(test.match_kind(), regex_test::MatchKind::All) {
        return skip;
    }
    // The top-level RegexSet API always runs unanchored searches.
    if test.anchored() {
        return skip;
    }
    // We don't support tests with explicit search bounds.
    let bounds = test.bounds();
    if !(bounds.start == 0 && bounds.end == test.haystack().len()) {
        return skip;
    }
    // The RegexSet API specifically does not support disabling UTF-8 mode
    // because it can only search &str, which is always valid UTF-8.
    if !test.utf8() {
        return skip;
    }
    // If the test requires Unicode but the Unicode feature isn't enabled,
    // skip it. This is a little aggressive, but the test suite doesn't
    // have any easy way of communicating which Unicode features are needed.
    if test.unicode() && !cfg!(feature = "unicode") {
        return skip;
    }
    let re = RegexSetBuilder::new(test.regexes())
        .case_insensitive(test.case_insensitive())
        .unicode(test.unicode())
        .line_terminator(test.line_terminator())
        .build()?;
    Ok(CompiledRegex::compiled(move |test| run_test(&re, test)))
}
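// A final illustration tying the set suites to the single-pattern suites:
// `RegexSet` uses "all" semantics, reporting every pattern that matches
// somewhere in the haystack, whereas a single `Regex` with the same
// alternatives stops at the leftmost-first match (see the sketch after
// `suite_bytes_set.rs` above). (Hypothetical test name; not part of the
// published suite.)
#[test]
fn set_all_semantics_versus_leftmost_first() {
    let set = regex::RegexSet::new([r"sam", r"samwise"]).unwrap();
    // Both patterns match somewhere in "samwise", so both are reported,
    // unlike `sam|samwise` as a single Regex, which matches only "sam".
    let which: Vec<usize> = set.matches("samwise").iter().collect();
    assert_eq!(which, vec![0, 1]);
}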