Comments (8)
Yeah, the reason I wasn't immediately rejecting my idea because of the index issue is that searching for an empty needle in an empty haystack returns a starting offset. But I think the right way to look at that is that results from a substring search are not meant to be indexed, but rather, sliced. And of course &haystack[0..needle.len()] is valid when both haystack and needle are empty.
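To make the "sliced, not indexed" point concrete, here is a minimal sketch using std's str::find as a stand-in for the substring search. The helper name matched_slice is hypothetical and not part of bstr; it only assumes that the search returns a starting offset for an empty needle, as described above.

```rust
/// Hypothetical helper (not part of bstr): find `needle` in `haystack` and
/// return the matched region as a slice instead of a bare index.
fn matched_slice<'a>(haystack: &'a str, needle: &str) -> Option<&'a str> {
    haystack
        .find(needle)
        // Slicing start..start + len is valid even when the offset is 0
        // and both strings are empty, whereas indexing haystack[0] is not.
        .map(|start| &haystack[start..start + needle.len()])
}
```

With both inputs empty, the search reports offset 0 and the resulting slice is the empty string, so nothing out of bounds is ever touched.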
from bstr.
I think you might have convinced me. Eventually, empty classes in regexes will be allowed, and yeah, they are supposed to be treated as "fail" instructions: they can never match anything.
Hmm... I'm not sure I agree. The semantics of find_byteset are finding the position of a byte which is a member of a set. No bytes are members of the empty set, so returning position 0 for it would be incorrect, no?
That said, I'd agree with you for find_not_byteset.
I think I see "empty set" as equivalent to "empty string," and "empty set" matches at every position, just like "empty string" does.
Actually, I'm not sure I 100% agree on the second part, about find_not_byteset:
    use bstr::{B, ByteSlice};

    fn main() {
        let haystack = B("");
        println!("{:?}", haystack.find_not_byteset(""));
    }
This returns None, which I think is correct: it's probably always incorrect to return an index that isn't addressable in the slice, since byteset matches are always a span of at least one byte, even for find_not_byteset.
In regex terms, find_not_byteset("") is analogous to matching r"\p{Any}", and find_byteset("") is analogous to matching r"\P{Any}" (or something; the analogy doesn't 100% work, since it's matching bytes and not characters). It's not analogous to matching the empty needle.
That is to say, AFAICT I don't see any bugs in find_byteset and find_not_byteset current semantics for empty sets.
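As a rough model of the semantics being argued for here, the two searches can be written with plain std iterators. This is only a sketch: the function names are hypothetical and this is not how bstr implements them. It just encodes "first byte that is (or is not) a member of the set".

```rust
/// Hypothetical model of find_byteset: index of the first byte of
/// `haystack` that is a member of `set`.
fn find_byteset_model(haystack: &[u8], set: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| set.contains(b))
}

/// Hypothetical model of find_not_byteset: index of the first byte of
/// `haystack` that is NOT a member of `set`.
fn find_not_byteset_model(haystack: &[u8], set: &[u8]) -> Option<usize> {
    haystack.iter().position(|b| !set.contains(b))
}
```

Under this model a match always covers exactly one existing byte, so an empty haystack can never produce an index, and an empty set can never produce a find_byteset match.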
Also, I guess the analogy doesn't totally work because at least for Rust regex, \P{Any} tells me that "empty character classes are not allowed". Which is fair, I guess.
Anyway, the documentation of find_byteset says
Returns the index of the first occurrence of any of the bytes in the provided set.
Personally, I think this means that it shouldn't return:
- An index that isn't in-bounds of the string.
- An index for which, if you constructed (say) a BTreeSet containing all the bytes, set.contains(&bstr[index]) would return false.
And thus the current semantics are at least in line with the documentation.
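The two conditions above can be written down as a small check. This is a sketch only: result_is_consistent is a hypothetical helper, not something bstr provides, and it tests just the two listed properties (in-bounds and set membership), not that the index is the first such occurrence.

```rust
use std::collections::BTreeSet;

/// Hypothetical check: does `result` satisfy both conditions for a
/// find_byteset-style search of `haystack` for any byte in `set_bytes`?
fn result_is_consistent(haystack: &[u8], set_bytes: &[u8], result: Option<usize>) -> bool {
    // Build the set of bytes, as in the BTreeSet thought experiment.
    let set: BTreeSet<u8> = set_bytes.iter().copied().collect();
    match result {
        // Returning no match never violates either condition.
        None => true,
        // `get` yields None for an out-of-bounds index (first condition);
        // otherwise the byte there must be in the set (second condition).
        Some(index) => haystack.get(index).map_or(false, |b| set.contains(b)),
    }
}
```

For an empty set, every Some(index) fails the membership condition, so None is the only answer consistent with the documented behavior.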
Closing this as invalid, since I was very wrong. I am updating the docs to make sure these cases are more explicitly documented.
I've added these two lines to the find_byteset example:
/// // The empty byteset never matches.
/// assert_eq!(None, b"abc".find_byteset(b""));
/// assert_eq!(None, b"".find_byteset(b""));
I've also added these lines to the find_not_byteset example:
/// // The negation of the empty byteset matches everything.
/// assert_eq!(Some(0), b"abc".find_not_byteset(b""));
/// // But an empty string never contains anything.
/// assert_eq!(None, b"".find_not_byteset(b""));