
Comments (6)

emanuele6 commented on June 7, 2024

I don't understand what --verify scheme is supposed to do.

If sample was meant to be the host name, you should have used -s 'host=sample' or -s "host=$sample", and you would have gotten:

$ trurl -s 'host=sample'
trurl error: not enough input for a URL
trurl error: Try trurl -h for help

With trurl sample you are specifying sample as a URL, and libcurl will parse it as a URL.

Can you specify an actual use case for this to clarify what you want it to do? Why did you think this would be useful?


In any case, in my opinion, we should definitely not implement wacky optional option arguments like those. If you want optional arguments, use something like --verify=foo to specify the optional argument or just use another option.

A --verify that may or may not interpret the next argument as an option argument, or that must be specified last if you don't want to pass the optional argument (making it impossible to use with --), is just a bad, script-unfriendly option style.
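
To illustrate the --verify=foo style suggested above, here is a minimal shell sketch of an argument loop in which the optional value is only ever accepted when attached with `=`. The function name `parse_args` and the variables `verify`, `verify_arg` and `urls` are made up for this example; it is not how trurl parses its command line.

```shell
#!/bin/sh
# Hypothetical sketch: an optional option argument that is only accepted
# in the attached form --verify=VALUE, never as a separate word.
parse_args() {
    verify=0        # 1 if --verify was given
    verify_arg=     # value attached with '=', empty if none
    urls=           # remaining operands, space separated
    while [ "$#" -gt 0 ]; do
        case $1 in
            --verify)   verify=1 ;;
            --verify=*) verify=1; verify_arg=${1#--verify=} ;;
            --)         shift; break ;;
            *)          urls="${urls:+$urls }$1" ;;
        esac
        shift
    done
    # Everything after -- is an operand, even if it starts with a dash.
    for operand in "$@"; do
        urls="${urls:+$urls }$operand"
    done
}

parse_args --verify sample
echo "verify=$verify arg=$verify_arg urls=$urls"

parse_args --verify=https -- --verify
echo "verify=$verify arg=$verify_arg urls=$urls"
```

With this style, `--verify sample` always treats `sample` as a URL operand, and `--` keeps working because the parser never has to guess whether the next word belongs to the option.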

from trurl.

jacobmealey commented on June 7, 2024

Sorry for the confusion, scheme here would be the preceding http://, and sample is a generic URL. The use case I'm thinking of is a script for extracting all the "valid" URLs in a large text file. trurl has a lot of the mechanics in place for something like this, but it would be kind of roundabout, so it would be nice to have something that can verify the string is a URL that meets some requirements (like having a preceding HTTP or something). It might make more sense to do it with something similar to --get {raw:port}. Let me know if you have any more questions, I feel like I am missing some keywords in my explanation.


emanuele6 commented on June 7, 2024

> Sorry for the confusion, scheme here would be the preceding http://, and sample is a generic URL.

Can you provide an example command line of what you mean with http:// as scheme?
You mean like trurl -u "$url" --verify=http://?

> The use case I'm thinking of is a script for extracting all the "valid" URLs in a large text file.

This doesn't really help me understand what you mean.

trurl already discards unparsable URLs in input files:

$ cat foo.txt
https://example.org
file:///home/emanuele6/./foo/..
\\
foo
:/lol/
ftp://curl.se
$ trurl -f foo.txt
https://example.org/
file:///home/emanuele6/
trurl note: Bad hostname [\\]
http://foo/
trurl note: Port number was not a decimal number between 0 and 65535 [:/lol/]
ftp://curl.se/
$ trurl -f foo.txt 2>/dev/null
https://example.org/
file:///home/emanuele6/
http://foo/
ftp://curl.se/

If you are asking for an option that would count URLs without a scheme as invalid, I think it is fine to add an option like --strict/--no-guess-scheme that makes trurl call curl_url_set() without CURLU_GUESS_SCHEME to achieve that. It would be a great addition.
We could also add a --no-credentials one that makes it also use CURLU_DISALLOW_USER.
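
As a rough illustration of what "count URLs without a scheme as invalid" means, the sketch below pre-filters input lines in plain shell and keeps only those that start with an explicit `scheme://`. The function name `has_explicit_scheme` is hypothetical, and the check only covers the RFC 3986 scheme syntax followed by `://`. It does not call libcurl and is not a substitute for dropping CURLU_GUESS_SCHEME inside trurl.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument starts with "scheme://",
# where scheme is a letter followed by letters, digits, '+', '-' or '.'.
has_explicit_scheme() {
    case $1 in
        *://*) ;;               # must contain "://" somewhere
        *) return 1 ;;
    esac
    scheme=${1%%://*}           # text before the first "://"
    case $scheme in
        ''|[!A-Za-z]*|*[!A-Za-z0-9+.-]*) return 1 ;;
    esac
    return 0
}

# Keep only the lines of the example input that carry an explicit scheme.
printf '%s\n' 'https://example.org' 'foo' ':/lol/' 'ftp://curl.se' |
while IFS= read -r line; do
    if has_explicit_scheme "$line"; then
        printf '%s\n' "$line"
    fi
done
```

Here `foo` is dropped rather than being turned into http://foo/, which is the difference a no-guess option would make for that input.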

I don't really see the connection with --verify though, --verify does something completely different: it makes trurl abort and return non-zero at the first invalid URL; e.g. from the example above trurl --verify -f foo.txt 2>/dev/null will output:

https://example.org/
file:///home/emanuele6/

trurl --verify, instead of ignoring \\ and :/lol/, and printing a warning for them, aborts as soon as the first invalid URL (\\ in this case) is encountered, so http://foo/ and ftp://curl.se/ never get printed.

> trurl has a lot of the mechanics in place for something like this, but it would be kind of roundabout, so it would be nice to have something that can verify the string is a URL that meets some requirements (like having a preceding HTTP or something).

Isn't that the same as the --filter proposed in #159?
trurl only outputs valid URLs, so for the specific case of checking the scheme you can simply use grep; e.g. to keep only https:// URLs: trurl -f urls.txt 2>/dev/null | grep '^https://'.
And if you need something more complex, like only URLs with a username specified, you can use JSON output and a tool like jq:

# only URLs with an embedded username
trurl --json -f urls.txt 2>/dev/null | jq -r '.[] | select(.user).url'
# only URLs with an embedded username that is in the users array
users=( emanuele6 jacobmealey )
trurl --json -f urls.txt 2>/dev/null | jq -r --args '.[] | select(.user | IN($ARGS.positional[])).url' -- "${users[@]}"
# or, streaming solution:
trurl --json -f urls.txt 2>/dev/null |
jq -nr --stream 'fromstream(1 | truncate_stream(inputs)) | select(.user | IN("tom", "bob")).url'

> It might make more sense to do it with something similar to --get {raw:port}

How? -g '{if:scheme=https:{url}:{nonewline}}'? xD
Maybe I am not understanding what you are proposing to add again.


jacobmealey commented on June 7, 2024

> If you are asking for an option that would count URLs without a scheme invalid, I think it is fine to add an option like --strict/--no-guess-scheme that makes trurl call curl_url_set() without CURLU_GUESS_SCHEME to achieve that. Adding more options to configure the CURLU_* flags passed to curl_url_set() would definitely be a nice addition!

this is exactly what I mean, yes! perhaps I should have omitted --verify in my original proposal. THANK YOU


emanuele6 commented on June 7, 2024

The only meaningful options to configure the flags passed to curl_url_set() that are not already implemented are:

  • one to not pass CURLU_GUESS_SCHEME 6909cee
  • one to pass CURLU_DISALLOW_USER, disallows embedded user:password@ in URLs
  • one to pass CURLU_PATH_AS_IS, skips path normalisation: https://foo/a/b/../.././c/d/e/foo/../.. remains unchanged instead of becoming https://foo/c/d/.
  • one to pass CURLU_DEFAULT_SCHEME: URLs without a scheme are assumed to be https://. Note that this is different from CURLU_GUESS_SCHEME, which assumes http:// in general, ftp:// if the host name starts with ftp., imap:// if the host name starts with imap., etc.

The one to not pass CURLU_GUESS_SCHEME can definitely be very helpful. I am not too sure about the usefulness of the other ones.
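
To make the difference between guessing and a fixed default concrete, here is a rough shell model of the guessing behaviour described in the list above. The function name `guess_scheme` is hypothetical, and only the two prefixes named there (ftp. and imap.) are covered; the cases behind the "etc." are deliberately left out, so this is a sketch rather than a description of libcurl's full rules.

```shell
#!/bin/sh
# Rough model of the guessing described above; only the two prefixes named
# in the comment are covered, the "etc." cases are deliberately left out.
guess_scheme() {
    case $1 in
        ftp.*)  echo ftp ;;
        imap.*) echo imap ;;
        *)      echo http ;;
    esac
}

for host in ftp.example.org imap.example.org example.org; do
    printf '%s://%s/\n' "$(guess_scheme "$host")" "$host"
done
```

A fixed default, by contrast, would return the same scheme for all three host names.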


jacobmealey commented on June 7, 2024

I think this was fixed in #195 so I'm closing it.
