
Comments (3)

emanuele6 commented on May 28, 2024

I was wondering if trurl allows to remove the scheme (to dedup them later)

Oh, duh. Sorry, your example also had URLs that were identical except for the scheme, so I don't know how I missed that. :p

Still, I don't understand why you are trying to only remove the scheme.

In that case, you can simply set the scheme to the desired value e.g. http:// and then pipe to sort -u or awk '!seen[$0]++', no?

$ trurl -f - -s 'scheme=http' < ./test | sort -u
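The two deduplication filters mentioned above differ in one respect: sort -u reorders the lines, while awk '!seen[$0]++' keeps the first occurrence of each line in input order. A minimal sketch, assuming a made-up list of URLs that already share one scheme (the sample URLs and the function names sample, dedup_sorted and dedup_ordered are illustrative and not part of trurl):

```shell
#!/bin/sh
# Hypothetical sample: URLs already normalized to a single scheme.
sample() {
    printf '%s\n' \
        'http://b.example.com/x' \
        'http://a.example.com/y' \
        'http://b.example.com/x'
}

# Deduplicate and sort the lines.
dedup_sorted() { sort -u; }

# Deduplicate while keeping the first occurrence in input order.
dedup_ordered() { awk '!seen[$0]++'; }

sample | dedup_sorted
sample | dedup_ordered
```

Which one fits depends on whether the original order of the list matters for the later steps.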

If you want to do something more complex like discarding non-http/https URLs, and keeping https:// if both http:// and https:// are specified, you can use jq:

$ trurl --json -f - < ./test | jq -r 'group_by(del(.url, .scheme, .raw_port))[] | first(("https", "http") as $s | .[] | select(.scheme == $s).url)'
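For readers without jq, the same preference (drop non-http/https URLs, keep https:// when both schemes appear) can be approximated on plain one-URL-per-line output. This is a rough sketch and not an equivalent of the jq filter: it compares the text after "://" instead of parsed URL components, and the function name prefer_https and the sample URLs are made up for illustration.

```shell
#!/bin/sh
# prefer_https: read one URL per line on stdin and print one URL per
# distinct "rest" (everything after "://"), preferring https over http.
# Lines with any other scheme, or without "://", are dropped.
# Output follows the order in which each "rest" was first seen.
prefer_https() {
    awk '
        {
            i = index($0, "://")
            if (i == 0) next
            scheme = substr($0, 1, i - 1)
            if (scheme != "http" && scheme != "https") next
            rest = substr($0, i + 3)
            if (!(rest in best)) {
                order[++n] = rest
                best[rest] = scheme
            } else if (scheme == "https") {
                best[rest] = "https"
            }
        }
        END {
            for (k = 1; k <= n; k++) print best[order[k]] "://" order[k]
        }
    '
}

printf '%s\n' \
    'http://a.example.com/x' \
    'ftp://c.example.org/f' \
    'https://a.example.com/x' \
    'http://b.example.com/' |
    prefer_https
```

Because the comparison is purely textual, it only behaves sensibly on URLs that have already been normalized.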

from trurl.

emanuele6 commented on May 28, 2024

Unless you use -g or --json, trurl only outputs valid URLs, one per line of output.
--set, --redirect, --trim, --append, --iterate, and --sort-query only modify the URL in a way that keeps it valid and re-parsable by libcurl (with the current flags: --accept-space, --no-guess-scheme, etc.).

You cannot use a --trim command that outputs something without a scheme, because that is not a valid URL.

If your goal is actually to print only the {host} and {path} parts of the URL, you can use -g '{:host}{:path}':

$ cat test
http://a.example.com/test/foo/./bar/..
xyz.example.org
https://b.example.com:20/test?hi#hello
ftp://[email protected]/hey.txt
$ trurl -f - < ./test
http://a.example.com/test/foo
http://xyz.example.org/
https://b.example.com:20/test?hi#hello
ftp://[email protected]/hey.txt
$ trurl -f - -g '{:host}{:path}' < ./test
a.example.com/test/foo
xyz.example.org/
b.example.com:20/test
c.example.org/hey.txt

You may also use {:host}{:path}{:query}{:fragment}, since {query} and {fragment} expand with ?/# at the start. It gets tricky if you also want to include other parts like {user} and {pass}: with -g '{:user}:{:pass}@{:host}{:path}', trurl would output :@a.example.org/foo for http://a.example.org/foo, which is probably not what you want.

Maybe the -g command could be improved to allow printing a full URL with some parts omitted, to satisfy your use case, but I don't know how that would be useful. Can you explain why you are doing this?

Anyway, as a workaround, if you really want to remove the scheme and nothing else from a full URL for some reason, I guess you can use something like this:

$ trurl -f - < ./test | sed -n 's@^[^:]*://@@p'
a.example.com/test/foo
xyz.example.org/
b.example.com:20/test?hi#hello
[email protected]/hey.txt
$ # or to only print http/https URLs, without the scheme
$ trurl -f - < ./test | sed -n 's@^https\{0,1\}://@@p'
a.example.com/test/foo
xyz.example.org/
b.example.com:20/test?hi#hello
$ # notice that trurl guessed the scheme for xyz.example.org as http://
$ # so it is printed.

This should be fine: trurl only outputs lines that contain one full valid URL and discards invalid URLs in the input, so you can assume that the scheme will not contain colons. Removing everything up to the first ":" together with the "://" that starts there will therefore only remove the scheme.
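The reasoning above can be checked without trurl by feeding sed a few lines shaped like its output. A small sketch; the function name strip_scheme and the sample lines are illustrative, and the line without "://" stands in for input that is not a full URL:

```shell
#!/bin/sh
# strip_scheme: drop a leading "scheme://" from each line.
# -n plus the p flag prints only lines where the substitution happened,
# so lines that do not start with "<colon-free text>://" are dropped.
strip_scheme() { sed -n 's@^[^:]*://@@p'; }

printf '%s\n' \
    'http://a.example.com/test/foo' \
    'https://b.example.com:20/test?hi#hello' \
    'http://x.example/?u=http://y.example/' \
    'not a url' |
    strip_scheme
```

Since the pattern is anchored with ^ and [^:]* cannot cross a colon, later colons such as a port separator or a "://" inside the query are left alone.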


bagder commented on May 28, 2024

I'm with @emanuele6. You can do this already with a few very simple workarounds: either decide to use -g and output all parts except the scheme, or just set a fixed scheme before you compare. I think "trurl only outputs valid URLs" is a good idea to stick to.

