Comments (12)

hzrd149 commented on June 12, 2024

Man, it seems to be specifically news sites that are the worst offenders when it comes to breaking the URL spec.
The fix should be out in the alpha version.

from nostrudel.

psic4t commented on June 12, 2024

I'm with you. Let's just add _ and , to your last version:

https://regex101.com/r/GOWi8J/4

This should fix most issues.


hzrd149 commented on June 12, 2024

This is something that's been on my todo list for a while. @ also breaks links. The issue is how I have the link RegExp written.
To fix this I need to find a better link RegExp that supports non-English characters and other symbols.


psic4t commented on June 12, 2024

Can you post this regex (or code location)? Maybe I can find something.


hzrd149 commented on June 12, 2024

The regexp is located here: https://github.com/hzrd149/nostrudel/blob/master/src/helpers/embeds.ts#L56

It needs to be simplified and Unicode support added. However, it also needs to avoid false positives as much as possible.
An example of a false positive would be http://sub.example.verylongingaliddomain/index.html or https://example.com/???test=0
Or even two URLs back to back without a space: http://example.comhttp://example.com

I haven't figured out a good regexp yet, but I know one has to exist. If not, how would other social media sites auto-detect links?


psic4t commented on June 12, 2024

Thanks, I'll try some stuff over the next few days.


psic4t commented on June 12, 2024

So, I tried some stuff - best I came up with is this:

https?:\/\/([\w \.-]+\.\w+)(\S*)

Check it here: https://regex101.com/r/J4hHHn/1

http://sub.example.verylongingaliddomain/index.html can be valid as there are TLDs like ".cancerresearch" which is already 15 chars long.

https://example.com/???test=0 seems to be a special case. You can prevent it but I guess it's not worth the effort.

http://example.comhttp://example.com must be valid because of this: https://example.com?host=https://example.com
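
To make the three cases above concrete, here is a minimal sketch that runs the proposed pattern against them. It assumes the space inside the first character class, as the pattern is quoted above, is a transcription artifact and leaves it out. `findLinks` is a hypothetical helper name, not something from nostrudel.

```typescript
// Pattern from the comment above, minus the space inside the first
// character class (assumed to be a transcription artifact).
const candidateLink = /https?:\/\/([\w.-]+\.\w+)(\S*)/g;

// Hypothetical helper: returns every substring the pattern treats as a link.
function findLinks(text: string): string[] {
  return text.match(candidateLink) ?? [];
}

// A long TLD is accepted, since \w+ puts no limit on its length.
console.log(findLinks("http://sub.example.verylongingaliddomain/index.html"));
// -> [ 'http://sub.example.verylongingaliddomain/index.html' ]

// The "???" query is swallowed by \S*.
console.log(findLinks("https://example.com/???test=0"));
// -> [ 'https://example.com/???test=0' ]

// Two URLs back to back come out as ONE match, not two.
console.log(findLinks("http://example.comhttp://example.com"));
// -> [ 'http://example.comhttp://example.com' ]

// \S* also takes trailing punctuation such as a closing bracket.
console.log(findLinks("(https://example.com)"));
// -> [ 'https://example.com)' ]
```

The last case is not one of the three discussed above; it is included only to show how far `\S*` reaches.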

What do you think?


hzrd149 commented on June 12, 2024

That's a good start, although the use of \S (not whitespace) picks up characters like ) or , after the URL, which are pretty common when the URL is put in brackets.
Also, \w includes _, which technically is invalid for domains.

I replaced \w with a-zA-Z0-9 so it would not include _, and \S with \p{Letter}\p{Number}, which should include any Unicode letter or number character (not just English): https://www.regular-expressions.info/unicode.html#category
I also added a lot more examples of URLs and false positives.

https://regex101.com/r/GOWi8J/1
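
As a rough illustration of the substitution described above (not the actual pattern behind the regex101 link), a sketch in TypeScript might look like the following. The punctuation allowed in the path class is an assumption, and `findUnicodeLinks` is a hypothetical helper name.

```typescript
// Host: ASCII letters, digits, dots and hyphens only, so "_" is excluded.
// Path: any Unicode letter or number, plus an ASSUMED set of punctuation.
const unicodeLinkSource =
  "https?://([a-zA-Z0-9.-]+\\.[a-zA-Z]+)([\\p{Letter}\\p{Number}/&.=?#%+-]*)";

// The "u" flag is required for \p{...} property escapes in JavaScript.
const unicodeLink = new RegExp(unicodeLinkSource, "gu");

// Hypothetical helper: returns every substring the pattern treats as a link.
function findUnicodeLinks(text: string): string[] {
  return text.match(unicodeLink) ?? [];
}

// "\u00fc" is the letter u with diaeresis, a non-English character in the path.
const unicodeSample =
  "see (https://example.com/path) and https://example.com/\u00fcber, ok";

console.log(findUnicodeLinks(unicodeSample));
// -> [ 'https://example.com/path', 'https://example.com/\u00fcber' ] (second entry printed with the actual letter)
```

Because `)` and `,` are not in the path class, the bracket and the comma after the URLs are left out of the matches.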

What do you think? Can you think of any other strange URL formats that might need to be considered?


psic4t commented on June 12, 2024

I think you're right on the domain part with \w, but we need \S* for the path, because it's perfectly fine to use ();,._[] etc. in URL params.

Please check https://regex101.com/r/GOWi8J/2 - I added just two real links, which aren't working.

When I put back the \S part, all links work, but markdown or comma-separated links won't. -> https://regex101.com/r/GOWi8J/3

I'll try to find a way to make at least markdown work (but it's not supported in Nostr anyway, right?).


hzrd149 commented on June 12, 2024

Fixing both links is pretty easy, I just needed to add _ and , to the list of accepted characters. Although it will break the , separated URLs (but that's fine because GitHub does not even support that).

I'm not sure about using ();,._[] in URL params though. I know they can be used, but I believe they have to be escaped. Either way, I've seen more markdown and links surrounded by () than I've seen those characters used in URL params.
I'm hesitant to use \S because it covers too much, and I think it would be better for a few links to be broken than to have it select some of the text after the link.

Test
https://example.com,https://example.com
https://example.com)
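
The trade-off in the two test lines above can be sketched like this. The pattern is an assumption made for illustration (an explicit path character list with `_` and `,` added), not the regex that shipped, and `findWidenedLinks` is a hypothetical helper name.

```typescript
// ASSUMED path character list with "_" and "," added; ")" and ":" stay out.
const widenedLinkSource =
  "https?://([a-zA-Z0-9.-]+\\.[a-zA-Z]+)([\\p{Letter}\\p{Number}/&.=?#%+_,-]*)";
const widenedLink = new RegExp(widenedLinkSource, "gu");

// Hypothetical helper: returns every substring the pattern treats as a link.
function findWidenedLinks(text: string): string[] {
  return text.match(widenedLink) ?? [];
}

// Comma-separated URLs break: the comma now belongs to the first path,
// and the match runs on until the ":" of the second URL.
console.log(findWidenedLinks("https://example.com,https://example.com"));
// -> [ 'https://example.com,https' ]

// A closing bracket after the link is still left out.
console.log(findWidenedLinks("https://example.com)"));
// -> [ 'https://example.com' ]
```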


hzrd149 commented on June 12, 2024

Forgot to close this issue, but the fix for this was released a few days ago.


psic4t commented on June 12, 2024

I have to reopen this. Several news sites use tildes in image links. Can we include "~" in the regex?

Sample:
https://nostrudel.ninja/#/n/note1fquun6a9hjcsv0lcd8fafx53zqepqwf5xm6arez7sdzxuscpz28sq79c2g
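
For illustration only, here is how a tilde cuts a link short when the path uses an explicit character list, and how adding `~` to that list changes the result. Both patterns and the image URL are hypothetical; the real regex lives in `src/helpers/embeds.ts` and may differ.

```typescript
// ASSUMED path character lists, identical except for "~".
const withoutTilde = new RegExp(
  "https?://[a-zA-Z0-9.-]+\\.[a-zA-Z]+[\\p{Letter}\\p{Number}/&.=?#%+_,-]*",
  "gu"
);
const withTilde = new RegExp(
  "https?://[a-zA-Z0-9.-]+\\.[a-zA-Z]+[\\p{Letter}\\p{Number}/&.=?#%+_,~-]*",
  "gu"
);

// Hypothetical image URL with a tilde in the path.
const tildeImageUrl = "https://news.example.com/img/~2024/photo.jpg";

// Without "~" the match stops right before the tilde.
console.log(tildeImageUrl.match(withoutTilde));
// -> [ 'https://news.example.com/img/' ]

// With "~" in the list the whole URL is matched.
console.log(tildeImageUrl.match(withTilde));
// -> [ 'https://news.example.com/img/~2024/photo.jpg' ]
```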
