Comments (5)
Is there a standard/well-known header or something for retrieving the canonical URL? Or will we have to special-case all of the things you've listed?
from camille.
There is an official way on the web to declare canonical URLs for a website. It consists of a tag of the form <link rel="canonical" href="https://example.com/sample-page/" /> in the HTML content of the page.
This official way is what search engines use to index pages using just their canonical URL, so this is pretty widespread and used by most websites.
What I don't know is if there's some nice service/API we could use (maybe a tool provided publicly by some of the most common search engines?) to which we could send an arbitrary URL, let the service get the content and extract the tag from it, and return the canonical URL to us.
Using such a service, if it exists, would be way better than making Camille parse the HTML herself. Loading the whole HTML content of a URL just to extract the canonical tag would cost a lot of time, bandwidth and credits for that one piece of info, while I'm pretty sure search engines cache that info for all the sites they index, and there's some chance they make it directly available to anyone via some API.
from camille.
My bet is that any such service would be slower than making this simple call ourselves. The HTML isn't that big, and the other service has to do the same work anyway, so fetching it directly would likely be faster.
The big reason to use services is that they normalize edge cases well, but this is a very straightforward XPath lookup.
from camille.
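For illustration, here is a minimal sketch of that lookup. It is written in Python rather than in whatever Camille itself uses, and it swaps the XPath query for the standard library's `html.parser` so that it needs no third-party dependency. The names `CanonicalLinkParser` and `extract_canonical` are made up for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated, case-insensitive token list.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def extract_canonical(html: str) -> Optional[str]:
    """Return the canonical href declared in the page, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

Calling `extract_canonical(page_html)` returns the declared href as written in the page (possibly a relative URL), or `None` when the page declares no canonical link.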
@tal the main benefit I was thinking about, if we were to use a service provided by a search engine, is that it would hopefully return a cached value. They would not need to do the request + parsing when we query them, because they already did that long step ages ago when they indexed the page in their own search engine databases.
Also, in practice there are indeed edge cases. From my quick browsing on the topic, rel='canonical' is the most common and official way to do it, but there are other ways, such as 301 redirects. So implementing it ourselves within Camille might be reinventing the wheel for all those cases, in addition to not taking advantage of the cache DBs of search engines, which already did the work while indexing.
from camille.
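To make the two cases named above concrete, here is a rough Python sketch that follows redirects first and then applies the canonical tag if the final page declares one. It is only an assumption about how the fallback order could look, not how Camille does or should do it. The function names, the 5-second timeout, the 512 KiB read cap and the User-Agent string are all invented for the example.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class _CanonicalParser(HTMLParser):
    """Captures the href of the first <link rel="canonical"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attributes = dict(attrs)
        if "canonical" in (attributes.get("rel") or "").lower().split():
            self.href = attributes.get("href") or None


def pick_canonical(final_url: str, html: str) -> str:
    """Prefer the declared canonical tag; otherwise keep the post-redirect URL."""
    parser = _CanonicalParser()
    parser.feed(html)
    if parser.href:
        # The href may be relative, so resolve it against the final URL.
        return urljoin(final_url, parser.href)
    return final_url


def resolve_canonical(url: str, timeout: float = 5.0) -> str:
    """Fetch the page (urlopen follows 301/302 redirects) and pick a canonical URL."""
    request = Request(url, headers={"User-Agent": "canonical-check-sketch"})
    with urlopen(request, timeout=timeout) as response:
        final_url = response.geturl()  # URL after any redirects were followed
        charset = response.headers.get_content_charset() or "utf-8"
        # Hypothetical cap: the tag lives in <head>, so skip very large bodies.
        html = response.read(512 * 1024).decode(charset, errors="replace")
    return pick_canonical(final_url, html)
```

Keeping `pick_canonical` separate from the network call means the decision logic can be exercised on saved HTML without touching the network. Other signals, such as a canonical declared in an HTTP header, are not covered here.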
I'm wary of relying on those services without much benefit, because you still have to handle all those conditions being returned by the service. You can't assume 100% uptime and good behavior.
But it's up to whoever implements to decide. Scraping the web is super easy and a lot faster than I think you're worried it'd be.
from camille.