opawg / user-agents
An open, platform-agnostic list of user-agent and referrer regexes for use in podcast analytics services
License: MIT License
We are looking at synchronizing our user agent list with this one and would find it very useful if there were a consistent unique identifier (GUID) associated with each record. Such an identifier would also be useful when an application name changes.
GUID: in this case, a 128-bit number used to identify the user agent record, written as 32 hexadecimal digits grouped as 8-4-4-4-12.
Example of the proposed addition to the first 2 records found in the json...
"guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",
[
  {
    "guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",
    "user_agents": [
      "^Acast.+[Aa]ndroid"
    ],
    "app": "Acast",
    "device": "phone",
    "os": "android"
  },
  {
    "guid": "476757ae-28b4-47ed-94dd-753cf4832cdb",
    "user_agents": [
      "^Acast.+iOS"
    ],
    "app": "Acast",
    "device": "phone",
    "os": "ios"
  },
  ...
]
(also attached)
This would allow a developer to pull this JSON file and compare each guid with what was saved, to decide whether a record should be updated or inserted as new. The GUID would tie together the app title and its regular expressions, so if "user_agents" or "app" changed, the application would still know which record to update.
As a result, every new user agent added would need to include a unique guid. Thoughts?
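A minimal sketch of how a consumer might use the proposed field, assuming each record carries a "guid" and the consumer keeps its saved records in a dictionary keyed by that value. The function name `sync_records` and the renamed app "Acast Player" are hypothetical, chosen only for illustration.

```python
def sync_records(stored: dict, incoming: list) -> dict:
    """Upsert incoming records into `stored`, keyed by the proposed "guid" field.

    Returns which guids were inserted, updated, or left unchanged.
    """
    summary = {"inserted": [], "updated": [], "unchanged": []}
    for record in incoming:
        guid = record["guid"]
        if guid not in stored:
            summary["inserted"].append(guid)
        elif stored[guid] != record:
            summary["updated"].append(guid)
        else:
            summary["unchanged"].append(guid)
        stored[guid] = record
    return summary


# Previously saved state: one record, under its old app name.
stored = {
    "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01": {
        "guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",
        "user_agents": ["^Acast.+[Aa]ndroid"],
        "app": "Acast",
        "device": "phone",
        "os": "android",
    }
}

# Freshly pulled list: the first record was renamed (hypothetical), the second is new.
incoming = [
    {
        "guid": "f93bfaff-f0ac-4e44-bb52-2ca0aafcbd01",
        "user_agents": ["^Acast.+[Aa]ndroid"],
        "app": "Acast Player",
        "device": "phone",
        "os": "android",
    },
    {
        "guid": "476757ae-28b4-47ed-94dd-753cf4832cdb",
        "user_agents": ["^Acast.+iOS"],
        "app": "Acast",
        "device": "phone",
        "os": "ios",
    },
]

summary = sync_records(stored, incoming)
```

Because the match is on the guid rather than on "app" or the regexes, a rename shows up as an update instead of a delete plus an insert.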
Thanks!
--Angelo
Heya - as of 0359556, it seems like we lost most/all of the use of \d+ to represent a number.
As an example, before that commit, one of the Apple Podcasts' useragent matchers was
"user_agents": [
"^Podcasts/.*\\d$",
"^Balados/.*\\d$",
"^Podcasti/.*\\d$",
"^Podcastit/.*\\d$",
"^Podcasturi/.*\\d$",
"^Podcasty/.*\\d$",
"^Podcast’ler/.*\\d$",
"^Podkaster/.*\\d$",
"^Podcaster/.*\\d$",
"^Podcastok/.*\\d$",
"^Подкасти/.*\\d$",
"^Подкасты/.*\\d$",
"^פודקאסטים/.*\\d$",
"^البودكاست/.*\\d$",
"^पॉडकास्ट/.*\\d$",
"^พ็อดคาสท์/.*\\d$",
"^%E6%92%AD%E5%AE%A2/.*\\d$",
"^播客/.*\\d$",
"^팟캐스트/.*\\d$"
],
the same matcher is now:
"user_agents": [
"^Podcasts\/.*d$",
"^Balados\/.*d$",
"^Podcasti\/.*d$",
"^Podcastit\/.*d$",
"^Podcasturi\/.*d$",
"^Podcasty\/.*d$",
"^Podcast\u2019ler\/.*d$",
"^Podkaster\/.*d$",
"^Podcaster\/.*d$",
"^Podcastok\/.*d$",
"^\u041f\u043e\u0434\u043a\u0430\u0441\u0442\u0438\/.*d$",
"^\u041f\u043e\u0434\u043a\u0430\u0441\u0442\u044b\/.*d$",
"^\u05e4\u05d5\u05d3\u05e7\u05d0\u05e1\u05d8\u05d9\u05dd\/.*d$",
"^\u0627\u0644\u0628\u0648\u062f\u0643\u0627\u0633\u062a\/.*d$",
"^\u092a\u0949\u0921\u0915\u093e\u0938\u094d\u091f\/.*d$",
"^\u0e1e\u0e47\u0e2d\u0e14\u0e04\u0e32\u0e2a\u0e17\u0e4c\/.*d$",
"^%E6%92%AD%E5%AE%A2\/.*d$",
"^\u64ad\u5ba2\/.*d$",
"^\ud31f\uce90\uc2a4\ud2b8\/.*d$"
],
e.g. "^Podcasts\/.*d$" only matches user agents ending with a "d", not a numeric.
I'd also suggest the \/ in there is a bit weird - it's harmless, but in JSON "/" and "\/" are equivalent, AFAIK.
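The regression can be reproduced with a short check. This sketch decodes the two JSON string forms quoted above and tests them against a sample user agent; the sample string is illustrative, not taken from the list.

```python
import json
import re

# The JSON string literals as they appear in the file, before and after the commit.
before = json.loads(r'"^Podcasts/.*\\d$"')  # decodes to the regex ^Podcasts/.*\d$
after = json.loads(r'"^Podcasts\/.*d$"')    # decodes to the regex ^Podcasts/.*d$

# Illustrative user agent ending in a digit (an assumption, for demonstration only).
sample_ua = "Podcasts/1410.53 CFNetwork/1111 Darwin/19.0.0"

matches_before = re.search(before, sample_ua) is not None
matches_after = re.search(after, sample_ua) is not None

# In JSON, "\/" is just an alternative spelling of "/".
slash_forms_equal = json.loads(r'"\/"') == json.loads('"/"')
```

Here `matches_before` is true and `matches_after` is false, since the second pattern requires a literal "d" at the end of the string.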
Some apps and services use the referrer HTTP header, and this can also be useful information in terms of knowing which service has been used to play an episode.
Here are some of the ones I'm seeing, with the associated useragent.
Referrer | Useragent |
---|---|
https://www.gstatic.com/narrative_cast_receiver/receiver.html?feature=1 | Mozilla/5.0%2520(X11;%2520Linux%2520armv7l)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/73.0.3683.47%2520Safari/537.36%2520CrKey/1.39.154941 |
https://breaker.audio | Breaker/iOS |
https://co.radiocut.fm/podcast-episode/how-is-google-podcasts-doing-and/?replay=1 | Mozilla/5.0%2520(Linux;%2520Android%25206.0.1;%2520Nexus%25205X%2520Build/MMB29P)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/41.0.2272.96%2520Mobile%2520Safari/537.36%2520(compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html) |
https://podcasts.apple.com/gb/podcast/podnews-podcasting-news/id1325018583 | Mozilla/5.0%2520(Windows%2520NT%25206.1;%2520Win64;%2520x64;%2520rv:67.0)%2520Gecko/20100101%2520Firefox/67.0 |
https://podcasts.google.com/ | Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/74.0.3729.169%2520Safari/537.36 |
https://player.fm/series/podnews-podcasting-news/how-many-podcasts-are-no-longer-being-updated | Mozilla/5.0%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko;%2520compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)%2520Safari/537.36 |
http://pca.st/w6GI | Mozilla/5.0%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko;%2520compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html)%2520Safari/537.36 |
https://ar.radiocut.fm/podcast-episode/audiobooks-pitted-against-movies/ | Mozilla/5.0%2520(Linux;%2520Android%25206.0.1;%2520Nexus%25205X%2520Build/MMB29P)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/41.0.2272.96%2520Mobile%2520Safari/537.36%2520(compatible;%2520Googlebot/2.1;%2520+http://www.google.com/bot.html) |
https://podknife.com/tags/fitness-nutrition | Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/74.0.3729.169%2520Safari/537.36 |
https://www.gstatic.com/cast/sdk/default_receiver/1.0/app.html?skin=https://chromecast.pocketcasts.com/receiver.css | Mozilla/5.0%2520(X11;%2520Linux%2520armv7l)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/73.0.3683.47%2520Safari/537.36%2520CrKey/1.39.154941 |
https://tunein.com/radio/podnews-p1088271/?topicId=129761856 | Mozilla/5.0%2520(Windows%2520NT%252010.0;%2520Win64;%2520x64)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/75.0.3770.80%2520Safari/537.36 |
In the above, you'll spot plays from Google Podcasts (web), Apple Podcasts (web) and Player FM (web). You can also see someone using Pocket Casts to listen via their Chromecast, someone using TuneIn on the web, and a few others.
Some can be caught with the useragent (Breaker being an obvious example); some can be caught with a referrer alone (like Apple Podcasts web); and some can be caught by both (the Pocket Casts example).
My suggestion might be to add a "referrer" regex, to be used alongside the "useragent" regex, to allow us to correctly attribute these plays to an actual aggregator, rather than lazily attributing them to a browser.
And, yes, we should be catching the Googlebots here and marking them as a "bot".
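A rough sketch of how a "referrer" regex could sit alongside the "useragent" regex. The "referrers" key, the rule contents, and the `identify` function are all assumptions made for illustration; none of them exist in the current list. User agents are shown with spaces decoded.

```python
import re

# Hypothetical rules: "referrers" is the proposed addition, not an existing field.
RULES = [
    {"app": "Breaker", "user_agents": ["^Breaker/"]},
    {"app": "Apple Podcasts (web)", "referrers": [r"^https://podcasts\.apple\.com/"]},
    {"app": "Google Podcasts (web)", "referrers": [r"^https://podcasts\.google\.com/"]},
]


def identify(user_agent: str, referrer: str = ""):
    """Return the first app whose user-agent and referrer patterns all apply.

    A rule with no patterns of one kind places no constraint on that header.
    """
    for rule in RULES:
        ua_patterns = rule.get("user_agents", [])
        ref_patterns = rule.get("referrers", [])
        ua_ok = not ua_patterns or any(re.search(p, user_agent) for p in ua_patterns)
        ref_ok = not ref_patterns or any(re.search(p, referrer) for p in ref_patterns)
        if ua_ok and ref_ok:
            return rule["app"]
    return None


firefox_ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0"

by_user_agent = identify("Breaker/iOS", "https://breaker.audio")
by_referrer = identify(
    firefox_ua,
    "https://podcasts.apple.com/gb/podcast/podnews-podcasting-news/id1325018583",
)
plain_browser = identify(firefox_ua)
```

With these rules, the Firefox request carrying an Apple Podcasts referrer is attributed to the web player, while the same user agent with no referrer falls through to `None` and could then be handled by the ordinary browser rules.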
It is confusing as to why backslashes \ are to be escaped. For example, the regexp ^AppleCoreMedia/1\\..*iPod doesn't match AppleCoreMedia/1.0.0.16G114 (iPod touch; U; CPU OS 12_4_2 like Mac OS X; en_us) (this is a user-agent/example pair from the provided json).
The requirement of escaping backslashes forces the user to de-escape them before testing the regexp. Is this the desired usage?
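For context, the doubled backslash is JSON string escaping, and a JSON parser removes it when the file is loaded. The sketch below, which assumes the list is read with a standard JSON parser rather than copied by hand, shows the decoded pattern matching the example while the raw file text does not.

```python
import json
import re

# A fragment of the file exactly as it is written on disk.
json_text = r'{"user_agents": ["^AppleCoreMedia/1\\..*iPod"]}'

# After parsing, the pattern contains a single backslash: ^AppleCoreMedia/1\..*iPod
decoded_pattern = json.loads(json_text)["user_agents"][0]

# The same characters copied straight from the file, without JSON decoding.
raw_pattern = r"^AppleCoreMedia/1\\..*iPod"

ua = "AppleCoreMedia/1.0.0.16G114 (iPod touch; U; CPU OS 12_4_2 like Mac OS X; en_us)"

decoded_matches = re.search(decoded_pattern, ua) is not None
raw_matches = re.search(raw_pattern, ua) is not None
```

The raw text fails because `\\` in a regex means a literal backslash, which the user agent does not contain.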
Seems like maybe we should be working together?
https://github.com/PRX/prx-podagent
We are working on implementing IAB's guidance to filter out downloads from Apple Podcasts app watchOS user agent documented here: https://iabtechlab.com/blog/apple-watch-os-podcast-filtering-guidance
We were hoping to match something like `app == "Apple Podcasts" && os == "watchos"`. As of commit 4ac49c5, though, it seems this won't be possible, and there does not appear to be a definitive way to satisfy the IAB requirement using the OPAWG list. The list allows identifying watchOS but not the Apple Podcasts app specifically.
I'm wondering if there is a more granular regex that would work or do you think this is beyond what is possible with user agents alone, at least at present?
In order for this data to be user-friendly, I'd like to suggest some additional fields for the JSON, to be available to be used in user dashboards.
My suggestions are (all optional):
A suggested example is in the screenshot below.
I might note that "app:" in the current specification is, presumably, intended to be user-facing.
I think this is invalid, quantifier should not be at the beginning of the regex.
user-agents/src/user-agents.json
Line 636 in dcd5d78
I noticed that Apple Podcasts on macOS Monterey uses the following user agent:
AppleCoreMedia/1.0.0.21G83 (Macintosh; U; Intel Mac OS X 12_5_1; en_gb)
The list states that AppleCoreMedia should not be treated as Apple Podcasts, but in this case it should. I am not sure at which point Apple changed the UA, but it means that platform detection does not work as expected.
Thanks!
Firstly, thank you for the fine work!
What do you think of adding the $schema key to user-agents.json? Its value could reference a local schema file or, better yet, point to a JSON Schema Store URL, in the style of https://json.schemastore.org/github-workflow-template-properties.json (for more information, see schema repositories).
VSCode honors the $schema keyword, and many schema validators could be used.
I could open a pull request with the change, but I would like the maintainers' opinion on whether the schema should be published to Schema Store and referenced by a global URL, or referenced locally.
From my error logs, there's one non-valid regex pattern in the current JSON. I'm not clever enough to quite work out where. I'll continue that work, but just flagging this as an error.
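One way to locate such a pattern is to try compiling every regex and report the failures. This is a minimal sketch using Python's `re` module, so it only flags patterns that Python rejects; other regex engines may accept or reject different ones. The two sample records are made up for illustration.

```python
import re


def find_invalid_patterns(records):
    """Return (record index, pattern, error message) for each regex that fails to compile."""
    problems = []
    for index, record in enumerate(records):
        for pattern in record.get("user_agents", []):
            try:
                re.compile(pattern)
            except re.error as exc:
                problems.append((index, pattern, str(exc)))
    return problems


# Hypothetical records; in practice these would come from json.load() on the list.
sample_records = [
    {"app": "Example A", "user_agents": ["^ExampleA/\\d+"]},
    {"app": "Example B", "user_agents": ["*ExampleB"]},  # leading quantifier
]

problems = find_invalid_patterns(sample_records)
```

A pattern that starts with a quantifier, such as the second one, is reported because there is nothing for the `*` to repeat.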
Hello,
It appears that the Spotify bot regex, ^Spotify/\\d+, is not restrictive enough and may match other Spotify user agents that are not related to bots (see here a wide list).
Can we update the regex to match the Spotify/1.0 user agent exactly, i.e. ^Spotify/\\d+\\.\\d+$ ?
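The difference between the two patterns can be checked directly. In this sketch the patterns are written as Python raw strings (so single backslashes), and the app-style user agent is an illustrative assumption rather than a string from the list.

```python
import re

loose = re.compile(r"^Spotify/\d+")          # current bot pattern, after JSON decoding
strict = re.compile(r"^Spotify/\d+\.\d+$")   # proposed pattern, anchored at both ends

bot_ua = "Spotify/1.0"
app_ua = "Spotify/8.6.72 Android/29 (SM-G960F)"  # illustrative, assumed format

loose_results = (bool(loose.search(bot_ua)), bool(loose.search(app_ua)))
strict_results = (bool(strict.search(bot_ua)), bool(strict.search(app_ua)))
```

The loose pattern matches both strings, while the anchored one matches only the two-part version string.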
Any interest in adding more tests to this project?
I hacked together jdelStrother@995af44 that just checks that all the examples listed in the json match one of the user_agents regexes.
(half-a-dozen or so seem to have bad example UAs: https://github.com/jdelStrother/user-agents/actions/runs/3439700273/jobs/5737297616).
I didn't bother adding an actual test framework (eg jest) because I struggled to think of other tests I wanted to add, but I could add one if we thought it might expand in usage.
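The check described above can be sketched in a few lines. This is not the code from the linked commit, only an illustration of the idea in Python; the second sample record is hypothetical and deliberately has an example that its regex cannot match.

```python
import re


def unmatched_examples(records):
    """Return (app, example) pairs where no user_agents regex matches the example."""
    failures = []
    for record in records:
        patterns = [re.compile(p) for p in record.get("user_agents", [])]
        for example in record.get("examples", []):
            if not any(p.search(example) for p in patterns):
                failures.append((record.get("app"), example))
    return failures


sample_records = [
    {
        "user_agents": ["^doubleTwist CloudPlayer"],
        "examples": ["doubleTwist CloudPlayer"],
        "app": "doubleTwist CloudPlayer",
    },
    {
        # Hypothetical record with a bad example, for demonstration only.
        "user_agents": ["^ExamplePlayer/\\d+"],
        "examples": ["ExamplePlayer (no version)"],
        "app": "ExamplePlayer",
    },
]

failures = unmatched_examples(sample_records)
```

Run against the real file, an empty result would mean every listed example is covered by at least one regex of its own record.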
I noticed this in the README:
To stop the list becoming unwieldy, in the future it may be possible to separate out the apps into separate files, that are then combined together automatically.
And thought it sounded fun / fairly simple to implement with Github Actions and tried building a proof-of-concept. Basically the proposed process for adding to the user-agents.json works as follows...
Shorter explanation:
Add user-agent objects into src/organizations// as separate JSON files.
Create a PR to merge the branch with your JSON files in with opawg/user-agents#master
The GitHub Action should take care of everything else, ultimately resulting in a combined JSON found at dist/user-agents.json :)
Longer explanation:
User-agent objects are added into a src/organizations/ directory. (examples)
When a PR is made to merge your changes into the opawg/user-agents, a Github Action runs the following steps:
The patch version in the package.json is automatically incremented (1.0.0 becomes 1.0.1)
The combine-jsons command is run from the package.json. It searches the src/organizations directory for all JSON files and combines them into the array in the user-agents.json file, in alphabetical order by organization and sorted by a new priority field within each organization.
The new combined user-agents.json file is then saved to dist/user-agents.json (the latest version), and dist/archives/<package.json version number>/user-agents.json.
A corresponding user-agents.yaml file is generated and saved in dist/user-agents.yaml and dist/archives/<package.json version number>/user-agents.yaml.
The JSON output files in the dist are then validated using the validate-json-action Github Action.
If all of the previous steps succeed, the last step is that the GitHub Action automatically pushes the new JSON and YAML files in the dist folder into the branch.
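The combine and version-bump steps above can be sketched as follows. This is a Python illustration of the idea, not the proof-of-concept's actual script; it assumes one record object per JSON file, one sub-directory per organization, and a numeric "priority" field, and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def combine(src_dir: Path) -> list:
    """Merge per-organization JSON files into one list.

    Organizations are taken in alphabetical order; records inside each
    organization are ordered by their "priority" field (missing means 0).
    """
    combined = []
    for org_dir in sorted(p for p in src_dir.iterdir() if p.is_dir()):
        records = [
            json.loads(path.read_text(encoding="utf-8"))
            for path in sorted(org_dir.glob("*.json"))
        ]
        records.sort(key=lambda record: record.get("priority", 0))
        combined.extend(records)
    return combined


def bump_patch(version: str) -> str:
    """Increment the patch component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"


# Demonstration in a throwaway directory with made-up organizations and apps.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src" / "organizations"
    (src / "beta").mkdir(parents=True)
    (src / "alpha").mkdir(parents=True)
    (src / "beta" / "b.json").write_text(json.dumps({"app": "B", "priority": 1}))
    (src / "alpha" / "low.json").write_text(json.dumps({"app": "A2", "priority": 2}))
    (src / "alpha" / "high.json").write_text(json.dumps({"app": "A1", "priority": 1}))

    combined = combine(src)

    # Write the merged list where the proposal places the latest version.
    dist = Path(tmp) / "dist"
    dist.mkdir()
    (dist / "user-agents.json").write_text(json.dumps(combined, indent=2))
    written = json.loads((dist / "user-agents.json").read_text())

combined_apps = [record["app"] for record in combined]
next_version = bump_patch("1.0.0")
```

Keeping the ordering rule in one function makes the output deterministic, which matters for a list where the first matching regex usually wins.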
Anyway, all of the steps above can be changed or optimized, I just wanted to get something that accomplishes automatically combining separate files into a single file, with versioning history in case it helps you get started.
If you'd like to see the code, the commit history for the PR is far too messy, so I don't propose merging this in as is. But if this is a direction that you think would be helpful for opawg/user-agents, I would be happy to create a new, cleaner PR with the changes you would like.
Proof-of-Concept PR: https://github.com/podverse/user-agents/pull/1/files
Sample dist folder: https://github.com/podverse/user-agents/tree/autoCombineJSONs
Sample organizations folder: https://github.com/podverse/user-agents/tree/autoCombineJSONs/src/organizations
Thanks for taking the lead on the user agents initiative! It's amazing how just a few lines of code change to podcast apps can make such a big improvement for the podcast ecosystem. Please let me know if there is more I can do to help.
Unless this file can be updated automatically, I'd like to propose deleting it. It's a poor advertisement for this repo.
In the JSON:
{
  "user_agents": [
    "^doubleTwist CloudPlayer"
  ],
  "examples": [
    "doubleTwist CloudPlayer"
  ],
  "app": "doubleTwitch CloudPlayer",
  "device": "phone",
  "info_url": "https://www.doubletwist.com/cloudplayer",
  "os": "android"
},
You probably want app to be doubleTwist CloudPlayer.
Friends,
Would it be possible to create releases/version numbers for when the list is updated? That would make it much easier to coordinate updates of gems, such as https://github.com/dan/podcast_agent_parser/ which relies on your fantastic list.
Thanks!
Would be great to have samples of the user agents for each of the regex sections. That would simplify writing tests for the implementation on different platforms.
It would also be helpful in the future when writing updates for missing / new versions since you can compare the newly reported user-agent with the already existing one and adjust the rule to cover both or add an additional.
This record has the wrong key: it has user-agents instead of user_agents.
user-agents/src/user-agents.json
Line 1604 in c5f3ad3
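A small check could catch this kind of slip before it is merged. The sketch below assumes that every record is expected to carry a "user_agents" key; the helper name and the sample records are hypothetical.

```python
def records_missing_key(records, key="user_agents"):
    """Return the indexes of records that lack the expected key."""
    return [index for index, record in enumerate(records) if key not in record]


# Made-up records: the second one uses a hyphen instead of an underscore.
sample_records = [
    {"app": "Example A", "user_agents": ["^ExampleA/"]},
    {"app": "Example B", "user-agents": ["^ExampleB/"]},
]

missing = records_missing_key(sample_records)
```

Here `missing` contains the index of the second record only.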