chorsley / python-wappalyzer
Python driver for Wappalyzer, a web application detection utility.
License: GNU General Public License v3.0
Looks like we had code coverage set up, but it's broken.
I'll see if I can set up Codecov instead.
Hello, if I want to change apps.json and add some features to it, how should I do this? When I change the JSON file, it stops working and I get errors.
Hi @chorsley,
I'd need you to enable the GitHub Pages feature, from Settings, and set it to the gh-pages branch, root folder.
Thanks
Since #36, implied technologies that carry a confidence value are parsed, and if the confidence is equal to or greater than 50, the technology is added to the results. But we do not track this confidence afterwards.
It would be good to have some control over this behaviour in the analyze method arguments.
Additionally, this feature only works when using the analyze method, not the others.
Also, I'm not too sure what get_confidence(app_name) is doing. It seems to return the sum of the confidences taken from the technologies.json regex expressions for a specified app, but it does not return the confidence that a certain website has an app. And I think it's buggy, since it always returns an empty list for me.
Confidence per technology is still a missing feature in this implementation.
Hello,
technologies.json now has a new key, "dom". When will you support it?
I think we can get the data with:
self.technologies.get(app).get("icon")
I keep getting this warning
/root/anaconda3/lib/python3.7/site-packages/Wappalyzer/Wappalyzer.py:228: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
Is there any way to disable these warnings?
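One way to silence only this family of warnings is a message filter from the standard warnings module, similar to the filterwarnings call another reporter uses further down this page. The sketch below is a minimal example, not something the library provides; the helper name silence_regex_compile_warnings is made up for illustration. The message argument is a regular expression matched against the start of the warning text.

```python
import warnings


def silence_regex_compile_warnings():
    """Ignore UserWarnings whose text starts with "Caught '...' compiling regex".

    Hypothetical helper: it only installs a standard warnings filter and
    should be called before the Wappalyzer object is created.
    """
    warnings.filterwarnings(
        "ignore",
        message=r"Caught '.*' compiling regex",
        category=UserWarning,
    )


# Demonstration: only the unrelated warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_regex_compile_warnings()
    warnings.warn(
        "Caught 'unbalanced parenthesis at position 119' compiling regex: [...]",
        UserWarning,
    )
    warnings.warn("something unrelated", UserWarning)

print([str(w.message) for w in caught])
```

Filtering by message rather than ignoring every UserWarning keeps other, possibly useful, warnings visible.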
This warning fires occasionally.
/Wappalyzer/Wappalyzer.py:58: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup. self.parsed_html = soup = BeautifulSoup(self.html, 'lxml')
It would be nice to suppress these kinds of BeautifulSoup warnings:
import warnings
from bs4 import MarkupResemblesLocatorWarning
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)
Thanks!
I recently encountered an issue where a regex in apps.json triggered catastrophic backtracking, hanging execution of python-Wappalyzer.
For the sake of convenience and speed, I'd suggest it's worth bundling a known good version of apps.json directly in this package. It could be regularly updated as the main Wappalyzer repo is, or alternatively, the user may specify a path to their own apps.json if they want the bleeding edge version.
If this seems acceptable, I'll send a pull request for this.
Hello,
I see some websites with scripts which are included inline like:
<script type="text/plain" ...
without src.
In Wappalyzer.py I found:
self.scripts = [script['src'] for script in soup.findAll('script', src=True)]
so inline scripts are not taken into account.
When I try to change it into:
self.scripts = soup.findAll('script', type='text/plain', src=True)
no JS is recognized any more.
How would I be able to recognize inline technologies?
Script examples (Mautic + Matomo):
<script type="text/plain" data-cli-class="cli-blocker-script" data-cli-script-type="non-necessary" data-cli-block="true" data-cli-element-position="body">
(function(w,d,t,u,n,a,m){w['MauticTrackingObject']=n;
w[n]=w[n]||function(){(w[n].q=w[n].q||[]).push(arguments)},a=d.createElement(t),
m=d.getElementsByTagName(t)[0];a.async=1;a.src=u;m.parentNode.insertBefore(a,m)
})(window,document,'script','https://www.domain.de/m/mtc.js','mt');
mt('send', 'pageview');
</script>
<!-- Matomo -->
<script type="text/plain" data-cli-class="cli-blocker-script" data-cli-script-type="non-necessary" data-cli-block="true" data-cli-element-position="body">
var _paq = window._paq || [];
/* tracker methods like "setCustomDimension" should be called before "trackPageView" */
_paq.push(['trackPageView']);
_paq.push(['enableLinkTracking']);
(function() {
var u="//stats.domain.de/";
_paq.push(['setTrackerUrl', u+'matomo.php']);
_paq.push(['setSiteId', '1']);
var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
})();
</script>
<!-- End Matomo Code -->
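The attempted change keeps src=True, which still requires a src attribute, so inline scripts like the two above can never match it. As a rough illustration of the idea (not the library's implementation), inline script bodies can be collected separately from external script URLs. The sketch uses only the standard library's html.parser so that it runs on its own; the ScriptCollector class and its attribute names are hypothetical.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect external script URLs and inline script bodies separately."""

    def __init__(self):
        super().__init__()
        self.srcs = []       # values of <script src="...">
        self.inline = []     # text of <script> elements that have no src
        self._buffer = None  # text chunks while inside an inline <script>

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.srcs.append(src)
        else:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.inline.append("".join(self._buffer))
            self._buffer = None


html = (
    '<script src="/js/app.js"></script>'
    '<script type="text/plain">var _paq = window._paq || [];</script>'
)
collector = ScriptCollector()
collector.feed(html)
print(collector.srcs, collector.inline)
```

The inline bodies could then be searched for markers such as _paq or MauticTrackingObject. Whether the fingerprint data contains patterns suited to inline code is a separate question that this sketch does not answer.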
Hello,
I noticed you have started a new development phase. Thanks for this, and if I can help, please don't hesitate to ask.
Here is a fork of your repo which implements other methods to parse the versions and categories of the detected applications.
http://github.com/bretfourbe/python-Wappalyzer/
Do you think you could merge those features into master?
Thanks !
eg. wappalyzer.analyze(WebPage.new_from_url('http://msn.com')) throws KeyError: u'IIS;confidence:50'
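The KeyError suggests that the implied-technology string is used as a lookup key with its ;confidence:50 suffix still attached. A minimal sketch of separating the name from its attributes first, assuming the same \; separator that the fingerprint patterns use; split_implied is a hypothetical helper, not part of the library.

```python
def split_implied(entry):
    """Split an implied-technology entry into (name, attrs).

    Accepts both the escaped separator used in the JSON data (backslash
    followed by a semicolon) and a bare semicolon.
    """
    parts = entry.replace("\\;", ";").split(";")
    name = parts[0]
    attrs = {}
    for part in parts[1:]:
        key, _, value = part.partition(":")
        attrs[key] = value
    return name, attrs


name, attrs = split_implied("IIS;confidence:50")
print(name, attrs)
```

Looking up only name in the technologies mapping avoids the KeyError, and attrs keeps the confidence available for later use.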
'DigiCert' fingerprint: { 'cats': [70],
'certIssuer': 'DigiCert',
'icon': 'DigiCert.svg',
'website': 'https://www.digicert.com/'}
JSON schema:
"certIssuer": {
"oneOf": [
{
"type": "array",
"items": {
"$ref": "#/definitions/non-empty-non-blank-string"
}
},
{
"$ref": "#/definitions/non-empty-non-blank-string"
}
]
},
When I use the TOR SOCKS5 proxy via:
import requesocks
session = requesocks.session()
session.proxies = {'http': 'socks5://127.0.0.1:9050', 'https': 'socks5://127.0.0.1:9050'}
print session.get("http://httpbin.org/ip").text
import requests
print requests.get("http://httpbin.org/ip").text
and then I use python-Wappalyzer with
from Wappalyzer import Wappalyzer, WebPage
wappalyzer = Wappalyzer.latest()
webpage = WebPage.new_from_url("http://zqktlwi4fecvo6ri.onion")
webpage = list(wappalyzer.analyze(webpage))
It returns [u'jQuery', u'Nginx', u'Google Tag Manager', u'Underscore.js']
Not too sure why, but I've disabled the PyPy tests here: bb3c4d2
It would be good to enable them again.
The script fails if the website does not respond quickly enough. Try paxchristi.de, which is up but slow. It constantly fails for me. An explicit wait in the right place would probably help.
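One possible shape for such a wait is a retry loop with growing timeouts around whatever function performs the download. This is only a sketch: fetch_with_retries and its parameters are hypothetical, and the fetch callable stands in for the real HTTP call, which would need to accept a timeout argument.

```python
import time


def fetch_with_retries(fetch, url, timeouts=(5, 15, 30), pause=0.0):
    """Call fetch(url, timeout=t) with growing timeouts until one succeeds."""
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for timeout in timeouts:
        try:
            return fetch(url, timeout=timeout)
        except Exception as error:  # narrow to the client's timeout error in real code
            last_error = error
            time.sleep(pause)  # optional pause before the next attempt
    raise last_error


# Stand-in for a slow site: succeeds only with a timeout of 15 s or more.
calls = []


def fake_fetch(url, timeout):
    calls.append(timeout)
    if timeout < 15:
        raise TimeoutError("too slow")
    return "<html></html>"


result = fetch_with_retries(fake_fetch, "http://example.invalid")
print(result, calls)
```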
I intend to switch the tests to a github action to simplify the maintenance.
Please speak up if you disagree.
Thanks.
I would find it useful if it had a CLI mode, while still continuing to function as a module.
Hi, I was wondering if it automatically follows redirects or not.
It could be useful to add an option to specify if the redirect should be followed or not.
Setup
platform.python_version()
'3.9.1'
version('python-Wappalyzer')
'0.3.1'
Description
I am getting the following error when trying to update the technologies.
wappalyzer = Wappalyzer.latest(update=True)
Traceback (most recent call last):
File "", line 1, in
TypeError: latest() got an unexpected keyword argument 'update'
If I try to load the JSON manually, I get this error.
wappalyzer = Wappalyzer.latest(technologies_file='data/technologies.json')
/Users/knsankar/mydata/el-wap/venv/lib/python3.9/site-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
warnings.warn(
Hey guys,
I'm testing this library, but I'm seeing this error:
/usr/local/lib/python3.8/dist-packages/python_Wappalyzer-0.3.1-py3.8.egg/Wappalyzer/Wappalyzer.py:249: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
The regex seems fine, but Python still fails to compile it.
I'm testing with Python 3.8, the current library version (via git) and the most recent technologies.json.
Any thoughts ?
Ty!
The site is PrestaShop: https://www.prestashop.com
The response says it's Drupal:
{'Nginx', 'DoubleClick for Publishers (DFP)', 'Drupal', 'Google Tag Manager', 'PHP', 'Google Font API'}
from Wappalyzer import Wappalyzer, WebPage
import warnings
warnings.filterwarnings("ignore", message="""Caught 'unbalanced parenthesis at position 119' compiling regex""", category=UserWarning )
httpResponse="""
<!doctype html><html><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/></head><body><script src="fingerprinted/js/m-outer-fe96732da72c6a6f4c4db1ff14c37915.js"></script></body></html>
"""
webpage = WebPage("https://js.stripe.com/v3/m-outer-59cdd15d8db95826a41100f00b589171.html",httpResponse,{'test':'test'})
wappalyzer = Wappalyzer.latest()
print(wappalyzer.analyze(webpage))
I already have an HTTP response from this URL. I store it in the httpResponse variable and then call the WebPage class like this:
webpage = WebPage("https://js.stripe.com/v3/m-outer-59cdd15d8db95826a41100f00b589171.html",httpResponse,{'test':'test'})
However, print(wappalyzer.analyze(webpage)) prints an empty set: set()
UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
Hi @chorsley ,
Could you follow this process to add a PyPI key to the GitHub secrets?
In your PyPI account settings, go to the API tokens section and select "Add API token".
Then, in the GitHub repo settings, add the value to a new secret named "PYPI_TOKEN".
The package should then be published automatically when new tags are pushed and the tests pass :D
Thank you!
Due to my current situation, it's unlikely I'll have time to examine the pull requests and issues outstanding for some months. If you'd like to volunteer to take over stewardship of this repository, it would be greatly appreciated - please let me know.
I am trying to use the maintained JSON from the Wappalyzer library, but the Python code fails with 'list' object has no attribute 'split'. Is there any known way to solve this?
Hello Team,
While using Wappalyzer in Python 3, I am getting an import error with this dependency. I installed WebPage through pip, but I am still getting the error.
Note: When I install Wappalyzer through pip3, it shows ImportError: cannot import name 'Wappalyzer'. When I install it from GitHub, it gives the message "ImportError: cannot import name 'WebPage'".
Please help me to fix this issue.
Thanks and Regards,
Faiz Ahmed Zaidi
APPS_JSON_URL = 'https://raw.github.com/ElbertF/Wappalyzer/master/share/apps.json' is an invalid address.
Hey,
The module uses Python's default headers when making a request to a website, i.e. {'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.9.1'}.
(You can see that here)
Wappalyzer doesn't work on sites which block Python's default User-Agent. Can you introduce a new function which allows users to add custom headers?
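Until such an option exists, one workaround is to fetch the page yourself with the headers you want and hand the result to WebPage through the three-argument constructor (url, html, headers) used in another report on this page. The sketch below uses only the standard library's urllib; build_request and fetch_html are hypothetical helpers and the User-Agent string is an arbitrary example.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-scanner)",
                  extra_headers=None):
    """Build a request that carries a custom User-Agent and optional extra headers."""
    headers = {"User-Agent": user_agent}
    headers.update(extra_headers or {})
    return Request(url, headers=headers)


def fetch_html(url, timeout=10, **kwargs):
    """Return (html, response_headers) for url; performs a real network request."""
    with urlopen(build_request(url, **kwargs), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
        return html, dict(response.headers)


# No network access is needed just to inspect the outgoing headers.
request = build_request("http://example.com", extra_headers={"Accept-Language": "en"})
print(request.header_items())
```

The html and headers returned by fetch_html could then be passed on as WebPage(url, html, headers).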
This issue is a feature request
The technologies are now split across several files: https://github.com/AliasIO/wappalyzer/tree/master/src/technologies
This line is failing with a 404: https://github.com/chorsley/python-Wappalyzer/blob/master/Wappalyzer/Wappalyzer.py#L246
We can simply get all files and merge them into a single data structure.
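A minimal sketch of that merge step, assuming each file is a JSON object keyed by technology name and that the files have already been downloaded into a local directory. The function name merge_technology_files is hypothetical, and the two throwaway files below contain placeholder data, not real fingerprints.

```python
import json
import tempfile
from pathlib import Path


def merge_technology_files(directory):
    """Merge every *.json file in `directory` into one dict keyed by technology name."""
    merged = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            merged.update(json.load(handle))
    return merged


# Two placeholder files standing in for the per-letter technology files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(json.dumps({"Abicart": {}}), encoding="utf-8")
    Path(tmp, "n.json").write_text(json.dumps({"Nginx": {}}), encoding="utf-8")
    merged = merge_technology_files(tmp)

print(sorted(merged))
```

If two files ever defined the same key, the later file would win silently, so a real implementation might want to detect duplicates.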
There is a gap between the version available in PyPI and the GitHub repository.
Could you please update PyPI to support the latest version of the GitHub repository?
The last package version was released in September 2020.
self = <Wappalyzer.Wappalyzer.Wappalyzer object at 0x7fe4d83caf70>, pattern = ['Abicart', 'Textalk Webshop']
def _prepare_pattern(self, pattern):
"""
Strip out key:value pairs from the pattern and compile the regular
expression.
"""
attrs = {}
> pattern = pattern.split('\\;')
E AttributeError: 'list' object has no attribute 'split'
/usr/local/lib/python3.8/site-packages/Wappalyzer/Wappalyzer.py:219: AttributeError
Last good wappalyzer commit 842f18aa86faaaf523e9b98e5419e213b40302ca
Broken wappalyzer commit e3bf786826318160f4016b206f5dcd9853c50da0
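The traceback shows _prepare_pattern receiving a list (['Abicart', 'Textalk Webshop']) where it expects a string. One way to tolerate both shapes is to normalise to a list before splitting. This is a sketch under that assumption, not the project's actual fix; prepare_patterns and the keys of the returned dicts are illustrative names.

```python
import re


def prepare_patterns(pattern):
    """Accept a string or a list of strings and return a list of prepared patterns."""
    raw_patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    prepared = []
    for raw in raw_patterns:
        # The first segment is the regex; later segments are key:value attributes.
        expression, *attr_parts = raw.split("\\;")
        attrs = {}
        for part in attr_parts:
            key, _, value = part.partition(":")
            attrs[key] = value
        prepared.append({
            "string": expression,
            "regex": re.compile(expression, re.I),
            "attrs": attrs,
        })
    return prepared


patterns = prepare_patterns(["Abicart", "Textalk Webshop"])
print([p["string"] for p in patterns])
```

Callers that previously handled a single prepared pattern would need to loop over the returned list.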
The following code works with Python 3.9 but correctly warns about a bad regex in Python 3.11:
from Wappalyzer import Wappalyzer, WebPage
WPL = Wappalyzer.latest()
webpage = WebPage.new_from_url(url)
web_record = WPL.analyze_with_versions_and_categories(webpage)
Trying to run this with Python 3.11 on "http://yahoo.com", I get:
.../python3.11/site-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex:
['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
----------------------------------^^^ invalid?
The 'position 119' seems to be a delayed reaction to the core issue.
Indeed, it looks like the sub-regex [^]+ just before is invalid, since ^ is a negation/complement for the char-class, which is empty here.
The problem is in the data-file:
Wappalyzer/data/technologies.json (towards the end, technologies are alphabetically sorted)
The rule for "Symfony": "html": should be (one char change):
"html": "(?:<div class=\"sf-toolbar[^>]+?>[^<]+<span class=\"sf-toolbar-value\">([\\d.])+|<div id=\"sfwdt[^\"]+\" class=\"[^\"]*sf-toolbar)\\;version:\\1",
------------------------------------------^^^^ the fix
Fixed in this PR:
#80
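The diagnosis can be checked with the re module alone. In Python's regex syntax, a ] placed first in a set (directly after [ or [^) is taken as a literal member, so [^] does not close the set; it runs on to the ] after \d. and the final ) at index 119 is left without a partner. JavaScript, by contrast, accepts [^] as "any character", which would explain why the pattern works upstream. The sketch below compares the reported pattern with the one-character fix; the compiles helper is hypothetical.

```python
import re
import warnings

BROKEN = (
    r'(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\d.])+'
    r'|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)'
)
FIXED = BROKEN.replace("[^]+", "[^<]+")  # the one-character change described above


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide the "possible nested set" FutureWarning
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


print(compiles(BROKEN), compiles(FIXED))

# Made-up markup, only to show that the fixed pattern can match.
sample = '<div class="sf-toolbar-block">Symfony <span class="sf-toolbar-value">5'
print(re.search(FIXED, sample) is not None)
```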
/usr/local/lib/python3.8/dist-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
warnings.warn(
{'Azure CDN', 'Am
When I use update=True:
wappalyzer = Wappalyzer.latest(update=True)
Could not download latest Wappalyzer technologies.json file because of error : '[Errno Extra data] 404: Not Found: 3'. Using default.
File "python-Wappalyzer/Wappalyzer/Wappalyzer.py", line 270, in latest
logger.info("Using technologies.json file at {}".format(_technologies_file.as_posix()))
UnboundLocalError: local variable '_technologies_file' referenced before assignment
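The UnboundLocalError pattern usually means a variable is assigned only on the success path of a try block and then read after the failure path. A minimal sketch of the usual remedy is to bind the default before attempting the download. This illustrates the pattern only and is not the library's code; choose_technologies_file and the download callable are hypothetical.

```python
from pathlib import Path


def choose_technologies_file(download, default_path):
    """Return the downloaded file path, or the bundled default if the download fails."""
    technologies_file = Path(default_path)  # bound before the try block
    try:
        technologies_file = Path(download())
    except Exception as error:
        print(f"Could not download the latest file ({error}); using default.")
    return technologies_file


def failing_download():
    # Stand-in for a download that hits a 404.
    raise OSError("404: Not Found")


chosen = choose_technologies_file(failing_download, "data/technologies.json")
print(chosen.as_posix())
```

Because the variable is bound before the try block, the later logging call always has a value to format.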
We should implement an argument to the Wappalyzer.latest method to auto-update and use the latest technologies.json file:
wappalyzer=Wappalyzer.latest(update=True)
See #30