chorsley / python-wappalyzer

Python driver for Wappalyzer, a web application detection utility.

License: GNU General Public License v3.0

Python 98.94% Makefile 1.06%

python-wappalyzer's People

jbarratt offlinehacker eon01 ryandub danieljurek atcazzual gmoncarz lnxg33k 790854836 petermosmans abdelhai s03d4-164 dhruv-minewhat yuhuating varshit97 tr33oph aviabrams itimky yurikroz npc7 phpplay space-pirate pirate-sapce sachuin23 shuisman phoogerheide zhonghaoling ofogel secureideasllc theinnovator c-goes amua motleycrew ismael-safadi honeybot cclauss khasmek hien ahmetkurukose ldbfpiaoran hland ma5onic drag0nr3b0rn artsturdevant imfht ryshoooo liuyoyi firasslimane igumeni afernandezb92 grayguest creamypandaxx reznok greedycoco aisahe greg-wu andreisobolev byt3bl33d3r jt-secret-project-2 s0rtega hollow667 tristanlatr dogasantos romalash icysun yume96 everping odhiamboobuya baharuddinzulkifli silentsoul04 parshuramreddysudda cyrusradfar litian093488 yyz zanachka jeromeyoung adevoil iofane jadore147258369 asdlei99 hyc-1234 honeyakshat999 anshumansrivastavagit bawed mhmh55516 4mmcat erickrex digits88 vaar-tool antbean brandonscholet arielf ctrl-felix necron3574 ezzeasy tntwkuf hirargb 07h huitail6 scamscannet

python-wappalyzer's Issues

Setup code coverage

Looks like we had code coverage set up, but it's broken.

I'll see if I can set up codecov instead.

how to change apps.json

Hello, if I want to change apps.json and add some entries to it, how should I do this? When I change the JSON file, it stops working and I get errors.

Enable GitHub pages

Hi @chorsley,

I need you to enable the GitHub Pages feature from Settings, and set it to the gh-pages branch, root folder.

Thanks

Improve the implied technologies management and confidence

Since #36, implied technologies that carry a confidence value are parsed, and if the value is equal to or greater than 50, the technology is added to the results. But we do not track this confidence afterwards.
It would be good to have some control over this behaviour in the analyze method arguments.

Additionally, this feature only works when using the analyze method, not the others.

Also, I'm not too sure what get_confidence(app_name) is doing. It seems to return the sum of the confidences taken from the regular expressions in the technologies.json file for a specified app, but it does not return the confidence that a certain website has an app. I think it's buggy, since it always returns an empty list for me.

The confidence per technology is still a missing feature in this implementation.
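One way the threshold could be exposed as an argument is sketched below. Both helpers (parse_implied and implied_above) are hypothetical names, not part of python-Wappalyzer, and the "Name\\;confidence:N" suffix format is assumed from the implies entries quoted in these issues.

```python
import re

def parse_implied(entry, default_confidence=100):
    """Split an implies entry such as 'IIS\\;confidence:50' into ('IIS', 50)."""
    # Accept both the escaped separator '\;' and a bare ';'.
    name, *attrs = re.split(r"\\?;", entry)
    confidence = default_confidence
    for attr in attrs:
        key, _, value = attr.partition(":")
        if key == "confidence" and value.isdigit():
            confidence = int(value)
    return name, confidence

def implied_above(entries, min_confidence=50):
    """Return {name: confidence} for implied entries that meet the threshold."""
    parsed = dict(parse_implied(entry) for entry in entries)
    return {name: conf for name, conf in parsed.items() if conf >= min_confidence}
```

Returning the confidence alongside the name keeps the value available to callers instead of discarding it after the threshold check.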

dom key in technologies.json

Hello,

technologies.json now has a new key, "dom". When will you support it?

create a Wappalyzer.get_icon(name) method

I think we can get the data with:

self.technologies.get(app).get("icon")
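Note that the chained call above raises AttributeError for an unknown name, because the first .get() returns None. A minimal standalone sketch that avoids this is shown here; get_icon as a free function and the sample dictionary are illustrative assumptions, not the library's API.

```python
def get_icon(technologies, name, default=None):
    """Return the icon filename recorded for `name`, or `default` if absent."""
    # An empty dict as the fallback keeps the second .get() safe.
    return technologies.get(name, {}).get("icon", default)

technologies = {"DigiCert": {"cats": [70], "icon": "DigiCert.svg"}}
print(get_icon(technologies, "DigiCert"))  # DigiCert.svg
print(get_icon(technologies, "Unknown"))   # None
```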

how to disable annoying warning

I keep getting this warning

/root/anaconda3/lib/python3.7/site-packages/Wappalyzer/Wappalyzer.py:228: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']

is there any way to disable warnings?
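A possible workaround, using only the standard warnings module, is to suppress UserWarning for the duration of a single call. run_quietly is a hypothetical helper name introduced here for illustration.

```python
import warnings

def run_quietly(func, *args, **kwargs):
    """Call func with UserWarning suppressed only while the call runs."""
    with warnings.catch_warnings():
        # The previous warning filters are restored when the block exits.
        warnings.simplefilter("ignore", category=UserWarning)
        return func(*args, **kwargs)
```

The call that triggers the warning would be wrapped, for example run_quietly(Wappalyzer.latest). This hides the symptom only; the malformed regex is still skipped by the library.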

Suppress BeautifulSoup Warnings

This warning fires occasionally.

/Wappalyzer/Wappalyzer.py:58: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup. self.parsed_html = soup = BeautifulSoup(self.html, 'lxml')

It would be nice to suppress these kinds of BeautifulSoup warnings:

import warnings
# UserWarning is a Python built-in, so it does not need to be imported from bs4
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

Thanks!

Suggestion: include apps.json in package

I recently encountered an issue where a regex in apps.json triggered catastrophic backtracking, hanging the execution of python-Wappalyzer.

For the sake of convenience and speed, I'd suggest it's worth bundling a known good version of apps.json directly in this package. It could be regularly updated as the main Wappalyzer repo is, or alternatively, the user may specify a path to their own apps.json if they want the bleeding edge version.

If this seems acceptable, I'll send a pull request for this.

Recognize scripts which are inline HTML and not "active" due to GDPR

Hello,

I see some websites with scripts which are included inline like:
<script type="text/plain" ...
without src.

In Wappalyzer.py I found:
self.scripts = [script['src'] for script in soup.findAll('script', src=True)]

so inline scripts are ignored.
When I change it to:
self.scripts = soup.findAll('script', type='text/plain', src=True)
no JS is recognized any more.

How would I be able to recognize inline technologies?

Script examples (Mautic + Matomo):

<script type="text/plain" data-cli-class="cli-blocker-script"  data-cli-script-type="non-necessary" data-cli-block="true"  data-cli-element-position="body">
    (function(w,d,t,u,n,a,m){w['MauticTrackingObject']=n;
        w[n]=w[n]||function(){(w[n].q=w[n].q||[]).push(arguments)},a=d.createElement(t),
        m=d.getElementsByTagName(t)[0];a.async=1;a.src=u;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.domain.de/m/mtc.js','mt');

    mt('send', 'pageview');
</script>

<!-- Matomo -->
<script type="text/plain" data-cli-class="cli-blocker-script"  data-cli-script-type="non-necessary" data-cli-block="true"  data-cli-element-position="body">
  var _paq = window._paq || [];
  /* tracker methods like "setCustomDimension" should be called before "trackPageView" */
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u="//stats.domain.de/";
    _paq.push(['setTrackerUrl', u+'matomo.php']);
    _paq.push(['setSiteId', '1']);
    var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
    g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
  })();
</script>
<!-- End Matomo Code -->
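The attempted change still passes src=True, which only matches script tags that have a src attribute, so inline scripts can never be selected by it. A rough sketch of collecting both kinds separately follows. It uses the standard library's html.parser instead of the BeautifulSoup call quoted above, and ScriptCollector and collect_scripts are hypothetical names.

```python
from html.parser import HTMLParser

class ScriptCollector(HTMLParser):
    """Collect external script URLs and inline script bodies separately."""

    def __init__(self):
        super().__init__()
        self.sources = []        # values of src attributes
        self.inline = []         # text of <script> elements without src
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.sources.append(src)
        else:
            # Inline script of any type, including type="text/plain".
            self._in_inline = True
            self.inline.append("")

    def handle_data(self, data):
        if self._in_inline:
            self.inline[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False

def collect_scripts(html):
    """Return (sources, inline_bodies) found in the given HTML string."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    return parser.sources, parser.inline
```

The inline bodies could then be searched for the URLs they load (such as mtc.js or matomo.js in the examples above) with the same patterns used for script sources.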

Merging featured fork

Hello,

I noticed you started a new development phase. Thanks for this, and if I can help, please don't hesitate to ask.

Here is a fork of your repo which implements other methods to parse the versions and categories of the detected applications.
http://github.com/bretfourbe/python-Wappalyzer/

Do you think you could merge those features into master?

Thanks !

Key Error on apps with implies: "Foobar\\;confidence:50"

For example, wappalyzer.analyze(WebPage.new_from_url('http://msn.com')) raises KeyError: u'IIS;confidence:50'

Understand 'certIssuer' field

'DigiCert' fingerprint: {   'cats': [70],
    'certIssuer': 'DigiCert',
    'icon': 'DigiCert.svg',
    'website': 'https://www.digicert.com/'}

JSON schema:

"certIssuer": {
        "oneOf": [
          {
            "type": "array",
            "items": {
              "$ref": "#/definitions/non-empty-non-blank-string"
            }
          },
          {
            "$ref": "#/definitions/non-empty-non-blank-string"
          }
        ]
      },

Error while connecting with TOR SOCKS5 Proxy

When I use the TOR SOCKS5 proxy via:

import requesocks
session = requesocks.session()

session.proxies = {'http': 'socks5://127.0.0.1:9050','https': 'socks5://127.0.0.1:9050'}

print session.get("http://httpbin.org/ip").text

import requests

print requests.get("http://httpbin.org/ip").text

and then I use python-Wappalyzer with

from Wappalyzer import Wappalyzer, WebPage

wappalyzer = Wappalyzer.latest()

webpage = WebPage.new_from_url("http://zqktlwi4fecvo6ri.onion")

webpage = list(wappalyzer.analyze(webpage))
It returns

[u'jQuery', u'Nginx', u'Google Tag Manager', u'Underscore.js']

Test with PyPy in CI

Not too sure why but I've disabled the PyPy tests here bb3c4d2

Would be good to enable it again.

Fails if website has slow response rate

The script fails if the website does not respond quickly enough. Try paxchristi.de, which is up but slow. It fails for me every time. An explicit wait in the right place would probably help.
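Until the library handles slow sites itself, a caller could retry the fetch with a growing pause between attempts. The helper below is a generic sketch: fetch_with_retries and its parameters are hypothetical, and it works with any callable that takes a URL.

```python
import time

def fetch_with_retries(fetch, url, attempts=3, delay=1.0, backoff=2.0):
    """Call fetch(url) up to `attempts` times, sleeping between failures."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to the fetcher's timeout error
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
                delay *= backoff    # wait longer before each new attempt
    raise last_error
```

It might be used as fetch_with_retries(WebPage.new_from_url, "http://paxchristi.de"), assuming the failure surfaces as an exception from that call.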

Switch to github actions

I intend to switch the tests to a github action to simplify the maintenance.

Please speak up if you disagree.

Thanks.

feature request - cli mode

It would be useful to have a CLI mode, while it continues to work as a module.

Does it follow redirects?

Hi, I was wondering if it automatically follows redirects or not.
It could be useful to add an option to specify if the redirect should be followed or not.

Fail to add custom headers

Hey,
The Wappalyzer module says that we can add our own custom headers, but I am not able to add headers. See the screenshots below.

Error when try to update

Setup

platform.python_version()
'3.9.1'

version('python-Wappalyzer')
'0.3.1'

Description
I am getting the following error when trying to update the technologies.

wappalyzer = Wappalyzer.latest(update=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: latest() got an unexpected keyword argument 'update'

If I try to load the JSON manually, I get this error.

wappalyzer = Wappalyzer.latest(technologies_file='data/technologies.json')
/Users/knsankar/mydata/el-wap/venv/lib/python3.9/site-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
warnings.warn(

Several technologies are not detecting

Here are the results from Wappalyzer and python-Wappalyzer for https://avilpage.com

Upgrade to BeautifulSoup 4

unbalanced parenthesis

Hey guys,
I'm testing this library, but I'm seeing this error:

/usr/local/lib/python3.8/dist-packages/python_Wappalyzer-0.3.1-py3.8.egg/Wappalyzer/Wappalyzer.py:249: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']

The regex seems fine, but Python still fails to compile it.
I'm testing with Python 3.8, the current library version (via git) and the most recent technologies.json.

Any thoughts ?
Ty!

Wrong response returned

site is prestashop : https://www.prestashop.com

response: says its Drupal

{'Nginx', 'DoubleClick for Publishers (DFP)', 'Drupal', 'Google Tag Manager', 'PHP', 'Google Font API'}

wappalyzer.analyze(webpage) returns a `set()` output??

from Wappalyzer import Wappalyzer, WebPage
import warnings
warnings.filterwarnings("ignore", message="""Caught 'unbalanced parenthesis at position 119' compiling regex""", category=UserWarning )

httpResponse="""
<!doctype html><html><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/></head><body><script src="fingerprinted/js/m-outer-fe96732da72c6a6f4c4db1ff14c37915.js"></script></body></html>
"""

webpage = WebPage("https://js.stripe.com/v3/m-outer-59cdd15d8db95826a41100f00b589171.html",httpResponse,{'test':'test'})
wappalyzer = Wappalyzer.latest()
print(wappalyzer.analyze(webpage))

I already have an HTTP response from this URL. I store it in the httpResponse variable and then call the WebPage class like this:

webpage = WebPage("https://js.stripe.com/v3/m-outer-59cdd15d8db95826a41100f00b589171.html",httpResponse,{'test':'test'})

However, print(wappalyzer.analyze(webpage)) outputs set().

There is an issue with the regular expression used.

UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']

Add PyPI API token to github secrets

Hi @chorsley ,

Could you follow this process to add the PyPI key to the GitHub secrets?

In your PyPI account settings, go to the API tokens section and select "Add API token".
Then, in the GitHub repo settings, add the value to a new secret named "PYPI_TOKEN".

The package should then be published automatically when new tags are pushed and the tests pass :D

Thank you!

Looking for replacement maintainer

Due to my current situation, it's unlikely I'll have time to examine the pull requests and issues outstanding for some months. If you'd like to volunteer to take over stewardship of this repository, it would be greatly appreciated - please let me know.

Updating to a new technologies.json file from the node Wappalyzer, fails with 'list' object has no attribute 'split'

I am trying to use the maintained JSON from the Wappalyzer library, but the python code fails with
'list' object has no attribute 'split'. Any known way to solve this?

ImportError: cannot import name 'WebPage'

Hello Team,

While using Wappalyzer in python3, I am getting this issue while using this dependency. I install WebPage through pip, but still, I am getting this issue.

*Note: When I install Wappalyzer through pip3 it is showing ImportError: cannot import name 'Wappalyzer'. And When I am installing it through GitHub, it gives "ImportError: cannot import name 'WebPage'" message.

Please help me to fix this issue.

Thanks and Regards,
Faiz Ahmed Zaidi

Is this repo DEAD ?!

The command wappalyzer.latest() doesn't work.

APPS_JSON_URL = 'https://raw.github.com/ElbertF/Wappalyzer/master/share/apps.json' is an invalid address.

Add custom header to request

Hey,

The module is using the default headers of python when making a request to website i.e. {'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.9.1'}.
(You can see that here)

Wappalyzer doesn't work on sites which block Python's default User-Agent. Can you introduce a new function which allows users to add custom headers?
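As a stopgap, the page can be fetched outside the library with custom headers and the result handed to the WebPage(url, html, headers) constructor that appears in other issues here. The sketch below uses the standard library's urllib rather than requests. build_request, fetch and the example User-Agent string are all illustrative assumptions.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0 (compatible; example-client)"):
    """Create a request that sends a browser-like User-Agent and Accept header."""
    headers = {"User-Agent": user_agent, "Accept": "text/html,*/*;q=0.8"}
    return urllib.request.Request(url, headers=headers)

def fetch(url, timeout=10, **kwargs):
    """Return (html, response_headers) for url; this performs a network request."""
    request = build_request(url, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
        return html, dict(response.headers)
```

The returned pair could then be passed on as WebPage(url, html, headers) for analysis.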

This issue is a feature request

Fix the `update=True` feature

The technologies are now dispatched into several files: https://github.com/AliasIO/wappalyzer/tree/master/src/technologies

This line is failing with a 404: https://github.com/chorsley/python-Wappalyzer/blob/master/Wappalyzer/Wappalyzer.py#L246

We can simply get all files and merge them into a single data structure.
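The merge step could look roughly like this, assuming the files have already been downloaded to a local directory and that each one holds a JSON object keyed by technology name. merge_technology_files is a hypothetical name.

```python
import json
from pathlib import Path

def merge_technology_files(directory):
    """Merge every *.json file in `directory` into one {name: fingerprint} dict."""
    merged = {}
    # Sorted order makes the result deterministic if two files share a key.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            merged.update(json.load(handle))
    return merged
```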

Outdated version of the package in PyPI

There is a gap between the version available in PyPI and the GitHub repository.
Could you please update PyPI to support the latest version of the GitHub repository?
The last package version was released in September 2020.

AttributeError: 'list' object has no attribute 'split'

self = <Wappalyzer.Wappalyzer.Wappalyzer object at 0x7fe4d83caf70>, pattern = ['Abicart', 'Textalk Webshop']

    def _prepare_pattern(self, pattern):
        """
        Strip out key:value pairs from the pattern and compile the regular
        expression.
        """
        attrs = {}
>       pattern = pattern.split('\\;')
E       AttributeError: 'list' object has no attribute 'split'

/usr/local/lib/python3.8/site-packages/Wappalyzer/Wappalyzer.py:219: AttributeError

Last good wappalyzer commit 842f18aa86faaaf523e9b98e5419e213b40302ca

Broken wappalyzer commit e3bf786826318160f4016b206f5dcd9853c50da0

Broken line https://github.com/AliasIO/wappalyzer/commit/e3bf786826318160f4016b206f5dcd9853c50da0#diff-ad6b5aefcc8c0ba3d7c9f4c9ba096652f7bae1235363a40bba4759f3abb374c4R16897
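A defensive version of the pattern preparation could accept either a single string or a list of strings, as the newer data file uses. This is only a sketch of that idea: prepare_patterns and its return shape are hypothetical and differ from the library's _prepare_pattern.

```python
import re

def prepare_patterns(pattern):
    """Accept a string or a list of strings; return compiled patterns with attrs."""
    patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    prepared = []
    for raw in patterns:
        # 'regex\;key:value' -> the expression plus its key:value pairs
        expression, *pairs = raw.split("\\;")
        attrs = dict(pair.split(":", 1) for pair in pairs if ":" in pair)
        prepared.append({"regex": re.compile(expression, re.I), "attrs": attrs})
    return prepared
```

With this shape, the failing value ['Abicart', 'Textalk Webshop'] yields two compiled patterns instead of raising AttributeError.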

Invalid regex in Wappalyzer/data/technologies.json: Symfony: html

The following code works with Python 3.9 but correctly warns about a bad regex in Python 3.11:

   from Wappalyzer import Wappalyzer, WebPage
   WPL = Wappalyzer.latest()
   webpage = WebPage.new_from_url(url)
   web_record = WPL.analyze_with_versions_and_categories(webpage)

Trying to run this with Python 3.11 on "http://yahoo.com", I get:

.../python3.11/site-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex:

['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
----------------------------------^^^ invalid?

The 'position 119' seems to be a delayed reaction to the core issue.

Indeed, the sub-regex [^]+ just before it looks invalid, since ^ negates (complements) the character class, which is empty here.

The problem is in the data-file:
Wappalyzer/data/technologies.json (towards the end, technologies are alphabetically sorted)

The rule for "Symfony": "html": should be (one char change):

"html": "(?:<div class=\"sf-toolbar[^>]+?>[^<]+<span class=\"sf-toolbar-value\">([\\d.])+|<div id=\"sfwdt[^\"]+\" class=\"[^\"]*sf-toolbar)\\;version:\\1",
------------------------------------------^^^^ the fix

Fixed in this PR:
#80
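The one-character change can be checked directly with Python's re module. In Python, a ] placed right after [^ is treated as a literal member of the class, so the class runs on until the next ], which leaves the final parenthesis unmatched. The snippet below is a small check written for illustration; compiles is a hypothetical helper.

```python
import re

BROKEN = (r'(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">'
          r'([\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)')
FIXED = BROKEN.replace("[^]+", "[^<]+")  # the one-character fix

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(compiles(BROKEN))  # False
print(compiles(FIXED))   # True
```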

UserWarning: Caught 'unbalanced parenthesis

/usr/local/lib/python3.8/dist-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
warnings.warn(
{'Azure CDN', 'Am

UnboundLocalError: local variable '_technologies_file' referenced before assignment

When I use update=True:
wappalyzer = Wappalyzer.latest(update=True)

Could not download latest Wappalyzer technologies.json file because of error : '[Errno Extra data] 404: Not Found: 3'. Using default. 
  File "python-Wappalyzer/Wappalyzer/Wappalyzer.py", line 270, in latest
    logger.info("Using technologies.json file at {}".format(_technologies_file.as_posix()))
UnboundLocalError: local variable '_technologies_file' referenced before assignment
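The traceback suggests the variable is only assigned when the download succeeds. A common remedy is to bind the default before the try block, as in this sketch. choose_technologies_file, the download callable and DEFAULT_FILE are hypothetical and stand in for the library's own logic.

```python
from pathlib import Path

DEFAULT_FILE = Path("data/technologies.json")  # hypothetical bundled location

def choose_technologies_file(download, default=DEFAULT_FILE):
    """Return the downloaded file path, or `default` if the download fails."""
    technologies_file = default  # bound before the try, so it always exists
    try:
        technologies_file = download()
    except Exception as error:
        print(f"Could not download latest technologies file: {error}. Using default.")
    return technologies_file
```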

Auto-update to latest technologies.json file

We should implement an argument to the Wappalyzer.latest method to auto-update and use the latest technologies.json file:

wappalyzer=Wappalyzer.latest(update=True)

See #30
