
python-wappalyzer's People

Contributors

atcazzual, aviabrams, bretfourbe, byt3bl33d3r, cclauss, chorsley, claymation, jbarratt, petermosmans, romalash, shuisman, tristanlatr


python-wappalyzer's Issues

Setup code coverage

Looks like we had code coverage set up, but it's broken.

I'll see if I can set up Codecov instead.

how to change apps.json

Hello, if I want to change apps.json and add some features to it, how should I do this? When I change the JSON file, it stops working and I get errors.
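
One possible cause, assuming the errors come from a syntax mistake introduced while editing, is that the file is no longer valid JSON. A minimal sketch for checking this with the standard library follows; the helper name `find_json_error` is hypothetical and not part of python-Wappalyzer.

```python
import json
from typing import Optional


def find_json_error(text: str) -> Optional[str]:
    """Return a short description of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A trailing comma is a common mistake when adding an entry by hand.
print(find_json_error('{"apps": {"Example": {"cats": [1]},}}'))
# A well-formed document reports no error.
print(find_json_error('{"apps": {"Example": {"cats": [1]}}}'))
```

Running the check on the edited file's contents before loading it would at least rule out plain syntax errors.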

Enable GitHub pages

Hi @chorsley,

I'd need you to enable the GitHub Pages feature from Settings and set it to the gh-pages branch, root folder.

Thanks

Improve the implied technologies management and confidence

Since #36, implied technologies that carry a confidence value are parsed, and a technology is added to the results if its confidence is equal to or greater than 50. But we do not track this confidence afterwards.
It would be good to have some control over this behaviour through the analyze method arguments.

Additionally, this feature only works when using the analyze method, not the others.

Also, I'm not too sure what get_confidence(app_name) is doing. It seems to return the sum of the confidences taken from the regex expressions in the technologies.json file for a specified app, but it does not return the confidence that a certain website uses that app. I also think it's buggy, since it always returns an empty list for me.

Confidence per technology is still a missing feature in this implementation.
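
To illustrate the kind of control asked for above, here is a minimal sketch of a configurable threshold for implied technologies that also keeps the confidence in the result. It assumes implied entries are strings such as `WordPress\;confidence:50` and that an entry without a confidence counts as 100. The function name `filter_implied` and its `min_confidence` parameter are hypothetical, not part of the library.

```python
from typing import Dict, Iterable


def filter_implied(implies: Iterable[str], min_confidence: int = 50) -> Dict[str, int]:
    """Map each implied technology that meets the threshold to its confidence."""
    kept: Dict[str, int] = {}
    for entry in implies:
        # The name comes first; optional key:value attributes follow, separated by "\;".
        name, *attributes = entry.split("\\;")
        confidence = 100  # assumption: no explicit confidence means full confidence
        for attribute in attributes:
            key, _, value = attribute.partition(":")
            if key == "confidence":
                confidence = int(value)
        if confidence >= min_confidence:
            kept[name] = confidence
    return kept


implied = ["PHP", "WordPress\\;confidence:50", "MySQL\\;confidence:25"]
print(filter_implied(implied))
print(filter_implied(implied, min_confidence=20))
```

Returning a mapping instead of a bare set of names is one way the confidence could be tracked afterwards.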

how to disable annoying warning

I keep getting this warning

/root/anaconda3/lib/python3.7/site-packages/Wappalyzer/Wappalyzer.py:228: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']

is there any way to disable warnings?
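
One option, using only the standard `warnings` module, is to filter these messages before loading the technologies. This is a sketch that assumes the warning text always starts with `Caught '...' compiling regex`, as in the message above; the function name is hypothetical.

```python
import warnings


def ignore_regex_compile_warnings() -> None:
    """Hide the "Caught '...' compiling regex" UserWarning, keep other warnings."""
    # The message argument is a regular expression matched against the start
    # of the warning text.
    warnings.filterwarnings(
        "ignore",
        message=r"Caught '.*' compiling regex",
        category=UserWarning,
    )


# Call this once, before the technologies file is loaded.
ignore_regex_compile_warnings()
```

Filtering by message rather than by category alone keeps unrelated `UserWarning`s visible.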

Suppress BeautifulSoup Warnings

This warning fires occasionally.

/Wappalyzer/Wappalyzer.py:58: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup. self.parsed_html = soup = BeautifulSoup(self.html, 'lxml')

It would be nice to suppress these kinds of BeautifulSoup warnings:

import warnings
# UserWarning is a Python builtin, so it does not need to be imported from bs4.
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

Thanks!

Suggestion: include apps.json in package

I recently encountered an issue where a regex in apps.json hit catastrophic backtracking, hanging the execution of python-Wappalyzer.

For the sake of convenience and speed, I'd suggest it's worth bundling a known good version of apps.json directly in this package. It could be regularly updated as the main Wappalyzer repo is, or alternatively, the user may specify a path to their own apps.json if they want the bleeding edge version.

If this seems acceptable, I'll send a pull request for this.

Recognize scripts which are inline HTML and not "active" due to GDPR

Hello,

I see some websites with scripts which are included inline like:
<script type="text/plain" ...
without src.

In Wappalyzer.py I found:
self.scripts = [script['src'] for script in soup.findAll('script', src=True)]

so inline scripts are ignored.
When I try to change it to:
self.scripts = soup.findAll('script', type='text/plain', src=True)
no JS is recognized any more.

How would I be able to recognize inline technologies?

Script examples (Mautic + Matomo):

<script type="text/plain" data-cli-class="cli-blocker-script"  data-cli-script-type="non-necessary" data-cli-block="true"  data-cli-element-position="body">
    (function(w,d,t,u,n,a,m){w['MauticTrackingObject']=n;
        w[n]=w[n]||function(){(w[n].q=w[n].q||[]).push(arguments)},a=d.createElement(t),
        m=d.getElementsByTagName(t)[0];a.async=1;a.src=u;m.parentNode.insertBefore(a,m)
    })(window,document,'script','https://www.domain.de/m/mtc.js','mt');

    mt('send', 'pageview');
</script>

<!-- Matomo -->
<script type="text/plain" data-cli-class="cli-blocker-script"  data-cli-script-type="non-necessary" data-cli-block="true"  data-cli-element-position="body">
  var _paq = window._paq || [];
  /* tracker methods like "setCustomDimension" should be called before "trackPageView" */
  _paq.push(['trackPageView']);
  _paq.push(['enableLinkTracking']);
  (function() {
    var u="//stats.domain.de/";
    _paq.push(['setTrackerUrl', u+'matomo.php']);
    _paq.push(['setSiteId', '1']);
    var d=document, g=d.createElement('script'), s=d.getElementsByTagName('script')[0];
    g.type='text/javascript'; g.async=true; g.defer=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
  })();
</script>
<!-- End Matomo Code -->
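
Note that a `src=True` filter still requires a `src` attribute, which the inline tags above do not have, so the attempted change cannot match them. As a rough sketch of how inline script bodies could be collected instead, the example below uses the standard library's `html.parser` rather than BeautifulSoup, so that it is self-contained. The class and function names are hypothetical, and matching the collected text against technology patterns is left out.

```python
from html.parser import HTMLParser
from typing import List


class InlineScriptCollector(HTMLParser):
    """Collect the text of every <script> element that has no src attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.inline_scripts: List[str] = []
        self._inside_inline = False
        self._parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and "src" not in dict(attrs):
            self._inside_inline = True
            self._parts = []

    def handle_data(self, data):
        if self._inside_inline:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside_inline:
            self.inline_scripts.append("".join(self._parts).strip())
            self._inside_inline = False


def collect_inline_scripts(html: str) -> List[str]:
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    return collector.inline_scripts


sample = """
<script src="https://example.com/app.js"></script>
<script type="text/plain" data-cli-block="true">
  var _paq = window._paq || [];
  _paq.push(['trackPageView']);
</script>
"""
for body in collect_inline_scripts(sample):
    print(body)
```

The collector ignores the `type` attribute on purpose, so scripts blocked as `text/plain` by a consent tool are picked up together with ordinary inline scripts.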

Merging featured fork

Hello,

I noticed you began a new dev phase. Thanks for this, and if I can help, please don't hesitate to ask.

Here is a fork of your repo which implements other methods to parse the versions and categories of the detected applications.
http://github.com/bretfourbe/python-Wappalyzer/

Do you think you could merge those features into master?

Thanks !

Understand 'certIssuer' field

'DigiCert' fingerprint: {   'cats': [70],
    'certIssuer': 'DigiCert',
    'icon': 'DigiCert.svg',
    'website': 'https://www.digicert.com/'}

JSON schema:

"certIssuer": {
        "oneOf": [
          {
            "type": "array",
            "items": {
              "$ref": "#/definitions/non-empty-non-blank-string"
            }
          },
          {
            "$ref": "#/definitions/non-empty-non-blank-string"
          }
        ]
      },

Error while connecting with TOR SOCKS5 Proxy

When I use the TOR SOCKS5 proxy via:

import requesocks
session = requesocks.session()

session.proxies = {'http': 'socks5://127.0.0.1:9050','https': 'socks5://127.0.0.1:9050'}

print session.get("http://httpbin.org/ip").text

import requests

print requests.get("http://httpbin.org/ip").text

and then I use python-Wappalyzer with

from Wappalyzer import Wappalyzer, WebPage

wappalyzer = Wappalyzer.latest()

webpage = WebPage.new_from_url("http://zqktlwi4fecvo6ri.onion")

webpage = list(wappalyzer.analyze(webpage))
It returns

[u'jQuery', u'Nginx', u'Google Tag Manager', u'Underscore.js']

Fails if website has slow response rate

The script fails if the website does not respond quickly enough. Try paxchristi.de, which is up but slow. It fails consistently for me. An explicit wait in the right place would probably help.
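
One way to read the "explicit wait" suggestion is a generous timeout combined with a few retries. The sketch below shows that idea with the standard library only. The helper `fetch_with_retries` and all of its parameters are hypothetical, and it is not how python-Wappalyzer itself fetches pages.

```python
import time
import urllib.request


def fetch_with_retries(url, fetch=urllib.request.urlopen, attempts=3,
                       timeout=30.0, pause=2.0, sleep=time.sleep):
    """Call fetch(url, timeout=...) and retry on network errors."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url, timeout=timeout)
        except OSError:
            # URLError and socket timeouts are both subclasses of OSError.
            if attempt == attempts:
                raise
            sleep(pause * attempt)  # wait a little longer before each retry


# Example call (performs a real request, so it is left commented out):
# response = fetch_with_retries("http://paxchristi.de", timeout=60.0)
```

The `fetch` and `sleep` parameters exist only so the retry logic can be exercised without a network connection.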

Switch to github actions

I intend to switch the tests to GitHub Actions to simplify maintenance.

Please speak up if you disagree.

Thanks.

Does it follow redirects?

Hi, I was wondering if it automatically follows redirects or not.
It could be useful to add an option to specify if the redirect should be followed or not.

Fail to add custom headers

Hey,
The wappalyzer module says that we can add our own custom headers, but I am not able to add headers. See the screenshots below.
[Screenshots: github-issue-wapp, github-issue-wapp2]

Error when try to update

Setup

platform.python_version()
'3.9.1'

version('python-Wappalyzer')
'0.3.1'

Description
I am getting the following error when trying to update the technologies.

wappalyzer = Wappalyzer.latest(update=True)
Traceback (most recent call last):
File "", line 1, in
TypeError: latest() got an unexpected keyword argument 'update'

If I try to load the JSON manually, I get this error.

wappalyzer = Wappalyzer.latest(technologies_file='data/technologies.json')
/Users/knsankar/mydata/el-wap/venv/lib/python3.9/site-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
warnings.warn(

unbalanced parenthesis

Hey guys,
I'm testing this library, but I'm seeing this error:

/usr/local/lib/python3.8/dist-packages/python_Wappalyzer-0.3.1-py3.8.egg/Wappalyzer/Wappalyzer.py:249: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']

The regex seems fine, but Python still fails to compile it.
I'm testing with Python 3.8, the current library version (via git) and the most recent technologies.json.

Any thoughts ?
Ty!

wappalyzer.analyze(webpage) returns a `set()` output??

from Wappalyzer import Wappalyzer, WebPage
import warnings
warnings.filterwarnings("ignore", message="""Caught 'unbalanced parenthesis at position 119' compiling regex""", category=UserWarning )

httpResponse="""
<!doctype html><html><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/></head><body><script src="fingerprinted/js/m-outer-fe96732da72c6a6f4c4db1ff14c37915.js"></script></body></html>
"""

webpage = WebPage("https://js.stripe.com/v3/m-outer-59cdd15d8db95826a41100f00b589171.html",httpResponse,{'test':'test'})
wappalyzer = Wappalyzer.latest()
print(wappalyzer.analyze(webpage))

I already have an HTTP response from this URL. I assign it to the httpResponse variable and then construct the WebPage class like this:

webpage = WebPage("https://js.stripe.com/v3/m-outer-59cdd15d8db95826a41100f00b589171.html",httpResponse,{'test':'test'})

However, print(wappalyzer.analyze(webpage)) outputs set().
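
For reference, `set()` is simply how Python prints an empty set, so this output would mean that no technology matched rather than that the call failed. A small sketch, assuming `analyze` returns a plain `set` of names as the output suggests; the `summarize` helper is hypothetical.

```python
def summarize(detected: set) -> str:
    """Turn a set of technology names into a readable line."""
    if not detected:
        return "no technologies detected"
    return ", ".join(sorted(detected))


print(set())                            # an empty set prints as: set()
print(summarize(set()))
print(summarize({"Nginx", "jQuery"}))
```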


Add PyPI API token to github secrets

Hi @chorsley ,

Could you follow the process below to add a PyPI API token to the GitHub secrets?

In your PyPI account settings, go to the API tokens section and select "Add API token".
Then, in the GitHub repo settings, add the value to a new secret named "PYPI_TOKEN".

Publication of the package should then be automatic when new tags are pushed and the tests pass :D

Thank you!

Looking for replacement maintainer

Due to my current situation, it's unlikely I'll have time to examine the pull requests and issues outstanding for some months. If you'd like to volunteer to take over stewardship of this repository, it would be greatly appreciated - please let me know.

ImportError: cannot import name 'WebPage'

Hello Team,

While using Wappalyzer in Python 3, I get this error when using this dependency. I installed WebPage through pip, but I still get the error.


*Note: When I install Wappalyzer through pip3, it shows ImportError: cannot import name 'Wappalyzer'. When I install it from GitHub, it gives the message "ImportError: cannot import name 'WebPage'".

Please help me to fix this issue.

Thanks and Regards,
Faiz Ahmed Zaidi

Add custom header to request

Hey,

The module uses the default headers of Python's requests library when making a request to a website, i.e. {'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'User-Agent': 'python-requests/2.9.1'}.
(You can see that here)

Wappalyzer doesn't work on sites which block Python's default User-Agent. Can you introduce a new function which allows users to add custom headers?

This issue is a feature request
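
Until such an option exists, one possible workaround is to fetch the page yourself with the headers you need and hand the result to the three-argument `WebPage(url, html, headers)` constructor that appears elsewhere in these issues. The sketch below uses the standard library's `urllib` instead of requests so that it is self-contained; the helper names are hypothetical.

```python
import urllib.request
from typing import Dict, Tuple


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(url: str, user_agent: str, timeout: float = 30.0) -> Tuple[str, Dict[str, str]]:
    """Return the decoded body and the response headers for url."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
        return html, dict(response.headers.items())


# Example (performs a real request, so it is left commented out):
# html, headers = fetch_page("https://example.com", "Mozilla/5.0 (compatible; my-scanner)")
# webpage = WebPage("https://example.com", html, headers)
```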

Outdated version of the package in PyPI

There is a gap between the version available on PyPI and the GitHub repository.
Could you please update PyPI with the latest version from the GitHub repository?
The last package version was released in September 2020.

AttributeError: 'list' object has no attribute 'split'

self = <Wappalyzer.Wappalyzer.Wappalyzer object at 0x7fe4d83caf70>, pattern = ['Abicart', 'Textalk Webshop']

    def _prepare_pattern(self, pattern):
        """
        Strip out key:value pairs from the pattern and compile the regular
        expression.
        """
        attrs = {}
>       pattern = pattern.split('\\;')
E       AttributeError: 'list' object has no attribute 'split'

/usr/local/lib/python3.8/site-packages/Wappalyzer/Wappalyzer.py:219: AttributeError

Last good wappalyzer commit 842f18aa86faaaf523e9b98e5419e213b40302ca

Broken wappalyzer commit e3bf786826318160f4016b206f5dcd9853c50da0

Broken line https://github.com/AliasIO/wappalyzer/commit/e3bf786826318160f4016b206f5dcd9853c50da0#diff-ad6b5aefcc8c0ba3d7c9f4c9ba096652f7bae1235363a40bba4759f3abb374c4R16897
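
The traceback shows `_prepare_pattern` receiving a list (`['Abicart', 'Textalk Webshop']`) where it expects a string. A minimal sketch of accepting both shapes is given below. The function `split_patterns` is hypothetical and only illustrates the normalization step; it is not the fix adopted by the library and it leaves regex compilation out.

```python
from typing import Dict, List, Tuple, Union


def split_patterns(pattern: Union[str, List[str]]) -> List[Tuple[str, Dict[str, str]]]:
    """Return (expression, attributes) pairs for a string or a list of strings."""
    patterns = [pattern] if isinstance(pattern, str) else list(pattern)
    prepared = []
    for item in patterns:
        # Attributes such as "version:\1" follow the expression, separated by "\;".
        expression, *attribute_parts = item.split("\\;")
        attributes = {}
        for part in attribute_parts:
            key, _, value = part.partition(":")
            attributes[key] = value
        prepared.append((expression, attributes))
    return prepared


print(split_patterns(["Abicart", "Textalk Webshop"]))
print(split_patterns("jquery\\;version:\\1"))
```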

Invalid regex in Wappalyzer/data/technologies.json: Symfony: html

The following code works with python3.9 but correctly warns about a bad regex in python3.11:

   from Wappalyzer import Wappalyzer, WebPage
   WPL = Wappalyzer.latest()
   webpage = WebPage.new_from_url(url)
   web_record = WPL.analyze_with_versions_and_categories(webpage)

Trying to run this with python3.11 on "http://yahoo.com" I get:

.../python3.11/site-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex:

['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
----------------------------------^^^ invalid?

The 'position 119' seems to be a delayed reaction to the core issue.

Indeed, it looks like the sub-regex [^]+ just before is invalid, since ^ is the negation/complement marker for a char-class, which is empty here.

The problem is in the data-file:
Wappalyzer/data/technologies.json (towards the end, technologies are alphabetically sorted)

The rule for "Symfony": "html": should be (one char change):

"html": "(?:<div class=\"sf-toolbar[^>]+?>[^<]+<span class=\"sf-toolbar-value\">([\\d.])+|<div id=\"sfwdt[^\"]+\" class=\"[^\"]*sf-toolbar)\\;version:\\1",
------------------------------------------^^^^ the fix

Fixed in this PR:
#80
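
The difference between the two patterns can be checked directly with Python's `re` module. The sketch below compiles the pattern from the warning and the one-character fix proposed above; the helper `compile_error` is only for illustration.

```python
import re
from typing import Optional

BROKEN = (
    r'(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\d.])+'
    r'|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)'
)
# The proposed fix: the empty negated class [^] becomes [^<].
FIXED = BROKEN.replace("[^]+", "[^<]+", 1)


def compile_error(pattern: str) -> Optional[str]:
    """Return the re.error message for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as err:
        return str(err)
    return None


print(compile_error(BROKEN))  # an error message
print(compile_error(FIXED))   # None
```

In Python, a `]` placed directly after `[^` is read as a literal member of the class, so the class runs on until the next `]` and swallows the opening parenthesis of the version group. That is why the error is only reported at the final `)`.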

UserWarning: Caught 'unbalanced parenthesis

/usr/local/lib/python3.8/dist-packages/Wappalyzer/Wappalyzer.py:226: UserWarning: Caught 'unbalanced parenthesis at position 119' compiling regex: ['(?:<div class="sf-toolbar[^>]+?>[^]+<span class="sf-toolbar-value">([\\d.])+|<div id="sfwdt[^"]+" class="[^"]*sf-toolbar)', 'version:\\1']
warnings.warn(
{'Azure CDN', 'Am

UnboundLocalError: local variable '_technologies_file' referenced before assignment

when i use update=True
wappalyzer = Wappalyzer.latest(update=True)

Could not download latest Wappalyzer technologies.json file because of error : '[Errno Extra data] 404: Not Found: 3'. Using default. 
  File "python-Wappalyzer/Wappalyzer/Wappalyzer.py", line 270, in latest
    logger.info("Using technologies.json file at {}".format(_technologies_file.as_posix()))
UnboundLocalError: local variable '_technologies_file' referenced before assignment
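
The traceback suggests that `_technologies_file` is only assigned when the download succeeds, so the fallback path reaches the log line with the name still unbound. A common way to avoid this class of error is to bind the default before the `try` block. The sketch below shows that pattern with hypothetical names; it is not the library's actual code.

```python
from pathlib import Path
from typing import Callable


def pick_technologies_file(default_file: Path, download: Callable[[], Path]) -> Path:
    """Return the downloaded file, or the default one if the download fails."""
    technologies_file = default_file  # bound before the try, so it always exists
    try:
        technologies_file = download()
    except Exception as err:
        print(f"Could not download latest technologies file: {err}. Using default.")
    print(f"Using technologies file at {technologies_file.as_posix()}")
    return technologies_file
```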
