
cmfieldguide's People

Contributors

jreijn, markusgiesen, sclarson, sggottlieb, stevenbrent

cmfieldguide's Issues

Put .wsgi file in source control

Let's get the .wsgi file into source control under an apache directory.

I want this so I can edit it and pull in server environment variables.

Recent Detections

I thought it would be nice to store the results in a database and show them as "Recent Detections" on the landing page.

We could also use "Recent Detections" for caching.

The logic would be...

If we have a high confidence of a site, we could list the site and the platform that we think it is. The list would be reverse chronological.

What do you think? Is this a "Beta" feature or a "1.0" feature?

Maybe we can float the idea by Karla and have her think it through visually?

Page vs. Url?

For consistency, what are we going to call HTML returned from a URL? A "page"? And the "url" is just the string that starts "http://..."

If so, then this seems wrong:

url_exists

Of course the URL exists -- it's just a text string, and it's passed in. Shouldn't it be "page_exists"?

Friends and Family Restriction

To restrict the application to "friends and family," I think we should just use basic auth in Apache and distribute a shared username and password. No need to clutter the application with temporary functionality.

What do you think?

Should we just be checking the site as a whole or the submitted URL with path?

This code is a problem:

if self.page_contains_pattern(self.get_url_stem(url) + '/wp-login.php', 'loginform'):

I tried it on deanebarker.net and it didn't work. Then I realized that my WordPress blog is at deanebarker.net/blog, which means the login page is deanebarker.net/blog/wp-login.php.

We need to account for CMS installs not at the root of the site. More generally, do we accept that if the submitter enters a URL with a path on it, that he wants to check for the CMS at that specific path, not necessarily for the site as a whole?
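
One possible approach, sketched under the assumption that we probe the submitted path first and then walk up toward the site root. `candidate_bases` is a hypothetical helper, not something in the current code, and it uses Python 3's `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_bases(url):
    """Return base URLs to probe, from the deepest path up to the site root.

    Hypothetical helper. A trailing segment that looks like a file name
    (it contains a dot) is dropped, since a CMS is installed in a directory.
    """
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split('/') if segment]
    if segments and '.' in segments[-1]:
        segments.pop()

    bases = []
    for depth in range(len(segments), -1, -1):
        path = '/' + '/'.join(segments[:depth])
        # rstrip turns the root path '/' into '' so no base ends in a slash.
        bases.append(
            urlunsplit((parts.scheme, parts.netloc, path.rstrip('/'), '', ''))
        )
    return bases
```

Each base could then be combined with '/wp-login.php' in turn, so a blog at deanebarker.net/blog is tried before the site root.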

Fix 'double-pump' of URL submission

When someone submits a URL, the app gets the page in a form validation process. Then it gets the page again when it starts the tests.

We should save the results of the first get.

Possibly we could run the tests as part of the clean method and pass the home page into the engine rather than the URL.

MODX Signature Invalid

Seth,

Your MODX Signature detection is invalid. There are no signatures output by MODX that are standard to an install. The only possible measure is the number of people using the default location of the manager directory, and since it can actually live outside the webroot and be renamed on install, this will not work. MODX.com uses cache-busting variables on the CSS in order to manage versions and ensure CSS changes get propagated to browsers, but this is a design decision and not a function of MODX.

We've had others wanting to collect similar data in the past, and Netcraft was only able to get approximate counts, as we don't have any tracing or metric data on individual websites. This is an architectural decision to obscure version and app version. In fact, it would be trivial to make MODX mimic many of the output- and path-derived signatures on this cmfieldguide.

Fix bug in has_matching_tag

BeautifulSoup has a strange flaw where the tag name argument of findAll() is "name." This is unfortunate because "name" is also a very common attribute (like in anchor and input). The result is that Beautiful Soup throws an exception because it is getting the name argument twice.

This bit me on the Joomla! signature, which was looking for an input tag with a specific name attribute. I temporarily fixed this by looking for another attribute.

The more permanent fix is to strip out a name attribute from the dictionary of attributes and then match the name attribute in a loop on the result set from the initial findAll query.

Build out reporting

Would be nice to make a page that shows data like identification rate and number of signatures over time.

How about a disclaimer, terms of use on the bottom?

Should we put something on the footer like a disclaimer and/or terms of use? I am thinking that this is for entertainment purposes only and that these are just guesses.

Should we reference the GitHub project?

Add GSI Commerce

Look for src=.imageg.net in style sheet links, javascript links, and image sources.

License?

What is it going to be? GPL? BSD? Apache?

EPiServer is returning a false positive on Ektron.com

The issue is that the test is not looking for a login form. Instead, it is just seeing if there is a page. The Ektron 404 page just happens to return a status code of 200. Go figure. I would check the contents of the page and see if there is something that looks like the EPiServer login.
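
A minimal sketch of that check, assuming the test already has the status code and the HTML in hand. `page_has_login_form` and its `marker` parameter are hypothetical; the actual string that identifies an EPiServer login page still needs to be worked out.

```python
def page_has_login_form(status_code, html, marker):
    """Return True only if the response is a 200 AND contains the marker.

    A 200 on its own proves nothing, because some sites (Ektron.com, per
    this issue) serve their 404 page with a 200 status code.
    """
    if status_code != 200:
        return False
    # Case-insensitive substring match; the marker itself is an assumption.
    return marker.lower() in html.lower()
```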

Adobe CQ

To determine whether Adobe CQ is being utilized you can test for the following paths in the requests or the HTML content:
/etc/designs/* (CSS and JS typically here)
/etc/*
/libs/cq/*
/libs/wcm/*
/content/dam/* (DAM images, videos, pdf's typically here)
/content/* (ambiguous, but pages/nodes are typically here)

You can test this on any of the GM sites (gm.com, cadillac, gmc, etc.)
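
The path list above could be checked against the HTML roughly like this. This is only a sketch: `find_cq_paths` is a hypothetical name, it only looks at src and href attributes, and it leaves out /content/* because the list itself calls that one ambiguous.

```python
import re

# Path prefixes taken from the list above. /etc/designs/ comes before /etc/
# in the alternation so the more specific prefix is the one reported.
CQ_PATH_PATTERN = re.compile(
    r'''(?:src|href)\s*=\s*["'](?:https?://[^/"']+)?'''
    r'''(/etc/designs/|/etc/|/libs/cq/|/libs/wcm/|/content/dam/)''',
    re.IGNORECASE,
)


def find_cq_paths(html):
    """Return the set of CQ-style path prefixes found in src/href values."""
    return {match.group(1).lower() for match in CQ_PATH_PATTERN.finditer(html)}
```

A signature could then scale its confidence by how many distinct prefixes come back, rather than treating any single hit as proof.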

Add is_dot_net method to BaseSignature

This method should return a 1 if we know for sure that the site is running .NET. If we are not sure, we return a 0. If we are sure it is not running .NET (don't know how), let's return a -1.

This can be used by Signatures for CMS that run on Java, PHP, or Python to quickly disqualify.

We discussed a couple of methods:

  • view_state
  • Resource.axd
  • ctl00_
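
A rough sketch of the method as a plain function over the page HTML. The marker strings are my assumptions about what each bullet refers to: the `__VIEWSTATE` hidden field, `WebResource.axd` script links, and the `ctl00_` prefix on generated control IDs.

```python
# Assumed Web Forms markers, one per bullet above.
DOT_NET_MARKERS = ('__VIEWSTATE', 'WebResource.axd', 'ctl00_')


def is_dot_net(html):
    """Return 1 if the page shows a Web Forms marker, otherwise 0.

    -1 ("definitely not .NET") is never returned here, because we do not
    yet know how to establish that.
    """
    for marker in DOT_NET_MARKERS:
        if marker in html:
            return 1
    return 0
```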

User Agent String

To be polite, we need a custom user agent string with some indications of what this request is doing. People might accuse us of assholery, otherwise.
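
Something along these lines, assuming Python 3's `urllib.request`. The product token, version, and info URL below are placeholders I made up, not agreed values.

```python
from urllib.request import Request

# Hypothetical user agent string: a product token plus a URL where a site
# owner can read what the request is for.
USER_AGENT = 'cmfieldguide/0.1 (+http://example.com/about-this-bot)'


def build_request(url):
    """Build a request that identifies itself with our custom user agent."""
    return Request(url, headers={'User-Agent': USER_AGENT})
```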

SilverStripe Signature

Go to /admin and check for a redirect to /Security/login?BackURL=%2Fadmin

Look for an input with id="MemberLoginForm_LoginForm_Email"

Clear out unused code

I don't think util.py or management are being used anymore. Could you verify and remove if not needed?

Local Page Cache

I wonder if we should locally cache retrieved URLs in a dictionary:

{ url, html }

Given a multiplying number of tests, we could be pulling the same pages over and over...
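
A minimal sketch of the idea. `PageCache` is a hypothetical name, and the fetch function is passed in so that the cache does not care how pages are actually retrieved.

```python
class PageCache:
    """In-memory cache mapping each URL to the HTML fetched for it."""

    def __init__(self, fetch):
        # fetch is any callable that takes a URL and returns its HTML.
        self._fetch = fetch
        self._pages = {}

    def get(self, url):
        """Return the HTML for url, fetching it only on the first request."""
        if url not in self._pages:
            self._pages[url] = self._fetch(url)
        return self._pages[url]
```

If one cache instance is shared by all signatures during a run, each page is pulled once no matter how many tests ask for it.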

Create page listing supported CMSes

It might provide some incentive to developers if you create a page listing all the supported CMSes that have signatures, with a description of each.

DotCMS

Pages end in a .dot extension. Look for it in the URL or, if the URL ends in '/', add 'index.dot' to the end and verify that it is the same page.

Octopress

To detect any Octopress site that doesn't have a completely custom theme (not sure if those are detectable), look for the JavaScript file: /javascripts/octopress.js

Implement Data Models

The application will have the following models

  • Site (This is the site that the visitor enters. We will store a row whenever a user requests a site, unless it has already been requested within the last 24 hours. In that case we will pull it from cache.)
    • url
    • html (save off the page that we pull down)
    • title (from the title tag)
    • response_code (the HTTP response code returned)
    • date_time (the date time that the visitor requested)
  • PlatformTestResult (this is the result of a test whether a site is running a platform)
    • site (FK to site)
    • platform (name of the platform from signature.name)
    • confidence
    • visitor_rejects (true if the visitor says "this is impossible")
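
To make the shape concrete, here is a sketch of the two models as plain dataclasses rather than real ORM models, so it runs on its own. The field types, the float confidence, and the `is_fresh` helper for the 24-hour rule are my assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Requests for the same site inside this window are served from cache.
CACHE_WINDOW = timedelta(hours=24)


@dataclass
class Site:
    url: str
    html: str            # the page that we pulled down
    title: str           # from the title tag
    response_code: int   # the HTTP response code returned
    date_time: datetime  # when the visitor requested the site


@dataclass
class PlatformTestResult:
    site: Site           # stands in for the foreign key to Site
    platform: str        # name of the platform, from signature.name
    confidence: float
    visitor_rejects: bool = False


def is_fresh(site, now):
    """Return True if the stored row is recent enough to reuse."""
    return now - site.date_time < CACHE_WINDOW
```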

has_css_link method on Page class

Seems like we could create a method called has_css_link which takes an argument for a pattern that is in the path.

For example:

has_css_link('/workarea', case_sensitive=False)

This would be more targeted than has_pattern. Perhaps we could use BeautifulSoup to parse the HTML and look for a link tag?
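
Here is a rough sketch of what that could look like. To keep it dependency-free it uses the standard library's `html.parser` instead of BeautifulSoup, and it is a plain function taking the HTML rather than a method on Page; both are choices made for the example only.

```python
from html.parser import HTMLParser


class _StylesheetCollector(HTMLParser):
    """Collect the href of every <link> tag whose rel includes 'stylesheet'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attributes = dict(attrs)
        rel_values = (attributes.get('rel') or '').lower().split()
        if 'stylesheet' in rel_values and attributes.get('href'):
            self.hrefs.append(attributes['href'])


def has_css_link(html, pattern, case_sensitive=True):
    """Return True if any stylesheet link's href contains the pattern."""
    collector = _StylesheetCollector()
    collector.feed(html)
    if not case_sensitive:
        pattern = pattern.lower()
        return any(pattern in href.lower() for href in collector.hrefs)
    return any(pattern in href for href in collector.hrefs)
```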

bug in has_matching_tag

The has_matching_tag method returns true if any of the attribute value regexes match. This gives a lot of false positives.

Should only return true if all of the attribute regexes match.

As a result, everything is coming back as an Ektron site because that is looking for:

{ 'rel': 'stylesheet', 'link': '/workarea' }

BTW, would be good to write unit tests for this.
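
A sketch of the "all must match" behaviour, written against the standard library's `html.parser` so it runs without BeautifulSoup. The signature shown here (HTML, tag name, dictionary of attribute regexes) and the case-insensitive matching are assumptions, not the existing method.

```python
import re
from html.parser import HTMLParser


class _TagMatcher(HTMLParser):
    """Flag a tag whose attributes satisfy every regex in attr_patterns."""

    def __init__(self, tag_name, attr_patterns):
        super().__init__()
        self.tag_name = tag_name
        self.attr_patterns = attr_patterns
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag_name:
            return
        values = dict(attrs)
        # all(), not any(): a missing attribute or a single failed regex
        # means this tag does not match.
        if all(
            values.get(name) is not None
            and re.search(pattern, values[name], re.IGNORECASE)
            for name, pattern in self.attr_patterns.items()
        ):
            self.found = True


def has_matching_tag(html, tag_name, attr_patterns):
    """Return True if some tag_name tag matches all attribute regexes."""
    matcher = _TagMatcher(tag_name, attr_patterns)
    matcher.feed(html)
    return matcher.found
```

A unit test for the regression would feed in a page with an ordinary stylesheet link and assert that the Ektron-style pattern does not match it.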

Test

This is a test.

Different approach to using is_dot_net_webforms and has_php_credits

I have been thinking about how is_dot_net_webforms and has_php_credits (to be implemented) skew the confidence score. What do you think about adding an overridable "STACK" or "LANGUAGE" or "TECHNOLOGY" constant to the BaseSignature?

On "run" we rule out the other stacks before going through any of the tests. For example,

Our drupal signature would have:

STACK = "PHP" 

Then in the run method, before the tests we would have:

if self.STACK != ".NET" and is_dot_net_webforms(url):
    #return a result that says something like "this cannot be %s because this site is built with .NET technology."

if self.STACK != 'PHP' and has_php_credits(url):
    #return a result that says something like "this cannot be %s because this site is built with PHP technology"

What do you think?
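
To make the proposal easier to discuss, here is a runnable sketch. The detector functions are passed in as a dictionary so the example needs no network access; that dictionary, the `run_tests` hook, and returning the message as a plain string are all assumptions made for illustration.

```python
class BaseSignature:
    name = 'Unknown'
    STACK = None  # subclasses override, e.g. 'PHP' or '.NET'

    def __init__(self, detectors):
        # detectors maps a stack name to a callable(url) -> bool, standing
        # in for is_dot_net_webforms and has_php_credits.
        self.detectors = detectors

    def run(self, url):
        """Rule out foreign stacks first, then fall through to the tests."""
        if self.STACK is not None:
            for stack, detected in self.detectors.items():
                if stack != self.STACK and detected(url):
                    return (
                        'This cannot be %s because this site is built '
                        'with %s technology.' % (self.name, stack)
                    )
        return self.run_tests(url)

    def run_tests(self, url):
        """Placeholder for the signature's real tests."""
        return None


class DrupalSignature(BaseSignature):
    name = 'Drupal'
    STACK = 'PHP'
```

A signature that leaves STACK as None is never ruled out this way, which keeps existing signatures working until they declare a stack.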

Signature for Websphere Commerce

Look for wcs/stores/servlet in any of the following:

The URL that the user submitted

The URL returned by geturl() on the urllib2 response, in case of a server-side redirect.

Improve branding

We could really use some UI work. Possibly Ajax submission. Also see the ticket about showing the web server info.

How do you overload a function Pythonically?

It would be nice if page_contains_pattern could take either a string (a single pattern) or a list (multiple patterns). If a list, it would test for any of the patterns. This would avoid multiple calls to the same method, if you have a number of patterns you want to check and you don't really care which one matches, just so long as one of them does.

I'm thinking of this while I'm writing this code:

    pattern = 'id="aspnetform"'
    if page_contains_pattern(url, pattern):
        result = True

    pattern = 'ctl00_'
    if page_contains_pattern(url, pattern):
        result = True

I don't care which one of these returns, so long as one of them does.

But, with no static typing, how do you do this? Do you just check inside the method to see if the argument is a string or a list? I know this would work, but I just want to make sure I do it Pythonically.
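
For what it's worth, checking the argument inside the method is a common way to do this in Python. A minimal sketch, assuming the function receives the page HTML rather than a URL (to keep the example self-contained) and that a pattern is a plain substring rather than a regex:

```python
def page_contains_pattern(html, patterns):
    """Return True if html contains the pattern, or any pattern in a list."""
    # A lone string is wrapped in a list so both cases share one code path.
    # The check is needed because a string is itself iterable, and looping
    # over it would test single characters.
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(pattern in html for pattern in patterns)
```

The two calls in the snippet above then collapse into one call with a two-item list.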
