sggottlieb / cmfieldguide
Service to identify the CMS behind a website.
License: The Unlicense
Let's get the .wsgi file into source control under an apache directory.
I want this so I can edit it and pull in server environment variables.
I thought it would be nice to store the results in a database and show them as "Recent Detections" on the landing page.
We could also use "Recent Detections" for caching.
The logic would be...
If we have a high confidence of a site, we could list the site and the platform that we think it is. The list would be reverse chronological.
What do you think? Is this a "Beta" feature or a "1.0" feature?
Maybe we can float the idea by Karla and have her think it through visually?
For consistency, what are we going to call HTML returned from a URL? A "page"? And the "url" is just the string that starts "http://..."
If so, then this seems wrong:
url_exists
Of course the URL exists -- it's just a text string, and it's passed in. Shouldn't it be "page_exists"?
To restrict the application to "friends and family," I think we should just use basic auth in Apache and distribute a shared username and password. No need to clutter the application with temporary functionality.
What do you think?
This code is a problem:
if self.page_contains_pattern(self.get_url_stem(url) + '/wp-login.php', 'loginform'):
I tried it on deanebarker.net and it didn't work. Then I realized that my WordPress blog is at deanebarker.net/blog, which means the login page is deanebarker.net/blog/wp-login.php.
We need to account for CMS installs not at the root of the site. More generally, do we accept that if the submitter enters a URL with a path on it, that he wants to check for the CMS at that specific path, not necessarily for the site as a whole?
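One way to handle installs below the site root is to try the login page at every directory level of the submitted URL, deepest first. The sketch below is only an illustration: `candidate_bases` and `login_candidates` are hypothetical helper names, it is written for Python 3 (`urllib.parse`), and treating a last segment with a dot in it as a file name is a heuristic assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_bases(url):
    """Return possible CMS base URLs, from the deepest path up to the site root."""
    parts = urlsplit(url)
    # Keep only non-empty path segments, e.g. "/blog/2011/" -> ["blog", "2011"].
    segments = [s for s in parts.path.split("/") if s]
    # Heuristic: a trailing segment with a dot (e.g. "index.php") is a file, not a directory.
    if segments and "." in segments[-1] and not parts.path.endswith("/"):
        segments.pop()
    bases = []
    for depth in range(len(segments), -1, -1):
        path = "/" + "/".join(segments[:depth])
        bases.append(urlunsplit((parts.scheme, parts.netloc, path.rstrip("/"), "", "")))
    return bases


def login_candidates(url, suffix="/wp-login.php"):
    """Return the login URLs to probe, most specific path first."""
    return [base + suffix for base in candidate_bases(url)]
```

With this, submitting http://deanebarker.net/blog would probe /blog/wp-login.php before falling back to /wp-login.php at the root. Whether a match at a sub-path counts for the site as a whole is still the open question above.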
When someone submits a URL, the app gets the page in a form validation process. Then it gets the page again when it starts the tests.
We should save the results of the first get.
Possibly we could run the tests as part of the clean method and pass the home page into the engine rather than the URL.
This method will use the expose_php easter egg (credits page) to tell if a page is definitely being served by PHP.
Here is the information behind:
Seth,
Your MODX Signature detection is invalid. There are no signatures output by MODX that are standard to an install. The only possible measure is the number of people using the default location of the manager directory, and since it can actually live outside the webroot and be renamed on install, this will not work. MODX.com uses cache-busting variables on the CSS in order to manage versions and ensure CSS changes get propagated to browsers, but this is a design decision and not a function of MODX.
We've had others wanting to collect similar data in the past, and Netcraft was only able to get approximate counts, as we don't have any tracing or metric data on individual websites. This is an architectural decision to obscure version and app version. In fact, it would be trivial to make MODX mimic many of the output- and path-derived sigs on this cmfieldguide.
BeautifulSoup has a strange flaw where the tag name argument of findAll() is "name." This is unfortunate because "name" is also a very common attribute (like in anchor and input). The result is that Beautiful Soup throws an exception because it is getting the name argument twice.
This bit me on the Joomla! signature, which was looking for an input tag with a specific name attribute. I temporarily fixed this by looking for another attribute.
The more permanent fix is to strip out a name attribute from the dictionary of attributes and then match the name attribute in a loop on the result set from the initial findAll query.
Would be nice to make a page that shows data like identification rate and number of signatures over time.
Should we put something on the footer like a disclaimer and/or terms of use? I am thinking that this is for entertainment purposes only and that these are just guesses.
Should we reference the github project?
Look for style sheets and image sources with demandware.edgesuite.net
I can do this on the Wiki
Look for src=.imageg.net in style sheet links, javascript links, and image sources.
What is it going to be? GPL? BSD? Apache?
The issue is that the test is not looking for a login form. Instead, it is just seeing if there is a page. The Ektron 404 page just happens to return a status code of 200. Go figure. I would check the contents of the page and see if there is something that looks like the EPiServer login.
To determine whether Adobe CQ is being utilized you can test for the following paths in the requests or the HTML content:
/etc/designs/* (CSS and JS typically here)
/etc/*
/libs/cq/*
/libs/wcm/*
/content/dam/* (DAM images, videos, pdf's typically here)
/content/* (ambiguous, but pages/nodes are typically here)
You can test this on any of the GM sites (gm.com, cadillac, gmc, etc.)
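A rough sketch of how the path list above could be checked against a page's HTML. It only looks at `src` and `href` attribute values; `cq_path_hits` is a hypothetical name, and splitting the prefixes into "strong" and "weak" evidence is an assumption based on the note that /content/* is ambiguous.

```python
import re

# Path prefixes from the list above; the broad ones are treated as weak evidence.
STRONG_CQ_PATHS = ("/etc/designs/", "/libs/cq/", "/libs/wcm/", "/content/dam/")
WEAK_CQ_PATHS = ("/etc/", "/content/")

# Pull the values of src="..." and href="..." attributes out of raw HTML.
ATTR_URL = re.compile(r"""(?:src|href)\s*=\s*["']([^"']+)["']""", re.IGNORECASE)


def cq_path_hits(html):
    """Return (strong, weak) sets of CQ-style path prefixes referenced by the page."""
    strong, weak = set(), set()
    for value in ATTR_URL.findall(html):
        for prefix in STRONG_CQ_PATHS:
            if prefix in value:
                strong.add(prefix)
        for prefix in WEAK_CQ_PATHS:
            if prefix in value:
                weak.add(prefix)
    return strong, weak
```

Note that a strong hit such as /etc/designs/ also registers the weak /etc/ prefix, so a signature would probably want to score on the strong set and use the weak set only as a tie-breaker.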
This method should return a 1 if we know for sure that the site is running .NET. If we are not sure, we return a 0. If we are sure it is not running .NET (don't know how), let's return a -1.
This can be used by Signatures for CMS that run on Java, PHP, or Python to quickly disqualify.
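A minimal sketch of the 1 / 0 / -1 contract described above. The function name `dot_net_score` and the choice of evidence are assumptions: the `X-AspNet-Version` and `X-Powered-By: ASP.NET` response headers and the hidden `__VIEWSTATE` field count as "sure", and a PHP `X-Powered-By` header counts as "sure it is not .NET". Headers can be stripped or rewritten by a proxy, so even the "sure" cases are only as good as the server's honesty.

```python
def dot_net_score(html, headers):
    """Return 1 (.NET detected), -1 (another stack detected) or 0 (unknown)."""
    # Normalize header names so lookups are case-insensitive.
    lowered = {k.lower(): v for k, v in headers.items()}
    powered_by = lowered.get("x-powered-by", "").lower()

    if "x-aspnet-version" in lowered or "asp.net" in powered_by:
        return 1
    # WebForms pages carry a hidden __VIEWSTATE field.
    if "__VIEWSTATE" in html:
        return 1
    if "php" in powered_by:
        return -1
    return 0
```

A signature for a Java, PHP, or Python CMS could then bail out early whenever this returns 1.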
We discussed a couple of methods:
Add test for /~/.../*.asmx
Look for ?css= at the root of the site.
[base-url]/.magnolia
This will either return the login form, for instance:
http://www.magnolia-cms.com/.magnolia
Or return a 403: http://www.mbc.net/.magnolia
Or redirect to the homepage: http://www.navy.com/.magnolia
...but will not return a 404.
Another dead giveaway is if the source for pages contains:
/.imaging/stk/
/resources/templating-kit/
And for older versions:
/magnoliaAssets/
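The two Magnolia checks above could be sketched like this. The function names are hypothetical, and the probe check assumes the caller has already requested [base-url]/.magnolia without following redirects and passes in the status code it got back.

```python
MAGNOLIA_SOURCE_MARKERS = (
    "/.imaging/stk/",
    "/resources/templating-kit/",
    "/magnoliaAssets/",  # older versions
)


def magnolia_probe_passes(status_code):
    """True for the three outcomes described above: login form, 403, or redirect."""
    return status_code in (200, 403) or 300 <= status_code < 400


def magnolia_source_markers(html):
    """Return the giveaway paths that appear in the page source."""
    return [marker for marker in MAGNOLIA_SOURCE_MARKERS if marker in html]
```

The probe on its own is weak evidence, since a site that answers 200 for every path would pass it too, so it is probably best combined with the source markers.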
For Hippo CMS a signature is still needed.
Should return a dictionary of headers.
You can use httplib2. See here:
http://stackoverflow.com/questions/843392/python-get-http-headers-from-urllib-call
I have added httplib2 to the req.txt
To be polite, we need a custom user agent string with some indications of what this request is doing. People might accuse us of assholery, otherwise.
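Something along these lines would do it, sketched with Python 3's `urllib.request`. The user agent string is a placeholder: the version number and the info URL are made up and would need to point at a real page explaining the service.

```python
from urllib.request import Request

# Hypothetical values: the product token and info URL are placeholders.
USER_AGENT = "cmfieldguide/0.1 (CMS detection service; +http://example.com/about)"


def build_request(url):
    """Create a request that identifies itself instead of using the default agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})
```

The returned object can be passed to `urllib.request.urlopen` wherever the code currently opens a bare URL.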
go to /admin and get redirected to /Security/login?BackURL=%2Fadmin
Look for an input with id="MemberLoginForm_LoginForm_Email"
I don't think util.py or management are being used anymore. Could you verify and remove if not needed?
Look for "prebuilt" in either script or style sheet paths.
Look for a header of "X-Umbraco-Version"
I wonder if we should locally cache retrieved URLs in a dictionary:
{ url, html }
Given a multiplying number of tests, we could be pulling the same pages over and over...
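A small sketch of the dictionary cache idea. `PageCache` is a hypothetical name, and the fetch function is passed in so the cache does not care how pages are actually retrieved.

```python
class PageCache:
    """Remember fetched pages so each URL is retrieved at most once per run."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: url -> html
        self._pages = {}     # url -> html

    def get(self, url):
        # Only hit the network the first time a URL is requested.
        if url not in self._pages:
            self._pages[url] = self._fetch(url)
        return self._pages[url]
```

One cache per submitted URL is probably enough; keeping it alive across submissions would raise the question of when entries go stale.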
Should not look for the word Umbraco on the login page. A 404 page may contain the word Umbraco because it is part of the URL.
It might provide some incentive to developers if you create a page listing all the supported CMSes with signatures and a description of each.
Seems like Ektron sites have CSS in /Workarea
Or create another page.
look for css?v={number}
Perhaps friends at Molecular or Siteworx could help.
Pages end in a .dot extension. Look for it in the URL or, if the URL ends in '/', add an 'index.dot' to the end and verify that it is the same page.
To detect any Octopress site that doesn't have a completely custom theme (not sure if those are detectable), look for the JavaScript file: /javascripts/octopress.js
The application will have the following models
Seems like we could create a method called has_css_link which takes an argument for a pattern that is in the path.
For example:
has_css_link('/workarea', case_sensitive=False)
This would be more targeted than has_pattern. Perhaps we could use BeautifulSoup to parse the HTML and look for a link tag?
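A sketch of what has_css_link might look like. To keep it dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup, and it takes the page HTML as its first argument, which is an assumption about how the method would be wired up.

```python
from html.parser import HTMLParser


class _StylesheetCollector(HTMLParser):
    """Collect the href of every <link rel="stylesheet"> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens; valueless attributes are None.
        rel = (attributes.get("rel") or "").lower().split()
        if "stylesheet" in rel and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def has_css_link(html, pattern, case_sensitive=True):
    """True if any stylesheet link's href contains the pattern."""
    collector = _StylesheetCollector()
    collector.feed(html)
    if not case_sensitive:
        pattern = pattern.lower()
        return any(pattern in href.lower() for href in collector.hrefs)
    return any(pattern in href for href in collector.hrefs)
```

Because only stylesheet hrefs are inspected, the word appearing in body text or in the URL of a 404 page would not count as a match.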
There has to be something.
Tell me about /signatures/__init__.py.
When is this code run?
The has_matching_tag method returns true if any of the attribute value regexes match. This gives a lot of false positives.
Should only return true if all of the attribute regexes match.
As a result, everything is coming back as an Ektron site because that is looking for:
{ 'rel': 'stylesheet', 'link': '/workarea' }
BTW, would be good to write unit tests for this.
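To illustrate the "all, not any" behavior, here is a sketch built on the standard library's `html.parser`, not the project's BeautifulSoup-based code. The signature (HTML, tag name, dictionary of attribute regexes) is an assumption, and the example uses `href` as the attribute holding the path.

```python
import re
from html.parser import HTMLParser


class _TagMatcher(HTMLParser):
    """Record whether any tag of the given name satisfies every attribute regex."""

    def __init__(self, tag, attr_patterns):
        super().__init__()
        self.tag = tag
        self.attr_patterns = attr_patterns
        self.matched = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag or self.matched:
            return
        attributes = dict(attrs)
        # all(): every pattern must match; any() here would reproduce the bug.
        self.matched = all(
            re.search(pattern, attributes.get(name) or "", re.IGNORECASE)
            for name, pattern in self.attr_patterns.items()
        )


def has_matching_tag(html, tag, attr_patterns):
    matcher = _TagMatcher(tag, attr_patterns)
    matcher.feed(html)
    return matcher.matched
```

A first unit test could simply assert that an ordinary stylesheet link does not match `{'rel': 'stylesheet', 'href': '/workarea'}` while one under /WorkArea does.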
This is a test.
I have been thinking about how is_dot_net_webforms and has_php_credits (to be implemented) skew the confidence score. What do you think about adding an overridable "STACK" or "LANGUAGE" or "TECHNOLOGY" constant to BaseSignature?
On "run" we rule out the other stacks before going through any of the tests. For example,
Our drupal signature would have:
STACK = "PHP"
Then in the run method, before the tests we would have:
if self.STACK != ".NET" and is_dot_net_webforms(url):
    # return a result that says something like "this cannot be %s because this site is built with .NET technology."
if self.STACK != 'PHP' and has_php_credits(url):
    # return a result that says something like "this cannot be %s because this site is built with PHP technology"
What do you think?
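A runnable sketch of the STACK idea, under some assumptions: the detectors are passed in as a dictionary of stack name to callable so the example runs without network access, results are plain strings, and `NAME` and `run_tests` are hypothetical stand-ins for whatever BaseSignature really uses.

```python
class BaseSignature:
    NAME = "Unknown"
    STACK = None  # Subclasses override, e.g. "PHP" or ".NET".

    def __init__(self, stack_detectors):
        # Hypothetical wiring: maps a stack name to a callable url -> bool.
        self.stack_detectors = stack_detectors

    def run(self, url):
        # Rule out this signature if the site is known to run a different stack.
        for stack, detected in self.stack_detectors.items():
            if self.STACK is not None and self.STACK != stack and detected(url):
                return "This cannot be %s because this site is built with %s technology." % (
                    self.NAME, stack)
        return self.run_tests(url)

    def run_tests(self, url):
        # Placeholder for the real signature tests.
        return "Ran %s tests" % self.NAME


class DrupalSignature(BaseSignature):
    NAME = "Drupal"
    STACK = "PHP"
```

Leaving STACK as None keeps a signature from ever being ruled out, which seems like the safe default for a CMS that can run on more than one platform.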
Look for wcs/stores/servlet in any of the following:
The URL that the user submitted
The result of geturl() from urllib2, in the case of a server-side redirect.
Look for
Style%20Library
We could really use some ui work. Possibly Ajax submission. Also see ticket about showing the web server info.
It would be nice if page_contains_pattern could take either a string (a single pattern) or a list (multiple patterns). If a list, it would test for any of the patterns. This would avoid multiple calls to the same method, if you have a number of patterns you want to check and you don't really care which one matches, just so long as one of them does.
I'm thinking of this while I'm writing this code:
pattern = 'id="aspnetform"'
if page_contains_pattern(url, pattern):
    result = True
pattern = 'ctl00_'
if page_contains_pattern(url, pattern):
    result = True
I don't care which one of these returns, so long as one of them does.
But, with no static typing, how do you do this? Do you just check inside the method to see if the argument is a string or a list? I know this would work, but I just want to make sure I do it Pythonically.
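Checking the type inside the function is a common way to do this in Python. A minimal sketch, assuming the function receives the page HTML instead of the URL, and written for Python 3 (on Python 2 the check would be against `basestring`):

```python
def page_contains_pattern(html, patterns):
    """Return True if any pattern occurs in html; accepts one string or an iterable."""
    # A lone string is wrapped in a list so both cases share one code path.
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(pattern in html for pattern in patterns)
```

The string check matters because a string is itself iterable: without it, a single pattern would be tested one character at a time.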
Say the web server and whether or not the site is built in PHP or .NET.
Idea came from @spilth. Assigning to @deanebarker because should be incorporated into larger design initiative.