
hashtheplanet's Issues

Reduce the size of the database

Currently, the database is 378MB. This seems huge considering that it theoretically only contains hashes, versions and filenames.

A quick investigation revealed that:

  • The database contains hashes for useless filenames. Some files are inside the IntelliJ .idea folder, and knowing their hash is not useful. The same goes for all files in test-related folders: a database query showed that these test files account for at least 40% of all files. Some files also end with the .php extension. There is no chance that these files will be useful for detecting a version, since a server normally executes them instead of serving their source. PHP files represent 50% of all files.
  • The versions are stored as JSON, which is quite verbose. All hashes seem to correspond to a continuous range of versions. Replacing the JSON field with two fields, 'initial_version' and 'last_version', could greatly reduce the size of this data. The range could then be computed using the version table. Ideally, the version table would hold ordered versions. By limiting it to numerical versions (the others are useless because Cyberwatch cannot find the associated CVEs), the ordering can be alphanumeric.
  • The hashes are stored as strings, using 64 bytes instead of 8 bytes. As there are 600,000 hashes, converting these strings to binary could save up to 33 MB.
  • The versions table seems useless, but I may be wrong. There are only 1600 entries, so this is not very important.
  • The name of the technology is stored in each row of the hash and file tables. Each of these entries uses 8 bytes. Adding a technology table and using small foreign keys (u32, for example) could save some space.
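
As a rough illustration of the last four points, here is a minimal Python sketch of a more compact layout. The table and column names (`technology`, `hash`, `digest`), the helper functions, and the sample data are all hypothetical and not taken from the project. The raw size of a digest depends on the hash function in use; the sketch only relies on the fact that a hexadecimal string decodes to half as many bytes.

```python
import hashlib
import sqlite3

# Hypothetical schema: technology names live in one table, and each hash row
# stores a small integer key, the raw digest and a version range.
SCHEMA = """
CREATE TABLE technology (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE hash (
    technology_id   INTEGER NOT NULL REFERENCES technology(id),
    digest          BLOB NOT NULL,
    initial_version TEXT NOT NULL,
    last_version    TEXT NOT NULL
);
"""


def version_range(ordered_versions, present):
    """Return (first, last) if `present` is a contiguous run, else None."""
    present = set(present)
    if not present:
        return None
    indexes = sorted(ordered_versions.index(v) for v in present)
    if indexes[-1] - indexes[0] + 1 != len(indexes):
        return None  # there is a gap, so a single range cannot describe it
    return ordered_versions[indexes[0]], ordered_versions[indexes[-1]]


def add_hash(conn, technology, hex_digest, versions):
    """Insert one hash row, creating the technology entry if needed."""
    conn.execute("INSERT OR IGNORE INTO technology (name) VALUES (?)", (technology,))
    (tech_id,) = conn.execute(
        "SELECT id FROM technology WHERE name = ?", (technology,)
    ).fetchone()
    conn.execute(
        "INSERT INTO hash VALUES (?, ?, ?, ?)",
        (tech_id, bytes.fromhex(hex_digest), *versions),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

ordered = ["5.0", "5.1", "5.2", "5.3"]  # made-up ordered version list
# Force a leading NUL byte to check that the BLOB column keeps it intact.
hex_digest = "00" + hashlib.sha256(b"example").hexdigest()[2:]
add_hash(conn, "WordPress", hex_digest, version_range(ordered, ["5.1", "5.2"]))

technology, digest, first, last = conn.execute(
    "SELECT t.name, h.digest, h.initial_version, h.last_version "
    "FROM hash h JOIN technology t ON t.id = h.technology_id"
).fetchone()
```

Hashes whose versions do not form a contiguous run make `version_range` return `None`; such rows would need a fallback, for example several ranges per hash.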

Action required

  • Filter filenames added to the database with some heuristic, removing useless files.
  • Store only numerical versions (or versions like '2.5.0-beta'), beginning with a number.
  • Replace the JSON with two small strings representing the range of versions in which this hash was present.
  • Convert hashes to binary, for instance using SQLite's BLOB type, which can hold NUL bytes.
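
The first two actions could be prototyped with filters along these lines. This is only a sketch: the excluded folder names, the excluded extension set, and the function names are assumptions chosen for illustration, and the real heuristic would need tuning against the actual repositories.

```python
import re
from pathlib import PurePosixPath

# Assumed exclusion lists; extend them after inspecting the real data.
EXCLUDED_DIRS = {".idea", "test", "tests"}
EXCLUDED_SUFFIXES = {".php"}

# Keep only versions that begin with a digit, such as "2.5.0-beta".
VERSION_RE = re.compile(r"^\d")


def is_useful_file(path: str) -> bool:
    """Return False for files that cannot help identify a version."""
    p = PurePosixPath(path)
    if p.suffix.lower() in EXCLUDED_SUFFIXES:
        return False
    # Look only at the directory components, not at the file name itself.
    return not any(part.lower() in EXCLUDED_DIRS for part in p.parts[:-1])


def is_useful_version(tag: str) -> bool:
    """Return True for tags that start with a number."""
    return bool(VERSION_RE.match(tag))
```

For example, `is_useful_file("tests/phpunit/data/a.js")` and `is_useful_version("psr12anchor")` both return `False`, while `is_useful_version("2.5.0-beta")` returns `True`.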

Create a task that releases the full database

The goal of the project is to provide a SQLite database containing the file hashes.

The goal of this issue is to release an up-to-date version of this database daily.

If the task takes a reasonable amount of time, use GitHub Actions to run it.

If the task takes too long, look into a script that can be run on a standalone server to deliver a new version once a day.

Inconsistency in database

While working on improving the htp module on Wapiti ( wapiti-scanner/wapiti#344 ), I noticed several inconsistencies in the hashtheplanet database.

What happens is that a version appears in the hash table but has no counterpart in the version table:

sqlite> select count(*) from hash where versions like "%4.0.0-alpha4x%" and technology = "WordPress";
202
sqlite> select count(*) from version where technology = "WordPress" and version = "4.0.0-alpha4x";
0

This is particularly true of the aforementioned version, which appears with a lot of hashes (I cut the output):

GET https://blog.logrocket.com/wp-includes/js/tinymce/license.txt (0) led to technology ('magento2', '"{\\"versions\\": [\\"2.3.0\\", \\"2.3.1\\", \\"2.3.2\\", \\"2.3.2-p2\\", \\"2.3.3\\", \\"2.3.3-p1\\", \\"2.3.4\\", \\"2.3.4-p2\\", \\"2.3.5\\", \\"2.3.5-p1\\", \\"2.3.5-p2\\", \\"2.3.6\\", \\"2.3.6-p1\\", \\"2.3.7\\", \\"2.3.7-p1\\", \\"2.3.7-p2\\", \\"2.3.7-p3\\", \\"2.3.7-p4\\", \\"2.4.0\\", \\"2.4.0-p1\\", \\"2.4.1\\", \\"2.4.1-p1\\", \\"2.4.2\\", \\"2.4.2-p1\\", \\"2.4.2-p2\\", \\"2.4.3\\", \\"2.4.3-p1\\", \\"2.4.3-p2\\", \\"2.4.3-p3\\", \\"2.4.4\\", \\"4.0.0-alpha1\\", \\"4.0.0-alpha10\\", \\"4.0.0-alpha11\\", \\"4.0.0-alpha12\\", \\"4.0.0-alpha2\\", \\"4.0.0-alpha3\\", \\"4.0.0-alpha4\\", \\"4.0.0-alpha4x\\"]}"')

GET https://blog.logrocket.com/wp-includes/js/mediaelement/mediaelementplayer.css (0) led to technology ('joomla-cms', '"{\\"versions\\": [\\"4.0.0-alpha4x\\"]}"')

GET https://blog.logrocket.com/wp-includes/sodium_compat/src/Core/Curve25519/README.md (0) led to technology ('WordPress', '"{\\"versions\\": [\\"5.2\\", \\"3.10.0\\", \\"3.10.0-alpha1\\", \\"3.10.0-alpha2\\",  \\"4.0.0\\", \\"4.0.0-alpha1\\", \\"4.0.0-alpha10\\", \\"4.0.0-alpha11\\", \\"4.0.0-alpha12\\", \\"4.0.0-alpha2\\", \\"4.0.0-alpha3\\", \\"4.0.0-alpha4\\", \\"4.0.0-alpha4x\\", \\"4.0.0-alpha5\\", \\"4.0.0-alpha6\\", \\"psr12anchor\\"]}"')


GET https://blog.logrocket.com/wp-content/themes/twentytwentytwo/templates/blank.html (0) led to technology ('underscore', '"{\\"versions\\": [\\"1.12.1\\", \\"1.13.0-0\\", \\"1.13.0-2\\", \\"1.13.0-1\\", \\"8.0-alpha10\\", \\"8.0-alpha11\\", \\"8.0-alpha12\\", \\"8.0-alpha13\\", \\"8.0-alpha2\\", \\"8.0-alpha3\\", \\"8.0-alpha4\\", \\"8.0-alpha5\\", \\"8.0-alpha6\\", \\"8.0-alpha7\\", \\"8.0-alpha8\\",  \\"4.0.0\\", \\"4.0.0-alpha1\\", \\"4.0.0-alpha10\\", \\"4.0.0-alpha11\\", \\"4.0.0-alpha12\\", \\"4.0.0-alpha2\\", \\"4.0.0-alpha3\\", \\"4.0.0-alpha4\\", \\"4.0.0-alpha4x\\", \\"4.0.0-alpha5\\", \\"4.0.0-alpha6\\", \\"4.0.0-alpha7\\", \\"4.0.0-alpha8\\", \\"4.0.0-alpha9\\", \\"4.0.0-beta\\", \\"4.0.0-beta2\\", \\"4.0.0-beta3\\", \\"4.0.0-beta4\\", \\"4.0.0-beta5\\", \\"4.0.0-beta6\\", \\"4.0.0-beta7\\", \\"4.0.0-rc1\\", \\"psr12anchor\\", \\"psr12final\\", \\"search1\\"]}"')

Only the joomla-cms entry is relevant because that tag is specific to Joomla: https://github.com/joomla/joomla-cms/releases/tag/4.0.0-alpha4x

It is the same problem with the tags psr12anchor and psr12final, and certainly others.
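
To list every such tag rather than checking them one by one, a script along these lines might help. It is a sketch under assumptions: it uses the `hash` and `version` tables and the `technology`, `versions` and `version` columns seen in the queries above, it assumes the `versions` field holds a JSON object with a `versions` list (possibly JSON-encoded twice, as the escaped output suggests), and the function names and sample rows are made up.

```python
import json
import sqlite3
from collections import Counter


def decode_versions(raw):
    """Decode the versions field, tolerating a doubly JSON-encoded value."""
    data = json.loads(raw)
    if isinstance(data, str):  # the value was JSON-encoded twice
        data = json.loads(data)
    return data["versions"]


def orphan_versions(conn):
    """Count hash rows per (technology, version) missing from the version table."""
    known = set(conn.execute("SELECT technology, version FROM version"))
    orphans = Counter()
    for technology, raw in conn.execute("SELECT technology, versions FROM hash"):
        for version in decode_versions(raw):
            if (technology, version) not in known:
                orphans[(technology, version)] += 1
    return orphans


# Tiny in-memory stand-in for the real database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE version (technology TEXT, version TEXT);
CREATE TABLE hash (technology TEXT, hash TEXT, versions TEXT);
""")
conn.execute("INSERT INTO version VALUES ('WordPress', '5.2')")
conn.execute(
    "INSERT INTO hash VALUES (?, ?, ?)",
    ("WordPress", "abc", json.dumps({"versions": ["5.2", "4.0.0-alpha4x"]})),
)

report = orphan_versions(conn)
for (technology, version), count in report.most_common():
    print(f"{technology} {version}: {count} hash row(s)")
```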

Also, some hashes should perhaps be blacklisted because they match files that can be found in a lot of software, such as (from the previous output):

  • a file with a single empty line (blank.html)
  • the default LGPL licence file
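
Candidates for such a blacklist could be found by looking for hashes attached to several technologies. The sketch below assumes the `hash` table has a column named `hash` holding the digest, which is a guess; the function name, the threshold, and the sample rows are hypothetical.

```python
import sqlite3


def shared_hashes(conn, min_technologies=2):
    """Return (hash, technology_count) pairs for hashes seen in several technologies."""
    query = """
        SELECT hash, COUNT(DISTINCT technology) AS n
        FROM hash
        GROUP BY hash
        HAVING COUNT(DISTINCT technology) >= ?
        ORDER BY n DESC, hash
    """
    return conn.execute(query, (min_technologies,)).fetchall()


# Tiny in-memory stand-in for the real database, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hash (technology TEXT, hash TEXT, versions TEXT)")
conn.executemany(
    "INSERT INTO hash VALUES (?, ?, '')",
    [
        ("WordPress", "aaa"),
        ("underscore", "aaa"),
        ("magento2", "aaa"),
        ("WordPress", "bbb"),
    ],
)

candidates = shared_hashes(conn)
```

The result is only a list of candidates: a file legitimately bundled by two related projects would also show up, so each entry would still need a manual review.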

Those invalid version numbers certainly have an impact on the database size (issue #28).
