swanandx / lemmeknow

The fastest way to identify anything!

Home Page: https://docs.rs/lemmeknow/

License: MIT License

Rust 100.00%

cli cryptography cybersecurity pywhat regex rust rust-crate rust-lang

lemmeknow's Introduction

Hi there 👋, I'm Swanand (swanandx)

Hacker / Developer

struct Swanand;

impl Swanand {
    fn whoami() -> &'static str {
        "Hey 👋, I'm swanandx and I'm a hacker who loves to build cool stuff"
    }
}

impl Developer for Swanand {
    fn code() -> &'static str {
        r#"I build _blazingly fast_ tools for cyber security,
           create games, websites, apps etc.

           Languages - Rust, Python, C, C++, JS/TS, Assembly/WebAssembly
           Technologies - Actix, Yew, React, GoDot, Flutter
        "#
    }
}

impl Hacker for Swanand {
    fn hack() -> &'static str {
        r#"I love to hack on TryHackMe and play various CTFs.
           Apart from that, I make software to help others!
           
           RE & PWN is <3
        "#
    }
}

lemmeknow's Issues

Add support for filtering output

We have entries for Rarity and Tags in the database. We need to implement a filter so that users can filter results based on rarity and/or tags.

For example, lemmeknow --rarity 0.2:0.6 --tags Credentials TEXT.

Making it a module would be a nice idea 😄, e.g. src/output/filter.rs
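A minimal sketch of what such a filter module could look like, using only the standard library. The `Entry` and `Filter` types, their field names, and the `parse_rarity` and `apply` helpers are hypothetical names chosen for illustration, not lemmeknow's actual API. The sketch also assumes that an empty tag list means "do not filter by tag".

```rust
/// Hypothetical record type. The fields mirror the "Rarity" and "Tags" keys
/// in regex.json, but this is not lemmeknow's real struct.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub name: String,
    pub rarity: f32,
    pub tags: Vec<String>,
}

/// Hypothetical filter options built from `--rarity` and `--tags`.
#[derive(Debug, Clone, PartialEq)]
pub struct Filter {
    pub min_rarity: f32,
    pub max_rarity: f32,
    pub tags: Vec<String>,
}

/// Parses a `min:max` range such as `0.2:0.6`.
/// Returns `None` for malformed input or when min is greater than max.
pub fn parse_rarity(range: &str) -> Option<(f32, f32)> {
    let (min, max) = range.split_once(':')?;
    let min = min.trim().parse::<f32>().ok()?;
    let max = max.trim().parse::<f32>().ok()?;
    if min <= max {
        Some((min, max))
    } else {
        None
    }
}

impl Filter {
    /// An entry passes when its rarity lies inside the inclusive range and,
    /// if any tags were requested, it carries at least one of them.
    pub fn matches(&self, entry: &Entry) -> bool {
        let rarity_ok =
            entry.rarity >= self.min_rarity && entry.rarity <= self.max_rarity;
        let tags_ok = self.tags.is_empty()
            || self.tags.iter().any(|t| entry.tags.iter().any(|e| e == t));
        rarity_ok && tags_ok
    }
}

/// Keeps only the entries accepted by `filter`, borrowing from the input.
pub fn apply<'a>(filter: &Filter, entries: &'a [Entry]) -> Vec<&'a Entry> {
    entries.iter().filter(|e| filter.matches(e)).collect()
}
```

With this shape, `--rarity 0.2:0.6 --tags Credentials` would become a `Filter` with bounds from `parse_rarity("0.2:0.6")` and `tags` set to `["Credentials"]`.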

Phone number regex problem

Detecting a phone number for any country is no easy task. Some Python libraries do a great job of it using a mix of regex and machine learning. The current implementation in lemmeknow produces a lot of obvious false positives:

rewrite regex without using look-around

The following regexes won't compile, as the regex crate doesn't support look-around.

If possible, can we rewrite them in such a way that they don't use look-around?

Update them in src/data/regex.json

PS: You can run the following command in the repo if you want to see the exact syntax error for each regex (or just use the regex crate to compile them one by one):

cargo test validate_regex_examples -- --show-output

Use languages that compile to RegEx for regexes

Would it make sense to use languages like pomsky and Melody to compile the regexes? I think it would make the list of regexes easier to maintain, make it easier to contribute new regexes, and probably also make them easier to test and to read.

calculate min and max length for regex, if any

A TryHackMe flag has a minimum length of 3 (as it must have thm in it) and no maximum.

A YouTube video ID is 11 characters long (n92YrzELBJU).

We need a list with this information for every regex.

if there is no fixed length, put *
if min but no max, use 3-*
can only be 8 or 10, use 8/10
exact length 11, use 11

Many regexes identify text that has a fixed size range; if we filter on length first, we might optimize our algorithm.
Suggestions for other ways to optimize are welcome.
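The four notations listed above can be sketched as a small parser plus a length check that runs before any regex is tried. The `LengthSpec` type and its methods are hypothetical names, and counting length in characters rather than bytes is an assumption.

```rust
/// Hypothetical representation of the length notations from this issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LengthSpec {
    /// `*`: no fixed length.
    Any,
    /// `3-*`: a minimum but no maximum.
    AtLeast(usize),
    /// `8/10`: only these exact lengths.
    OneOf(Vec<usize>),
    /// `11`: exactly this length.
    Exact(usize),
}

impl LengthSpec {
    /// Parses one of the four notations; returns `None` for anything else.
    pub fn parse(spec: &str) -> Option<LengthSpec> {
        let spec = spec.trim();
        if spec == "*" {
            return Some(LengthSpec::Any);
        }
        if let Some(min) = spec.strip_suffix("-*") {
            return min.parse::<usize>().ok().map(LengthSpec::AtLeast);
        }
        if spec.contains('/') {
            // Collecting into Option<Vec<_>> fails if any part is not a number.
            let lengths: Option<Vec<usize>> =
                spec.split('/').map(|n| n.parse::<usize>().ok()).collect();
            return lengths.map(LengthSpec::OneOf);
        }
        spec.parse::<usize>().ok().map(LengthSpec::Exact)
    }

    /// Cheap pre-check: can text of length `len` possibly match?
    pub fn allows(&self, len: usize) -> bool {
        match self {
            LengthSpec::Any => true,
            LengthSpec::AtLeast(min) => len >= *min,
            LengthSpec::OneOf(lengths) => lengths.contains(&len),
            LengthSpec::Exact(n) => len == *n,
        }
    }
}
```

A caller would compute `text.chars().count()` once and skip every regex whose `LengthSpec` rejects that length. Whether this beats simply running the regexes would need benchmarking.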

typo in regex.json

lemmeknow/src/data/regex.json

Line 2242 in 79bbb92

 "Exploit": "There is a change this could be a Google Maps API key, so could try using 'gmapapiscanner'[1] or 'gap'[2]\nto check which Google Maps service it is valid for and generate a PoC that you can show in your report. To\nget a better understanding on the severity of having the Google Maps API key exposed, make sure to to to\nread \"Unauthorized Google Maps API Key Usage Cases, and Why You Need to Care\"[3] written by Ozgur Alp (@ozguralp)\n\nReferences:\n [1] https://github.com/ozguralp/gmapsapiscanner\n [2] https://github.com/joanbono/gap\n [3] https://ozguralp.medium.com/unauthorized-google-maps-api-key-usage-cases-and-why-you-need-to-care-1ccb28bf21e\n\nAPI Documentation: https://developers.google.com/maps/documentation/javascript/get-api-key", 

change → chance

There are no unit tests

I am debugging whether my program is broken or LemmeKnow is broken. There are no unit tests in LemmeKnow, so I cannot prove it works. Please add unit tests for the API :)

Use `&str` for tags

    /// Only include the Data which have at least one of the specified `tags`
    pub tags: Vec<String>,
    /// Only include Data which doesn't have any of the `excluded_tags`
    pub exclude_tags: Vec<String>,

String is a growable buffer; we do not expect these to grow, so we should use &str instead.
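A rough sketch of how the borrowed variant might look. `TagFilter` and `allows` are hypothetical names, and treating an empty `tags` list as "include everything" is an assumption. Borrowing adds a lifetime parameter to the struct, so the tag strings must outlive the filter.

```rust
/// Hypothetical borrowed version of the two fields quoted above.
#[derive(Debug, Clone, Default)]
pub struct TagFilter<'a> {
    /// Only include the Data which have at least one of the specified `tags`
    pub tags: Vec<&'a str>,
    /// Only include Data which doesn't have any of the `exclude_tags`
    pub exclude_tags: Vec<&'a str>,
}

impl<'a> TagFilter<'a> {
    /// Returns true when `data_tags` passes both the include and exclude lists.
    pub fn allows(&self, data_tags: &[&str]) -> bool {
        let included =
            self.tags.is_empty() || self.tags.iter().any(|t| data_tags.contains(t));
        let excluded = self.exclude_tags.iter().any(|t| data_tags.contains(t));
        included && !excluded
    }
}
```

String literals and slices of already-loaded data can then be passed without allocating, for example `TagFilter { tags: vec!["Credentials"], exclude_tags: vec![] }`.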

AWS EC2 Instance ID regex problem

False positive:

Improve regex for URL

The current URL regex produces false positives. Probably can be improved.

benchmark latest code with v0.8

As regex 1.9 speeds up the use of regexes in threads, this should boost our performance as well.

We need to benchmark the latest lemmeknow against v0.8.

need some tests to validate the regexes from JSON file

The regex.json file also has Examples for some regexes:

{
      "Name": "Capture The Flag (CTF) Flag",
      "Regex": "(?i)^(flag\\{.*\\}|ctf\\{.*\\}|ctfa\\{.*\\})$",
      "plural_name": false,
      "Description": null,
      "Rarity": 1,
      "URL": null,
      "Tags": [
         "CTF Flag"
      ],
      "Examples": {
         "Valid": [
            "FLAG{hello}"
         ],
         "Invalid": []
      }
   },

We need to check that each regex matches those examples correctly in order to validate it! For that, we can create a file under tests, such as lemmeknow/tests/validate_regexes.rs, that parses the JSON file and validates the examples there.

Allow querying the online lemmeknow by URL

When opening the lemmeknow webpage with a URL such as this:
https://swanandx.github.io/lemmeknow-frontend/?q=search+term
It should use this as input and try to figure out what the search term could be.

Why?

Browser Search engines.

With this feature you could register lemmeknow as a search engine for your browser. You could (for example) use the alias lmk to search lemmeknow.

Then typing lmk dQw4w9WgXcQ would lead you to find out what the ID stands for.
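As an illustration of the input handling this would need, here is a standard-library sketch that pulls the `q` parameter out of such a URL and decodes it. `query_param` and `decode_component` are hypothetical helpers. The real frontend runs in the browser and may well rely on the browser's own URL APIs instead.

```rust
/// Converts one ASCII hex digit to its value.
fn hex_value(b: u8) -> Option<u8> {
    (b as char).to_digit(16).map(|d| d as u8)
}

/// Decodes form-urlencoded text: `+` becomes a space and `%XX` becomes the
/// byte with that hex value. Malformed escapes are kept unchanged.
fn decode_component(raw: &str) -> String {
    let bytes = raw.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'+' => {
                out.push(b' ');
                i += 1;
            }
            b'%' if i + 2 < bytes.len() => {
                match (hex_value(bytes[i + 1]), hex_value(bytes[i + 2])) {
                    (Some(hi), Some(lo)) => {
                        out.push(hi * 16 + lo);
                        i += 3;
                    }
                    _ => {
                        out.push(b'%');
                        i += 1;
                    }
                }
            }
            other => {
                out.push(other);
                i += 1;
            }
        }
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Returns the decoded value of the first `key` parameter in `url`, if any.
pub fn query_param(url: &str, key: &str) -> Option<String> {
    let (_, rest) = url.split_once('?')?;
    // Ignore a trailing `#fragment`.
    let query = rest.split('#').next().unwrap_or(rest);
    query.split('&').find_map(|pair| {
        let (k, v) = pair.split_once('=').unwrap_or((pair, ""));
        if decode_component(k) == key {
            Some(decode_component(v))
        } else {
            None
        }
    })
}
```

The decoded value would then be fed to the identifier as if the user had typed it into the search box.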

proposal: use `SmallVec` instead of `Vec` for buffer

The idea is to use smallvec for the buffer while extracting strings from a file.
We only consider strings that are longer than 4 characters, so for the shorter strings, which we are going to reject anyway, we can avoid the heap allocation caused by the buffer vector.

let mut buffer: SmallVec<[u8; 4]> = smallvec![]; // TODO: change 4 to more optimal number

So, what to do?

Change Vec to SmallVec
Experiment with different sizes (at least 4 would be good imho, but we should try others using a quasi-doubling strategy, i.e. 4, 8, 16, 32)
Benchmark the code to see if it actually improves performance.
Choose the one with the best performance (post the results here or when making the PR)

It would be amazing if you could post benchmarks for all sizes, so we can choose the most optimal one.

Configure release profile

All we need to do is add the following at the end of our Cargo.toml file:

[profile.release]
lto = true
panic = "abort"

That is it!

AWS org id regex problem

One more regex that seems too loose:

Add Doctests

There are currently no doctests for the API, and it is confusing for me to read :( It'd be lovely to have some!

Can I be a contributor pls?

Looking at making a few PRs. Can you make me a contributor please?

Unable to identify base64

Hi, I'm trying lemmeknow and I gotta say that it works quite well with links (e.g. YouTube channels, wallets and so on), but it fails to detect the easiest things. For example, it is unable to recognize base64-encoded text.

Nice work though.

Show Exploits in cli output

We have an Exploit entry for some identifications. It would be great if we could show them when the user passes -v, i.e. the verbose flag, on the CLI.

{
      "Name": "Mailchimp API Key",
      ...
      "Description": null,
      "Exploit": "Use the command below to verify that the API key is valid (substitute <dc> for your datacenter, i. e. us5):\n  $ curl --request GET --url 'https://<dc>.api.mailchimp.com/3.0/' --user 'anystring:API_KEY_HERE' --include\n",
      "Rarity": 0.8,
      "URL": null,
     ...
   },

use `onig` crate for matching on strings

We can use the onig crate instead of the regex crate for the strings API. We can't fully replace the regex crate, as onig doesn't provide a way to match on bytes, nor does it compile for wasm32 (it might, but that would be a lot of work).

Suggestion

Use onig for matching on strings ( not on wasm32 target )
Use regex for bytes and strings for wasm32

Pros

onig takes performance from 130ms to 28ms!! It's blazingly fast

Cons

Variance in performance between matching on strings and matching on bytes for API users, because we have no choice other than the regex crate for bytes
Extra dependency

Workaround

We can add a feature called bytes for enabling bytes support; that way users can explicitly opt in to adding the regex crate as a dependency and be aware of the (comparatively) slower performance.

regex vs onig crate benchmark!

Identifies JWT as LTC/Ripple/BCH-Wallet-Address

Input: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c

Output:
Found Possible Identifications :)

Matched text	Identified as	Description
`eyJhbGciOiJIUzI1..._adQssw5c`	Litecoin (LTC) Wallet Address	URL: https://live.blockcypher.com/ltc/address/eyJhbGciOiJIUzI1..._adQssw5c

Expected "Identified as" would be JWT.
Shortened the JWT for better visibility.

Another example:

Input: eyJhbGciOiJIUzI1NiJ9.eyJmb28iOiJiYXIifQ.JTvQIxZOL_-00JdKfTAEmhV-a6KUlB6OUWM8NuN7MN8
Output: No Possible Identifications :(

Expected "Identified as" would be JWT.

Crypto addresses regex problem

I am pretty sure GitHub has reliable regexes for many crypto addresses that would prevent these false positives:

Add some benchmarks!

This project uses the same regex database as pyWhat, so we need some performance benchmarks against it <3 !

For identifying single text
For analyzing strings from a file
Calling the function multiple times through the API, i.e. lemmeknow::what_is("text here") for lemmeknow and Identifier.identify("text") for pyWhat.

API documentations available here 😸 :-

Use bytes instead of strings, ditch fancy_regex for regex crate

Currently lemmeknow uses the fancy_regex crate for matching regexes. The problem is that it doesn't support bytes. The regex crate, however, does: https://docs.rs/regex/1.0.0/regex/bytes/index.html

If there is no reason to use fancy_regex then we should switch. Both pyWhat and lemmeknow only support ASCII strings. We need to support UTF-8, UTF-16, etc. and bytes.

See this equivalent pyWhat issue: bee-san/pyWhat#34

IPv6 Regex problem

It would be fun if this were a valid IPv6 address.

URL is not parsed with params :(

$ pywhat https://google.com?pageId=102013
zsh: no matches found: https://google.com?pageId=102013