
labyrinth's People

Contributors

ahouseholder


labyrinth's Issues

Change how modulus is computed

mods = _df["id"].apply(lambda x: x % divisor)

This line uses the repo id and a modulus to decide how to split repos across parallel runs of the script. The problem is that sometimes individual runs can fail repeatedly, meaning that the same block of repos never gets worked on.

We can't just randomize it, because then we will have more than one process handling a repo.

So I'm thinking we need to add some other factor that is constant for an individual run, but changes between runs.
It could be the hour of the day, or maybe there's a run ID that can be converted to an int? The former is available from within the Python code directly, whereas the latter might require modifying the workflow scripts, unless there is already an environment variable the Python code can use.
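
A minimal sketch of the hour-of-day variant (the run_offset name is illustrative; _df and divisor are as in the existing code):

from datetime import datetime, timezone

# Hypothetical sketch: mix a per-run constant into the modulus so that a
# block of repos whose run keeps failing lands in a different worker's
# slice on the next run. The UTC hour is constant within a run but
# changes between runs; any value with that property would work.
run_offset = datetime.now(timezone.utc).hour
mods = _df["id"].apply(lambda x: (x + run_offset) % divisor)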

Repo deep dive action is failing

See for example https://github.com/CERTCC/labyrinth/actions/runs/6536547242

Prepare all required actions
Run ./.github/actions/deep_dive
Run repo_deep_dive --verbose --mod 5 --divisor 10 --results_dir results/2023/10/15 --max_age 7200
INFO root - log level: INFO
INFO labyrinth.repo_processor - Reading 1 search result summaries
INFO labyrinth.repo_processor - Found 13 search results to process
INFO labyrinth.repo_processor - Cloning https://github.com/ExploitRc3/ExploitRc3.git 1 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/codingcore12/Extremely-Silent-JPG-Exploit-NEW-nk.git 2 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/PrasoonPratham/Simple-XSS-exploit-example.git 3 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/Pyr0sec/CVE-2023-38646.git 4 of 13
INFO labyrinth.file_processor - Found 1 matches in 1 out of 3 files
INFO labyrinth.repo_processor - Cloning https://github.com/iotwar/AntiQbot.git 5 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/Latrodect/EATER-offensive-security-frameowork.git 6 of 13
INFO labyrinth.repo_processor - Cloning https://github.com/Anthony-T-N/CTF-Binary-Exploitation.git 7 of 13
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.18/x64/bin/repo_deep_dive", line 66, in <module>
    process_modulo(args.results_dir, args.mod, args.divisor)
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line [28](https://github.com/CERTCC/labyrinth/actions/runs/6536547242/job/17749913960#step:7:30)5, in process_modulo
    df = scan_repos(top_dir, mod, divisor)
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 2[32](https://github.com/CERTCC/labyrinth/actions/runs/6536547242/job/17749913960#step:7:34), in scan_repos
    results = df.apply(process_row, axis=1).to_list()
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/frame.py", line 10037, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 837, in apply
    return self.apply_standard()
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 963, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 979, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 1[36](https://github.com/CERTCC/labyrinth/actions/runs/6536547242/job/17749913960#step:7:38), in process_row
    gh_has_newer = _check_repo_newer(ts, repo_name)
  File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 77, in _check_repo_newer
    if m_ts < repo.pushed_at:
TypeError: can't compare offset-naive and offset-aware datetimes
Error: Process completed with exit code 1.
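
A plausible fix, assuming the timestamp parsed in _check_repo_newer is offset-naive while PyGithub's repo.pushed_at is offset-aware (or vice versa): normalize both to UTC before comparing. This is a sketch, not the repo's actual code, and it assumes naive timestamps are already in UTC:

from datetime import timezone

# Make both timestamps offset-aware before the comparison that raised
# the TypeError. Assumption: naive values represent UTC times.
if m_ts.tzinfo is None:
    m_ts = m_ts.replace(tzinfo=timezone.utc)
pushed_at = repo.pushed_at
if pushed_at.tzinfo is None:
    pushed_at = pushed_at.replace(tzinfo=timezone.utc)
if m_ts < pushed_at:
    ...  # existing logic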

Possible bug: reuse of name inside a loop

The code seems to work, but I just noticed in this segment of code that quals is both the thing being iterated over AND something that is assigned to inside the loop. That seems like a bad idea even if it isn't broken. Need to change the name inside the loop to something else.

quals = [{"pushed": f"{d1}..{d2}"} for d1, d2 in zip(start_dates, end_dates)]
print(f"Starting {len(quals)} queries", flush=True)
if page_size > 100:
    raise ValueError("Github requires page_size <= 100")
gh = Github(login_or_token=labyrinth.GH_TOKEN, per_page=page_size, retry=2)
results = []
for qualifiers in quals:
    check_rate_limits(gh)
    quals = " ".join(f"{k}:{v}" for k, v in qualifiers.items())
    qstr = f"{query} {quals}"

Add code / workflow to incorporate repositories found by PocOrExp_in_Github

There's a project similar to this one that is doing per-CVE searches on Github. Our choice here is to either

  1. add those searches directly ourselves. Their process appears to be:
     • Get vul IDs from NVD
     • Search github for each vul ID
     I haven't looked in detail to see if/how often they recheck older IDs.
  2. write a workflow and tool that pulls in the repositories PocOrExp_in_Github has already found.

Of the two, item 2 seems the easier one to implement, although item 1 is certainly more robust to future change.

repo_deep_dive failing with permission error (intermittent)

See for example
https://github.com/CERTCC/labyrinth/actions/runs/5527573575/job/14968312185
log snippet follows

2023-07-12T05:42:25.0531884Z ##[group]Run repo_deep_dive --verbose --mod 3 --divisor 10 --results_dir results/2023/07/11 --max_age 7200
2023-07-12T05:42:25.0532364Z repo_deep_dive --verbose --mod 3 --divisor 10 --results_dir results/2023/07/11 --max_age 7200
2023-07-12T05:42:25.0634838Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2023-07-12T05:42:25.0635111Z env:
2023-07-12T05:42:25.0635399Z   pythonLocation: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0635778Z   PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib/pkgconfig
2023-07-12T05:42:25.0636381Z   Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0636726Z   Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0637048Z   Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
2023-07-12T05:42:25.0637388Z   LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib
2023-07-12T05:42:25.0637974Z   GH_TOKEN: ***
2023-07-12T05:42:25.0638187Z ##[endgroup]
2023-07-12T05:42:25.7345778Z INFO root - log level: INFO
2023-07-12T05:42:25.7353976Z INFO labyrinth.repo_processor - Reading 1 search result summaries
2023-07-12T05:42:25.7835639Z INFO labyrinth.repo_processor - Found 16 search results to process
2023-07-12T05:42:25.7888462Z INFO labyrinth.repo_processor - Cloning https://github.com/bha-vin/HTB-Beep.git 1 of 16
2023-07-12T05:42:25.9948417Z INFO labyrinth.repo_processor - Cloning https://github.com/codingcore12/SILENT-DOC-EXPLOIT-CLEAN-v5.git 2 of 16
2023-07-12T05:42:26.1897741Z INFO labyrinth.repo_processor - Cloning https://github.com/gcarrilao/hook.git 3 of 16
2023-07-12T05:42:26.3880162Z Traceback (most recent call last):
2023-07-12T05:42:26.3887717Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/bin/repo_deep_dive", line 57, in <module>
2023-07-12T05:42:26.3888708Z     process_modulo(args.results_dir, args.mod, args.divisor)
2023-07-12T05:42:26.3889799Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 276, in process_modulo
2023-07-12T05:42:26.3890321Z     df = scan_repos(top_dir, mod, divisor)
2023-07-12T05:42:26.3891004Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 223, in scan_repos
2023-07-12T05:42:26.3891563Z     results = df.apply(process_row, axis=1).to_list()
2023-07-12T05:42:26.3892230Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/frame.py", line 9423, in apply
2023-07-12T05:42:26.3899802Z     return op.apply().__finalize__(self, method="apply")
2023-07-12T05:42:26.3900537Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 678, in apply
2023-07-12T05:42:26.3901242Z     return self.apply_standard()
2023-07-12T05:42:26.3901838Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 798, in apply_standard
2023-07-12T05:42:26.3903187Z     results, res_index = self.apply_series_generator()
2023-07-12T05:42:26.3904231Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/pandas/core/apply.py", line 814, in apply_series_generator
2023-07-12T05:42:26.3905168Z     results[i] = self.f(v)
2023-07-12T05:42:26.3905773Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 148, in process_row
2023-07-12T05:42:26.3906211Z     _df = process_git_url(clone_url, workdir)
2023-07-12T05:42:26.3906927Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/repo_processor.py", line 49, in process_git_url
2023-07-12T05:42:26.3907353Z     df = process_dir(workdir, workdir)
2023-07-12T05:42:26.3908020Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/file_processor.py", line 94, in process_dir
2023-07-12T05:42:26.3908443Z     _df = process_file(fpath, workdir)
2023-07-12T05:42:26.3909108Z   File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/file_processor.py", line 28, in process_file
2023-07-12T05:42:26.3909691Z     with open(fpath, "r", encoding="ISO-8859-1") as fp:
2023-07-12T05:42:26.3910183Z PermissionError: [Errno 13] Permission denied: '/tmp/git-clone-svldfy0g/link'
2023-07-12T05:42:26.4827113Z ##[error]Process completed with exit code 1.
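
One possible hardening, sketched against the process_file signature visible in the traceback: cloned repos can contain symlinks or files the runner cannot read, so skip anything that can't be opened rather than letting the whole run die. The guard and the return convention here are assumptions, not the repo's actual code:

import os

def process_file(fpath, workdir):
    # Hypothetical guard: skip symlinks and unreadable files instead of
    # raising PermissionError and killing the run.
    if os.path.islink(fpath) or not os.access(fpath, os.R_OK):
        return None  # caller would need to tolerate an empty result
    try:
        with open(fpath, "r", encoding="ISO-8859-1") as fp:
            ...  # existing per-file processing
    except OSError:  # includes PermissionError
        return None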

search_github fails to handle 404 errors

See failed job in https://github.com/CERTCC/labyrinth/actions/runs/5527573575/job/14968312185

Log snippet follows

search_github --gh_token *** --start_date 2023-07-09 --end_date 2023-07-10  exploit
  shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
  env:
    pythonLocation: /opt/hostedtoolcache/Python/3.9.17/x64
    PKG_CONFIG_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib/pkgconfig
    Python_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
    Python2_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
    Python3_ROOT_DIR: /opt/hostedtoolcache/Python/3.9.17/x64
    LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.17/x64/lib
Starting 1 queries
Search: exploit pushed:2023-07-09..2023-07-10
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.17/x64/bin/search_github", line 202, in <module>
    main(search_str, args.start_date, args.end_date, args.overwrite)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/bin/search_github", line 47, in main
    data = do_search(query, start_date, end_date)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/labyrinth/search.py", line 79, in do_search
    data = r.raw_data
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/GithubObject.py", line 160, in raw_data
    self._completeIfNeeded()
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/GithubObject.py", line 390, in _completeIfNeeded
    self.__complete()
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/GithubObject.py", line 395, in __complete
    headers, data = self._requester.requestJsonAndCheck("GET", self._url.value)
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/Requester.py", line 442, in requestJsonAndCheck
    return self.__check(
  File "/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/github/Requester.py", line 487, in __check
    raise self.__createException(status, responseHeaders, data)
github.GithubException.UnknownObjectException: 404 {"message": "Not Found", "documentation_url": "https://docs.github.com/rest/reference/repos#get-a-repository"}
Error: Process completed with exit code 1.
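
A sketch of how do_search might tolerate this, assuming the failing r.raw_data call sits inside a loop over search results: a repo can be deleted or made private between the search and the follow-up fetch, so a 404 there is skippable rather than fatal. The loop variable and surrounding structure are assumptions:

import logging

from github import UnknownObjectException

for r in results:  # assumed iteration over search results
    try:
        data = r.raw_data
    except UnknownObjectException:
        # Repo vanished (deleted or made private) between the search and
        # this follow-up fetch; log it and move on instead of crashing.
        logging.warning("Skipping repo that returned 404")
        continue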

Split /results and /data into a separate repository

Having both the code and the data it collects in a single repository was neat when we started, but the data has grown so much (and is updated so frequently) that it's impossible to follow the git commit history for the code anymore.

My proposal is to:

  • retain the code & automation parts in this repository
  • move /results and /data to a separate repository or repositories (there's no reason they need to be in the same repo, since the process that generates data into /results is distinct from the process that generates data into /data)
  • adjust the automation so that it updates the data in the new repositories instead of just reflecting it back to this one

Note that implementing this would probably be a good time to consider implementing #1 as well.

Clarify license

The only licensing info I could find is in the setup file: license="all rights reserved". Can you clarify what the license for this repo is?

Checking out the repository in Actions is slow

It's taking over 20 minutes for each action to check out the repository before it can do anything. As a result, every run of SearchRepos uses roughly 12 hours of compute time, and the vast bulk of that is spent checking out the repository. It's probably time to optimize the checkout step and see if we can make it more efficient.

What is this for?

May I ask, what is this repository for? My name is mentioned a few times here.

Increase commit chunkiness

All the workflows are currently operating on the main branch. This results in a lot of small commits every time a workflow runs.

There are a few related tasks here

  • adapt the search and update_summaries jobs in SearchRepos workflow to do their work in a branch or branches, then squash-merge the results back to main. Remove the working branch when done.

  • subtask of the above, or could be treated separately: Once update_summaries has done its job, the intermediate per-search result json files can be deleted. So they can exist on the working branch, but would never need to make it to main. Only summaries would get into main. Note, however, that this will require changes to generate_summaries so that we can continue to do monthly and yearly summaries too. (It's not as simple as adding a remove-all-non-summaries method.)

  • adapt the deep_dive and repo2vulid jobs in SearchRepos workflow to do their work in a branch or branches, then squash-merge the results back to main. Unlike the search/summaries items above, we want to retain both the repo and vul-id centric views, so in this case there is no post-action cleanup to be done.

Too many commits to clone?

@trentn reported issues with cloning, possibly due to the number of commits

I think this might be slowing down the action runs too; they seem to be taking more than an hour.

Solving #1 might solve this.

SearchRepos is failing because pandas append went away

Prepare all required actions
Run ./.github/actions/single_search
Run search_github --gh_token *** --start_date 2023-06-04 --end_date 2023-06-05  attack poc
Starting 1 queries
Search: attack poc pushed:2023-06-04..2023-06-05
Found 3 results for attack poc pushed:2023-06-04..2023-06-05
df has 3 rows
df has 3 rows after dropna
df has 3 rows after drop out of range dates
Search found 3 results for 2023-06-05
Read 1 records from results/2023/06/05/2023-06-05_attack_poc.json.
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.9.16/x64/bin/search_github", line 202, in <module>
    main(search_str, args.start_date, args.end_date, args.overwrite)
  File "/opt/hostedtoolcache/Python/3.9.16/x64/bin/search_github", line 139, in main
    out_df = json_df.append(new_df, ignore_index=True)
  File "/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/pandas/core/generic.py", line 5989, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'append'
Error: Process completed with exit code 1.
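
The fix is the documented pandas replacement: DataFrame.append was deprecated and then removed in pandas 2.0 in favor of pd.concat. Applied to the failing line in search_github (a sketch):

import pandas as pd

# pandas 2.0 removed DataFrame.append; pd.concat is the replacement.
out_df = pd.concat([json_df, new_df], ignore_index=True)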
