datacrawl-ai / datacrawl
A simple and easy-to-use web crawler for Python
License: MIT License
Instead of having every option for the Spider class be a separate argument, there should be one or more Options classes that store these options and then get passed to the Spider class.
Once this is addressed, the max-attributes option in .pylintrc should be set back to 15.
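A minimal sketch of what the options object could look like, assuming a dataclass with field names taken from the current Spider arguments (the exact fields and defaults are up for discussion):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpiderOptions:
    root_url: str
    max_links: int = 5
    save_to_file: Optional[str] = None
    include_body: bool = False

class Spider:
    def __init__(self, options: SpiderOptions) -> None:
        self.options = options  # one attribute instead of many arguments

# Usage: Spider(SpiderOptions(root_url="https://example.com", max_links=10))
```

Grouping the options this way also keeps the Spider signature stable as new flags are added.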
Description:
Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.
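As a rough illustration only (not a committed design), the rendering step could look like this with Playwright as the headless browser; the function name fetch_rendered_html and the timeout are assumptions, and Selenium would be an equally valid choice:

```python
# Sketch only: Playwright is one candidate headless browser, not a
# decided dependency; names here are illustrative.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str, timeout_ms: int = 10_000) -> str:
    """Return the page HTML after JavaScript has executed."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, timeout=timeout_ms)
        html = page.content()  # serialized DOM, post-rendering
        browser.close()
        return html
```

Note that Playwright needs a one-time `playwright install` to download browser binaries, which would be worth mentioning in the docs if this route is taken.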
Objectives:
Design Considerations:
- max_retry_attempts=5 (this can be hardcoded and need not be accepted as a param from the user)
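A sketch of the retry behaviour under that assumption (the exponential backoff is my own addition, not something specified here):

```python
import time

MAX_RETRY_ATTEMPTS = 5  # hardcoded, per the note above

def fetch_with_retries(url: str) -> str:
    for attempt in range(MAX_RETRY_ATTEMPTS):
        try:
            return fetch_rendered_html(url)  # helper sketched above
        except Exception:
            if attempt == MAX_RETRY_ATTEMPTS - 1:
                raise  # out of attempts, surface the last error
            time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s, 8s
    raise AssertionError("unreachable")
```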
Currently the test coverage is around 79% (this includes the soon-to-be-removed main function in crawler.py).
Because of #19, the type hint for Spider.crawl_result broke, and it was temporarily replaced with Dict[str, Dict[str, Any]].
This should be fixed to actually reflect the contents of crawl_result, which has the following format:
```python
crawl_result = {
    "url1": {
        "urls": ["some url", "some other url", ...],
        "body": "the html of the page"
    },
    "url2": {
        "urls": ["some url", "some other url", ...],
        "body": "the html of the page"
    },
}
```
Where body is only present if the include_body argument is set to True, and as such might not always be present.
See #19 for previous discussions about this.
You can verify the type hint is working if the mypy checks pass.
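One way the hint could be written, sketched under the assumption that typing_extensions is acceptable as a dependency (NotRequired only landed in the standard typing module in Python 3.11):

```python
from typing import Dict, List
from typing_extensions import NotRequired, TypedDict

class CrawlResult(TypedDict):
    urls: List[str]
    body: NotRequired[str]  # present only when include_body=True

# Spider.crawl_result could then be annotated as:
crawl_result: Dict[str, CrawlResult] = {}
```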
On my machine, I can't run the unit tests.
When I do poetry run pytest, or when the tests are run through the push hook, it raises:
```
_______________________ ERROR collecting tests/networking/test_validator.py _______________________
ImportError while importing test module 'C:\Users\Public\codg\forks\tiny-web-crawler\tests\networking\test_validator.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
C:\Users\Tiago Charrão\AppData\Local\Programs\Python\Python312\Lib\importlib\__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests\networking\test_validator.py:1: in <module>
    from tiny_web_crawler.networking.validator import is_valid_url
E   ModuleNotFoundError: No module named 'tiny_web_crawler'
```
This happens for every test file.
Steps to reproduce

```
py -m venv .
poetry install --with dev
pre-commit install
pre-commit install --hook-type pre-push
poetry run pytest
```
Additional info
The tests still run fine on CI.
I was able to run them locally by replacing from tiny_web_crawler.etc... with from src.tiny_web_crawler.etc...
It feels like you should be able to do from tiny_web_crawler import Spider, not just from tiny_web_crawler.crawler import Spider.
It would be as simple as adding from tiny_web_crawler.crawler import Spider to __init__.py.
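For reference, the re-export would look something like this (the __all__ line is optional, but it makes the public surface explicit):

```python
# src/tiny_web_crawler/__init__.py
from tiny_web_crawler.crawler import Spider

__all__ = ["Spider"]
```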
Edit:
Running poetry install --with dev doesn't install the pre-commit hooks; as of right now they need to be installed manually through pre-commit install.
- Separation of Concerns
- Use of Dependency Injection
- Organize Code into Modules
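Purely as an illustration of the dependency-injection point (the Fetcher and LinkParser names are hypothetical, not from the codebase): Spider would receive its collaborators instead of constructing them, so tests can pass in fakes.

```python
from typing import List, Protocol

class Fetcher(Protocol):
    def fetch(self, url: str) -> str: ...

class LinkParser(Protocol):
    def extract_links(self, html: str) -> List[str]: ...

class Spider:
    def __init__(self, fetcher: Fetcher, parser: LinkParser) -> None:
        self.fetcher = fetcher  # injected, easy to replace in tests
        self.parser = parser

    def crawl(self, url: str) -> List[str]:
        return self.parser.extract_links(self.fetcher.fetch(url))
```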
- Configure git hook to ensure all the tests pass before push (#20)
- Update how we import the package and update the README usage doc: from tiny_web_crawler import Spider
  PRs: #34
- url_list
This is a placeholder issue for the first major release, v1.0.0.
Please feel free to create issues from this list.
Currently we do not return the HTML body from the crawled sites; we only return the links we find. With this feature, each result entry would have the keys ['urls', 'body'].
E.g., currently:
```json
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
```
This is a feature request to return the HTML body as well. The result should then look like this:
```json
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>stuff</html>"
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ],
        "body": "<html>other stuff</html>"
    }
}
```
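A minimal sketch of how the flag might be wired in, assuming the crawler fetches pages with requests and parses them with BeautifulSoup (parameter and key names follow the example above; everything else is illustrative):

```python
import requests
from bs4 import BeautifulSoup

def crawl_page(url: str, include_body: bool = False) -> dict:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    result = {"urls": [a["href"] for a in soup.find_all("a", href=True)]}
    if include_body:
        result["body"] = str(soup)  # full HTML, only when requested
    return result
```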
At the bottom of crawler.py, there is this piece of code:
```python
def main() -> None:
    root_url = 'https://pypi.org/'
    max_links = 5

    crawler = Spider(root_url, max_links, save_to_file='out.json')
    print(Fore.GREEN + f"Crawling: {root_url}")
    crawler.start()

if __name__ == '__main__':
    main()
```
I'm just curious as to what the purpose of this being here is. It looks like a small piece of code for manually testing the module, but if that is the case it should probably live in a separate file like examples.py, not in crawler.py (if it is meant to be in the source at all).
Very straightforward feature: add a flag to crawl only the root website and not crawl external links.
E.g., if the root URL provided is https://github.com, it should crawl pages in this domain only; it should not crawl https://example.com.
(Optional) Can we also support an option to crawl only external links and no internal links? There could be some use cases for that.
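The domain check itself is small; a sketch, assuming a plain netloc comparison (which deliberately treats subdomains as external, itself a design decision to settle):

```python
from urllib.parse import urlparse

def is_internal(root_url: str, candidate_url: str) -> bool:
    """True when both URLs share the same host."""
    return urlparse(root_url).netloc == urlparse(candidate_url).netloc

# is_internal("https://github.com", "https://github.com/about") -> True
# is_internal("https://github.com", "https://example.com")      -> False
```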
I noticed this when working on #19.
Type checking with mypy is done in CI, but it's not included in the pre-commit hooks.
It should probably be added.
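For reference, a hook along these lines could be added to .pre-commit-config.yaml (the rev pin is an example and would need to match the mypy version used in CI):

```yaml
- repo: https://github.com/pre-commit/mirrors-mypy
  rev: v1.10.0
  hooks:
    - id: mypy
```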
Use this action to create a coverage badge and display it on the README (proposed in #29).
- respect_robots_txt (default should be True because of legal obligations in some jurisdictions)
- urllib.robotparser will help in parsing robots.txt
- Respect crawl-delay if present; check the rule before crawling a path
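A minimal sketch of how urllib.robotparser could be used here, assuming a helper per root domain and a generic user agent (both assumptions, not the project's design):

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def get_robot_parser(root_url: str) -> RobotFileParser:
    """Fetch and parse robots.txt for the root URL's domain."""
    parser = RobotFileParser()
    parser.set_url(urljoin(root_url, "/robots.txt"))
    parser.read()
    return parser

parser = get_robot_parser("https://github.com")
allowed = parser.can_fetch("*", "https://github.com/some/path")
delay = parser.crawl_delay("*")  # None if robots.txt sets no crawl-delay
```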