lazynlp's Issues

"Bug Report: Pylint Warning W0102 - Dangerous Default Value in download_pages Function"

Hello,

I am writing about crawl.py in your Python source. Running Pylint on it reported a few warnings that you may want to look into or fix.

lazynlp/crawl.py:173:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
lazynlp/crawl.py:173:0: W0102: Dangerous default value [] as argument (dangerous-default-value)
lazynlp/crawl.py:222:8: W0105: String statement has no effect (pointless-string-statement)

outputLint.txt

A possible fix:

def download_pages(link_file,
                   folder,
                   timeout=30,
                   default_skip=True,
                   extensions=None,
                   domains=None):
    """
    Your function documentation here.
    """
    # Check if extensions and domains are None, and if so, initialize them to empty lists
    if extensions is None:
        extensions = []
    if domains is None:
        domains = []

    # Your function code continues...

This modification ensures that each call to download_pages() gets its own separate empty list for extensions and domains.
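To illustrate why Pylint flags W0102, here is a minimal sketch of the pitfall in isolation. The functions append_bad and append_good are hypothetical and are not part of lazynlp; whether download_pages actually mutates its default lists is not shown in this report, so the warning may be harmless in practice there.

```python
def append_bad(item, bucket=[]):
    # W0102: the default list is created once, when the function is defined,
    # and is then shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_good(item, bucket=None):
    # A fresh list is created on each call that omits `bucket`.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_bad = list(append_bad("a"))    # snapshot of the shared list after call 1
second_bad = list(append_bad("b"))   # snapshot after call 2
first_good = append_good("a")
second_good = append_good("b")

print(second_bad)   # ['a', 'b']  (state leaked from the first call)
print(second_good)  # ['b']
```

The None sentinel keeps the call signature backward compatible: callers that already pass their own lists are unaffected.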

The warnings above are only a sample of what Pylint found. I have also attached a text file with the Pylint output for all source files. I hope this helps.

Regards,
Rebal

urllib fails without headers

Hi,
Thanks for this great tool.

I noticed urllib fails with a Forbidden Request error when I call download_page on some links. You can reproduce the error by trying the code below:

import lazynlp
link = "https://punchng.com/"
page = lazynlp.download_page(link, context=None, timeout=None)

This raises a 403 (Forbidden) error.

I've attempted to create a PR that adds headers to the request by default.
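For reference, a minimal sketch of attaching headers to a urllib request. This is not the content of the PR; build_request, DEFAULT_HEADERS, and the User-Agent value are assumptions made for illustration, and a given site may still refuse the request.

```python
import urllib.request

# Assumption: an illustrative browser-like User-Agent string.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}


def build_request(link, headers=None):
    """Return a urllib Request carrying default headers plus any overrides."""
    merged = dict(DEFAULT_HEADERS)
    if headers:
        merged.update(headers)
    return urllib.request.Request(link, headers=merged)


req = build_request("https://punchng.com/")
# urllib stores header names capitalized, hence "User-agent" in the lookup.
print(req.get_header("User-agent"))
# The request would then be sent with urllib.request.urlopen(req, timeout=30).
```

Building the Request does not touch the network, so the headers can be inspected before anything is sent.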

License?

Hello,

Code with no license raises legal problems. Where I work, using code that has no license attached is outright banned.

Would you be so kind as to add a license file?

It would be very nice if it were something permissive, such as MIT, Apache 2, or BSD.

Thank you!

Bug and Error Report for unused variables

Hello,
I am writing about your Python code. Running Pylint and Pyflakes reported a few warnings about unused variables that may be worth looking into and fixing:

Pylint:
lazynlp/cleaner.py:21:4: W0612: Unused variable 'e' (unused-variable)
lazynlp/crawl.py:95:4: W0612: Unused variable 'raw_url' (unused-variable)
lazynlp/crawl.py:118:4: W0612: Unused variable 'e' (unused-variable)
Pyflakes:
lazynlp/crawl.py:95:5: local variable 'raw_url' is assigned to but never used
lazynlp/crawl.py:118:5: local variable 'e' is assigned to but never used
lazynlp/crawl.py:121:5: local variable 'e' is assigned to but never used
lazynlp/crawl.py:152:5: local variable 'e' is assigned to but never used
lazynlp/crawl.py:155:5: local variable 'e' is assigned to but never used

outputFlakes.txt

outputLint.txt

More warnings were reported in the remaining source files, but to keep this message short I have listed only the unused-variable ones found by Pyflakes and Pylint. I have also attached text files with the full Pylint and Pyflakes output for all source files. I hope this helps.
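Most of the reports above concern a variable named 'e', which usually comes from an except clause that binds the exception but never uses it. A minimal sketch of two ways to address that, assuming this is the pattern in question (the report does not show the code itself); parse_int_silent and parse_int_logged are hypothetical examples, not lazynlp functions.

```python
import logging

logger = logging.getLogger("example")


def parse_int_silent(text):
    """Option 1: drop the unused name, so there is nothing left to flag."""
    try:
        return int(text)
    except ValueError:
        return None


def parse_int_logged(text):
    """Option 2: keep the name and actually use it, here in a log message."""
    try:
        return int(text)
    except ValueError as e:
        logger.warning("could not parse %r: %s", text, e)
        return None


print(parse_int_silent("12"))   # 12
print(parse_int_logged("abc"))  # None
```

Option 2 is usually preferable in a crawler, since silently swallowed errors make failed downloads hard to diagnose.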

Regards,
Rebal

syntax error near unexpected token

I see a "syntax error near unexpected token `sgp.urls,'" on submitting the following command:
lazynlp.download_pages(sgp.urls, text_docs, timeout = 30, default_skip = True, extensions = [], domains = [])

Am I doing something wrong? sgp.urls contains all the URLs, text_docs is the name of the folder for the outputs, and the remaining parameters are left at their defaults.
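The wording "syntax error near unexpected token" is what bash prints, so one possible cause is that the line was typed at a shell prompt rather than inside Python. That is an assumption, since the report does not say where the command was run. A minimal sketch of the same call made from a Python script or interpreter follows; run_download is a hypothetical wrapper, and the parameter names are taken from the call shown above.

```python
def run_download(link_file="sgp.urls", folder="text_docs"):
    """Call lazynlp.download_pages from Python, passing both paths as strings."""
    # Imported lazily so this sketch loads even where lazynlp is not installed.
    import lazynlp

    # Both paths are quoted strings. A bare sgp.urls would be read by Python
    # as attribute access on an undefined name `sgp` and raise NameError.
    return lazynlp.download_pages(
        link_file,
        folder,
        timeout=30,
        default_skip=True,
        extensions=[],
        domains=[],
    )
```

Calling run_download() from a Python session, with sgp.urls present in the working directory, would then start the download.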

Sum of n-gram counts

Thanks for building this, really nice work!

I was reading through the code and noticed this line:

count.update()

Were you looking to iteratively add up the line-ngram-counts? If yes, I can help complete that and raise a PR
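For context, calling update() with no arguments on a collections.Counter changes nothing. A minimal sketch of what summing per-line n-gram counts could look like, assuming count is a Counter; line_ngrams and count_ngrams are hypothetical names, not the functions in lazynlp.

```python
from collections import Counter


def line_ngrams(line, n):
    """Return the word n-grams of one line as a list of tuples."""
    tokens = line.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n=2):
    """Accumulate n-gram counts across all lines."""
    count = Counter()
    for line in lines:
        # Counter.update adds to existing counts instead of replacing them.
        count.update(line_ngrams(line, n))
    return count


total = count_ngrams(["a b c", "a b d"], n=2)
print(total[("a", "b")])  # 2
```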

Let me know.

All the best

Bugs and Errors Format issue

Hello,

I am writing about your Python source files. Running Pyflakes and Pylint reported a few issues that you may want to look into or fix.

lazynlp/analytics.py:200:10: C0209: Formatting a regular string which could be an f-string (consider-using-f-string)
lazynlp/analytics.py:231:21: W0613: Unused argument 'file' (unused-argument)
lazynlp/analytics.py:231:27: W0613: Unused argument 'gran' (unused-argument)
lazynlp/analytics.py:231:40: W0613: Unused argument 'max_n' (unused-argument)

lazynlp/cleaner.py:74:8: R1724: Unnecessary "else" after "continue", remove the "else" and de-indent the code inside it (no-else-continue)
lazynlp/crawl.py:96:4: E0633: Attempting to unpack a non-sequence defined at line 65 of tldextract.tldextract (unpacking-non-sequence)
lazynlp/crawl.py:106:0: R0911: Too many return statements (11/6) (too-many-return-statements)

outputLint.txt

These issues may cause unnecessary memory use, and the code could be better formatted. The warnings above are only a sample of what Pylint found. I have also attached a text file with the Pylint output for all source files. I hope this helps.
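As an illustration of two of the style warnings listed (C0209 and R1724), here is a minimal sketch. The functions describe and non_empty_upper are hypothetical and do not reproduce the flagged lazynlp code, which is not shown in this report.

```python
def describe(name, count):
    # C0209: an f-string in place of "%s: %d" % (name, count).
    return f"{name}: {count}"


def non_empty_upper(lines):
    result = []
    for line in lines:
        if not line.strip():
            continue
        # R1724: no "else" is needed here, because "continue" has already
        # skipped the blank lines by the time this statement runs.
        result.append(line.strip().upper())
    return result


print(describe("pages", 3))                # pages: 3
print(non_empty_upper(["a", "  ", "b "]))  # ['A', 'B']
```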

Regards,
Rebal
