campbells's People

Contributors

lmmx

campbells's Issues

Split tree builder base from subclasses

The builder.build module was split off from its init for state/behaviour separation but it's still in need of refactoring.

Put the base and the individual builders in separate modules; in fact, turn build.py into a subpackage, tree_builders.

Remember to chase down any references in the rest of the library.

  • Chinois will not be affected if the builder.__init__ module namespace is preserved

Clarify use of parser library names for tree builder names/features

The tree builders are all named in different ways, and some names are used for features too. Messy: what's needed is to sit down and study where these names actually go (i.e. what the impact would be of changing them to standardise). The overall data flow is ad-hoc due to use of these primitive string types directly at the site of use, when they are clearly parts of a centralisable object (for now I've just put what I can into a module named parser_names).
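One way the centralised object could look is a string-valued enum, so existing call sites that compare against plain strings keep working. This is only a sketch: the class name `ParserName`, its `parse` helper, and the exact set of members are assumptions for illustration, not what the parser_names module currently contains.

```python
from enum import Enum


class ParserName(str, Enum):
    """Hypothetical central list of parser library names (members are assumed)."""

    LXML = "lxml"
    HTML5LIB = "html5lib"
    HTML_PARSER = "html.parser"

    @classmethod
    def parse(cls, raw):
        """Convert a raw string to a member, failing loudly on unknown names."""
        try:
            return cls(raw)
        except ValueError:
            known = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unknown parser name {raw!r}; expected one of: {known}"
            ) from None
```

Because the enum mixes in str, `ParserName.LXML == "lxml"` still holds, which would allow the names to be migrated one call site at a time.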

Report bug upstream: `all_strings` value length check inconsistent

There are 2 implemented _all_strings methods, and in one of them (Tag) the length check on the value occurs only when strip is True.

This is NavigableString

        value = self
        if strip:
            value = value.strip()
        if len(value) > 0:
            yield value

This is Tag (where multiple descendants are looped over)

            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

Is this a bug?

If so it would give inconsistent handling between NavigableString vs. Tag

Potentially should be reported upstream.
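A minimal reproduction of the difference, using two standalone functions that mirror the snippets above rather than the library classes themselves (the function names are made up for this sketch):

```python
def navigable_all_strings(value, strip=False):
    """Mirrors the NavigableString snippet: the length check always runs."""
    if strip:
        value = value.strip()
    if len(value) > 0:
        yield value


def tag_all_strings(descendants, strip=False):
    """Mirrors the Tag snippet: the length check only runs when strip is True."""
    for descendant in descendants:
        if strip:
            descendant = descendant.strip()
            if len(descendant) == 0:
                continue
        yield descendant


# An empty string with strip=False is dropped by one path and kept by the other.
nav_result = list(navigable_all_strings("", strip=False))
tag_result = list(tag_all_strings([""], strip=False))
print(nav_result)  # []
print(tag_result)  # ['']
```

With strip=True both paths drop empty and whitespace-only strings, so the two only diverge for an empty string when strip is False.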

Console script entrypoint to replace (moved) direct invocation route

I checked: mvdef used setup.py to register a console script entrypoint, while magic-scrape shows how to do it in pyproject.toml:

[project.scripts]
magicscrape = "magic_scrape.cli:main"

The campbells.main module has a leftover part that was moved out of __init__.py (which would be invoked if you called the init module as a script directly). Expose it via a console script entrypoint:

# If this file is run as a script, act as an HTML pretty-printer.
if __name__ == "__main__":
    soup = CampbellsSoup(sys.stdin)
    print(soup.prettify())
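A sketch of what the entrypoint function might look like once the script block is wrapped in a callable. The function name `main`, its optional parameters, the `campbells = "campbells.main:main"` script name, and the `from campbells import CampbellsSoup` import path are all assumptions for illustration.

```python
import sys

# pyproject.toml (assumed script name and module path):
# [project.scripts]
# campbells = "campbells.main:main"


def main(stdin=None, stdout=None, soup_factory=None):
    """Read HTML from stdin and write the prettified document to stdout."""
    stdin = sys.stdin if stdin is None else stdin
    stdout = sys.stdout if stdout is None else stdout
    if soup_factory is None:
        # Assumed import path; deferred so importing this module has no side effects.
        from campbells import CampbellsSoup as soup_factory
    soup = soup_factory(stdin)
    print(soup.prettify(), file=stdout)
```

The optional parameters exist only so the function can be exercised in tests without a real stdin; the console script would call `main()` with no arguments.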

Split `dammit` module monolith into subpackage

The campbells.dammit module is the 2nd largest (1131 lines), effectively a subpackage handling unicode encodings.

This seems like a reasonable case for modularisation, probably one module per class.

  • EntitySubstitution
  • EncodingDetector
  • UnicodeDammit

Similar task to #3

Can tree builders be generalised/simplified/pruned?

The tree builders appear to have no methods in common. This is a little surprising. It may make more sense after reading them in more detail.

Perhaps they can be split out into subpackages to make the function of each component clearer (or perhaps this would just create sprawl and no insight).

There is also a SAXTreeBuilder class marked as non-operational:

A Campbells treebuilder that listens for SAX events.

This is not currently used for anything, but it demonstrates how a simple TreeBuilder would work.

Delete this or consider how it could be generalised to a different interface (perhaps as the basis of bisque)

Split out `element` module monolith into subpackage

The campbells.element module is far larger than the rest (2528 lines), effectively a subpackage declaring elements.

This seems like a reasonable case for modularisation, probably in groups (i.e. not one module per element class):

  • NamespacedAttribute
  • AttributeValueWithCharsetSubstitution
  • CharsetMetaAttributeValue
  • ContentMetaAttributeValue
  • PageElement
  • NavigableString
  • PreformattedString
  • CData
  • ProcessingInstruction
  • XMLProcessingInstruction
  • Comment
  • Declaration
  • Doctype
  • Stylesheet
  • Script
  • TemplateString
  • RubyTextString
  • RubyParenthesisString
  • Tag
  • SoupStrainer
  • ResultSet

Set up aliases in `campbells.formatter` with a classmethod

At the end of formatter.py, aliases are set using the REGISTRY class var and the __init__ methods of the Formatter subclasses HTMLFormatter and XMLFormatter:

class HTMLFormatter(Formatter):
    """A generic Formatter for HTML."""

    REGISTRY = {}

    def __init__(self, *args, **kwargs):
        super().__init__(self.HTML, *args, **kwargs)


class XMLFormatter(Formatter):
    """A generic Formatter for XML."""

    REGISTRY = {}

    def __init__(self, *args, **kwargs):
        super().__init__(self.XML, *args, **kwargs)


# Set up aliases for the default formatters.
HTMLFormatter.REGISTRY["html"] = HTMLFormatter(
    entity_substitution=EntitySubstitution.substitute_html,
)
HTMLFormatter.REGISTRY["html5"] = HTMLFormatter(
    entity_substitution=EntitySubstitution.substitute_html,
    void_element_close_prefix=None,
    empty_attributes_are_booleans=True,
)
HTMLFormatter.REGISTRY["minimal"] = HTMLFormatter(
    entity_substitution=EntitySubstitution.substitute_xml,
)
HTMLFormatter.REGISTRY[None] = HTMLFormatter(entity_substitution=None)
XMLFormatter.REGISTRY["html"] = XMLFormatter(
    entity_substitution=EntitySubstitution.substitute_html,
)
XMLFormatter.REGISTRY["minimal"] = XMLFormatter(
    entity_substitution=EntitySubstitution.substitute_xml,
)
XMLFormatter.REGISTRY[None] = Formatter(
    Formatter(Formatter.XML, entity_substitution=None),
)

This is not a great pattern. Instead, use a classmethod to make it clearer what the registrations have in common (this section is hard to read).
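A rough sketch of the classmethod approach, using a stripped-down stand-in for Formatter so the example runs on its own. The method name `register`, the simplified `Formatter.__init__` signature, and the placeholder substitution functions are assumptions, not the library's API. The `None` alias for XML is registered as a plain XMLFormatter here, which is a simplification of the nested Formatter call in the current code.

```python
class Formatter:
    """Stand-in for the real Formatter; it only stores its arguments."""

    HTML = "html"
    XML = "xml"

    def __init__(self, language=None, entity_substitution=None, **options):
        self.language = language
        self.entity_substitution = entity_substitution
        self.options = options

    @classmethod
    def register(cls, name, **kwargs):
        """Build a formatter of this class and store it in cls.REGISTRY."""
        formatter = cls(**kwargs)
        cls.REGISTRY[name] = formatter
        return formatter


class HTMLFormatter(Formatter):
    REGISTRY = {}

    def __init__(self, *args, **kwargs):
        super().__init__(self.HTML, *args, **kwargs)


class XMLFormatter(Formatter):
    REGISTRY = {}

    def __init__(self, *args, **kwargs):
        super().__init__(self.XML, *args, **kwargs)


def substitute_html(text):
    """Placeholder for EntitySubstitution.substitute_html."""
    return text


def substitute_xml(text):
    """Placeholder for EntitySubstitution.substitute_xml."""
    return text


# Set up aliases for the default formatters.
HTMLFormatter.register("html", entity_substitution=substitute_html)
HTMLFormatter.register(
    "html5",
    entity_substitution=substitute_html,
    void_element_close_prefix=None,
    empty_attributes_are_booleans=True,
)
HTMLFormatter.register("minimal", entity_substitution=substitute_xml)
HTMLFormatter.register(None, entity_substitution=None)
XMLFormatter.register("html", entity_substitution=substitute_html)
XMLFormatter.register("minimal", entity_substitution=substitute_xml)
XMLFormatter.register(None, entity_substitution=None)
```

Each subclass keeps its own REGISTRY dict, so the HTML and XML aliases stay separate while the registration logic lives in one place.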

Introduce type annotations

I don't think there are any type annotations in this project.

  • mypy is commented out in the pre-commit config, since it would have blocked the initial project setup
  • There are some types annotated in docstrings (is there a way to automatically move those into annotations?)

Running mypy from the env managed by pdm gives 18 messages (14 errors in 7 files)

  • mypy reported "errors prevented further checking", so this list may be incomplete

src/campbells/builder/_html5lib.py:
8: error: Skipping analyzing "html5lib": module is installed, but missing library stubs or py.typed marker
9: error: Skipping analyzing "html5lib.constants": module is installed, but missing library stubs or py.typed marker
9: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
34: error: Skipping analyzing "html5lib.treebuilders": module is installed, but missing library stubs or py.typed marker

src/campbells/builder/_lxml.py:
4: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker

src/campbells/tests/__init__.py:
44: error: Skipping analyzing "lxml.etree": module is installed, but missing library stubs or py.typed marker
44: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker

src/campbells/tests/fuzz_test.py:
18: error: Skipping analyzing "html5lib": module is installed, but missing library stubs or py.typed marker
19: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker

src/campbells/dammit.py:
28: error: Cannot find implementation or library stub for module named "cchardet"
33: error: Library stubs not installed for "chardet" (or incompatible with Python 3.10)
33: note: Hint: "python3 -m pip install types-chardet"
33: note: (or run "mypy --install-types" to install all missing stub packages)
37: error: Cannot find implementation or library stub for module named "charset_normalizer"

src/campbells/diagnose.py:
44: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker
52: error: Skipping analyzing "html5lib": module is installed, but missing library stubs or py.typed marker

/home/louis/miniconda3/envs/campbells/lib/python3.10/site-packages/chinois/css_match.py:554: error: invalid syntax
Found 14 errors in 7 files (errors prevented further checking)
