lmmx / campbells
A condensed web scraping library :canned_food:
License: MIT License
The `builder.build` module was split off from its init for state/behaviour separation but it's still in need of refactoring.
Put the base and the individual builders in separate modules: in fact make `build.py` a subpackage `tree_builders`.
Remember to chase down any references in the rest of the library.
Ensure the `builder.__init__` module namespace is preserved.
The tree builders appear to have no methods in common. This is a little surprising. It may make more sense after reading them in more detail.
Perhaps they can be split out into subpackages to make the function of each component clearer (or perhaps this would just create sprawl and no insight).
There is also a `SAXTreeBuilder` class marked as non-operational:
A Campbells treebuilder that listens for SAX events.
This is not currently used for anything, but it demonstrates how a simple TreeBuilder would work.
Delete this or consider how it could be generalised to a different interface (perhaps as the basis of `bisque`).
At the end of `formatter.py`, aliases are set using the `REGISTRY` class var and the `__init__` methods of the `Formatter` subclasses `HTMLFormatter` and `XMLFormatter`:
campbells/src/campbells/formatter.py
Lines 160 to 199 in bf269f7
This is not a great pattern; instead, use a classmethod to make it clearer what is in common (this section is hard to read).
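One way the classmethod approach might look is sketched below. This is an illustration under assumptions, not the code in `formatter.py`: the `register` classmethod, the `escape_markup` helper, the constructor signature, and the alias names are all hypothetical, and only the `REGISTRY` class variable and the three class names come from the description above.

```python
from __future__ import annotations

from typing import Callable, Optional


def escape_markup(text: str) -> str:
    """Hypothetical substitution function, standing in for the real one."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


class Formatter:
    # Each subclass declares its own REGISTRY so aliases are not shared.
    REGISTRY: dict[Optional[str], "Formatter"] = {}

    def __init__(self, entity_substitution: Optional[Callable[[str], str]] = None):
        self.entity_substitution = entity_substitution

    @classmethod
    def register(cls, alias: Optional[str], **kwargs) -> "Formatter":
        """Build an instance of `cls` and store it under `alias` (assumed API)."""
        formatter = cls(**kwargs)
        cls.REGISTRY[alias] = formatter
        return formatter


class HTMLFormatter(Formatter):
    REGISTRY: dict[Optional[str], Formatter] = {}


class XMLFormatter(Formatter):
    REGISTRY: dict[Optional[str], Formatter] = {}


# The aliases that both subclasses have in common are now stated once.
for formatter_cls in (HTMLFormatter, XMLFormatter):
    formatter_cls.register("minimal", entity_substitution=escape_markup)
    formatter_cls.register(None, entity_substitution=None)
```

Looping over the subclasses makes the shared aliases explicit, and any alias that only one subclass needs can still be registered with a single extra `register` call.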
I checked and `mvdef` used `setup.py` to register a console script entrypoint; `magic-scrape` shows how to do it in `pyproject.toml`:

```toml
[project.scripts]
magicscrape = "magic_scrape.cli:main"
```
The `campbells.main` module has a leftover part that was moved out of `__init__.py` (which would be invoked if you called the init module as a script directly). Expose it via a console script entrypoint.
campbells/src/campbells/main.py
Lines 858 to 861 in 472e322
The `campbells.element` module is far larger than the rest (2528 lines), effectively a subpackage declaring elements.
This seems like a reasonable case for modularisation, probably in groups (i.e. not one module per element class).
There are 2 implemented `_all_strings` methods, and in one of them the `len(value) > 0` check occurs only when `strip` is True.
This is `NavigableString`:

```python
value = self
if strip:
    value = value.strip()
if len(value) > 0:  # checked whether or not strip is set
    yield value
```
This is `Tag` (where multiple descendants are looped over):

```python
# Excerpt from inside the loop over descendants
if strip:
    descendant = descendant.strip()
    if len(descendant) == 0:  # only checked when strip is set
        continue
yield descendant
```
Is this a bug?
If so, it would give inconsistent handling between `NavigableString` and `Tag`.
Potentially should be reported upstream.
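The difference can be reproduced with two standalone functions that mirror the branches quoted above. This is a minimal sketch: `navigable_all_strings` and `tag_all_strings` are hypothetical stand-ins rather than the library's methods, and plain `str` values stand in for `NavigableString` instances.

```python
from typing import Iterable, Iterator


def navigable_all_strings(value: str, strip: bool = False) -> Iterator[str]:
    """Mirrors the NavigableString branch: length is checked unconditionally."""
    if strip:
        value = value.strip()
    if len(value) > 0:
        yield value


def tag_all_strings(descendants: Iterable[str], strip: bool = False) -> Iterator[str]:
    """Mirrors the Tag branch: length is checked only when strip is set."""
    for descendant in descendants:
        if strip:
            descendant = descendant.strip()
            if len(descendant) == 0:
                continue
        yield descendant


# An empty string with strip=False is the case where the two disagree.
print(list(navigable_all_strings("", strip=False)))  # []
print(list(tag_all_strings([""], strip=False)))      # ['']
```

With `strip=True` both versions drop empty and whitespace-only strings, and with `strip=False` both keep whitespace-only strings, so the inconsistency is limited to a genuinely empty string when `strip` is False.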
The `campbells.dammit` module is the 2nd largest (1131 lines), effectively a subpackage handling unicode encodings.
This seems like a reasonable case for modularisation, probably one module per class.
Similar task to #3
I don't think there are any type annotations in this project.
Running mypy from the env managed by `pdm` gives 18 messages (14 errors in 7 files):
src/campbells/builder/_html5lib.py:
8: error: Skipping analyzing "html5lib": module is installed, but missing library stubs or py.typed marker
9: error: Skipping analyzing "html5lib.constants": module is installed, but missing library stubs or py.typed marker
9: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
34: error: Skipping analyzing "html5lib.treebuilders": module is installed, but missing library stubs or py.typed marker
src/campbells/builder/_lxml.py:
4: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker
src/campbells/tests/__init__.py:
44: error: Skipping analyzing "lxml.etree": module is installed, but missing library stubs or py.typed marker
44: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker
src/campbells/tests/fuzz_test.py:
18: error: Skipping analyzing "html5lib": module is installed, but missing library stubs or py.typed marker
19: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker
src/campbells/dammit.py:
28: error: Cannot find implementation or library stub for module named "cchardet"
33: error: Library stubs not installed for "chardet" (or incompatible with Python 3.10)
33: note: Hint: "python3 -m pip install types-chardet"
33: note: (or run "mypy --install-types" to install all missing stub packages)
37: error: Cannot find implementation or library stub for module named "charset_normalizer"
src/campbells/diagnose.py:
44: error: Skipping analyzing "lxml": module is installed, but missing library stubs or py.typed marker
52: error: Skipping analyzing "html5lib": module is installed, but missing library stubs or py.typed marker
/home/louis/miniconda3/envs/campbells/lib/python3.10/site-packages/chinois/css_match.py:554: error: invalid syntax
Found 14 errors in 7 files (errors prevented further checking)
The tree builders are all named in different ways, and some names are used for features too. This is messy: what's needed is to sit down and study where these names actually go (i.e. what the impact would be of changing them to standardise). The overall data flow is ad hoc because these primitive string types are used directly at the site of use, when they are clearly parts of a centralisable object (for now I've just put what I can into a module named `parser_names`).
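As a rough idea of what a centralised object could look like, the sketch below replaces bare strings with a string-valued enum. Everything here is an assumption for illustration: the `ParserName` class, its members, the string values, and the `parse` helper are hypothetical and are not taken from the `parser_names` module.

```python
from enum import Enum


class ParserName(str, Enum):
    """Hypothetical single source of truth for parser/builder names."""

    HTML_PARSER = "html.parser"
    LXML = "lxml"
    LXML_XML = "lxml-xml"
    HTML5LIB = "html5lib"

    @classmethod
    def parse(cls, raw: str) -> "ParserName":
        """Normalise a user-supplied name; raises ValueError if unknown."""
        return cls(raw.strip().lower())


# Because the enum subclasses str, existing string comparisons keep working.
name = ParserName.parse(" LXML ")
print(name is ParserName.LXML)  # True
print(name == "lxml")           # True
```

Keeping the `str` mixin means call sites that still compare against literal strings would continue to work while references are migrated, and unknown names fail in one place instead of at each site of use.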