python-libzim's People

Contributors

afreydev, coderpaddy, fledgexu, imaybeabitshy, jc-louis, jdcaballerov, kelson42, legoktm, mgautierfr, pirate, rgaudin, yelboudouri

python-libzim's Issues

Added metadata string is wrong and not parsed correctly in kiwix-serve

  • There's an unsafe use of memory: if you reuse the same variable for different articles, the article count of the ZIM files they are added to gets messed up.
creator = pyzim.ZimCreator("z1.zim")
article = pyzim.ZimArticle(url="1", title="Page 1", content="<html><head></head><body><h1>Welcome to page 1</h1></body></html>")
creator.add_article(article)
creator.finalise()
# T:0; A:4; RA:0; CA:4; UA:0; FA:0; IA:4; C:0; CC:0; UC:0; WC:3
# T:0; Generate cluster offsets
# T:0; ResolveRedirectIndexes
# Resolve redirect
# T:0; Set article indexes
# set index
# T:0; Resolve mimetype
# T:0; create title index
# T:0; 6 title index created
# T:0; 2 clusters created
# T:0; fill header
# T:0; write zimfile :
# T:0;  write header
# T:0;  write mimetype list
# T:0;  write url prt list
# T:0;  write title index
# T:0;  write directory entries
# T:0;  write cluster offset list
# T:0;  write cluster data
# T:0;  write checksum
# T:0; finish
reader = pyzim.ZimReader("z1.zim")
reader.get_article_count()  # 6
reader.get_namespaces_count("A")  # 1 (OK)

creator2 = pyzim.ZimCreator("z2.zim")
article = pyzim.ZimArticle(url="2", title="Page 2", content="<html><head></head><body><h1>Welcome to page 2</h1></body></html>")
creator2.add_article(article)
creator2.finalise()
# T:0; A:4; RA:0; CA:4; UA:0; FA:0; IA:4; C:0; CC:0; UC:0; WC:1
# T:0; Generate cluster offsets
# T:0; ResolveRedirectIndexes
# Resolve redirect
# T:0; Set article indexes
# set index
# T:0; Resolve mimetype
# T:0; create title index
# T:0; 6 title index created
# T:0; 2 clusters created
# T:0; fill header
# T:0; write zimfile :
# T:0;  write header
# T:0;  write mimetype list
# T:0;  write url prt list
# T:0;  write title index
# T:0;  write directory entries
# T:0;  write cluster offset list
# T:0;  write cluster data
# T:0;  write checksum
# T:0; finish
reader2 = pyzim.ZimReader("z2.zim")
reader2.get_article_count()  # 6
reader2.get_namespaces_count("A")  # 1 (OK)

Now add both to kiwix-serve and you'll see that z1 has 1 article while z2 has 2 articles. If we were to reuse that same article variable many times, we'd see the counter increase by that number.

I don't completely understand what kiwix-serve is. Can you give me the whole example?

Make ZIM creator compression algorithm configurable

The latest git master of libzim allows specifying the compression algorithm as an optional argument to the ZIM creator constructor. This is necessary to start experimenting with zstd compression. python-libzim should be adapted so that Python users can use this new feature.

Run CI on nightly version of libzim.

We are running the CI using the official release of libzim.

I don't remember what the rationale behind this was, but I remember there was one :)

However, for big changes like libzim_next, we will probably not release libzim before being sure that the projects using it work correctly. This means that we need to be able to run the CI on the nightly build instead of the official release of libzim.

Maybe we can move the test workflow to the nightly build (but we need a stable URL to download it from) and keep the release workflow on the official release of libzim?

This is needed for the PR #82

Duplicate article's title is indexed

When trying to add an article with the same URL as an existing article, libzim issues a warning and doesn't add the article.
Apparently, when doing so, the duplicate article's title is still added to the suggestion index.

Might be a libzim issue though…

def test_title(reader, title):
    nb = reader.get_suggestions_results_count(title)
    res = list(reader.suggest(title))
    print(title, "--", nb, len(res), res)


fpath = pathlib.Path("test.zim")

with Creator(fpath, "welcome", "fra") as creator:
    creator.add_article("welcome", title="Home", content="hello")
    creator.add_article("welcome", title="Maison", content="bonjour")

with libzim.reader.File(fpath) as reader:
    print("nb article", reader.article_count)
    test_title(reader, "Home")
    test_title(reader, "Maison")
    print(reader.get_article("A/welcome"))
Impossible to add A/welcome
  dirent's title to add is : Maison
  existing dirent's title is : Home
T:0; A:5; RA:0; CA:5; UA:0; FA:0; IA:2; C:0; CC:0; UC:0; WC:2
T:0; Waiting for workers
T:0; ResolveRedirectIndexes
Resolve redirect
T:0; Set article indexes
set index
T:0; Resolve mimetype
T:0; create title index
T:0; 6 title index created
T:0; 2 clusters created
T:0; write zimfile :
T:0;  write mimetype list
T:0;  write directory entries
T:0;  write url prt list
T:0;  write title index
T:0;  write cluster offset list
T:0;  write header
T:0;  write checksum
T:0; rename tmpfile to final one.
T:0; finish
nb article 6
Home -- 2 1 ['A/welcome']
Maison -- 2 1 ['A/welcome']
ReadArticle(url=A/welcome, title=Home)

Get the number of results (from `get_matches_estimated()`)

  • Shouldn't we get a generator from reader.search()? Does libzim return a list as-is? It's common to have thousands of results on large ZIM files.

I did the same as node-libzim: getting 10 results and passing back a string vector. File.h returns unique pointers to the Search class.

 std::unique_ptr<Search> search(const std::string& query, int start, int end) const;
 std::unique_ptr<Search> suggestions(const std::string& query, int start, int end) const;

Ah OK, I see. That's unfortunate. I didn't realize you had hardcoded that. We should definitely be able to set those. But we'd need the number of results (from get_matches_estimated()) as well. How do you plan to address that?
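One possible shape for this, as a rough sketch only: a generator that pages through results lazily using a start/end window and an estimated total. The names `iter_results` and `fake_fetch` are hypothetical and nothing here calls libzim; `fake_fetch` stands in for whatever would wrap the C++ search(query, start, end) call.

```python
import itertools


def iter_results(fetch_page, estimated_total, page_size=10):
    """Yield results lazily, fetching one (start, end) window at a time."""
    for start in range(0, estimated_total, page_size):
        end = min(start + page_size, estimated_total)
        page = fetch_page(start, end)
        if not page:
            # The estimate was too high; stop instead of looping on empty pages.
            return
        yield from page


# Stand-in data source (hypothetical), recording which windows were requested.
urls = [f"A/page{i}" for i in range(25)]
calls = []


def fake_fetch(start, end):
    calls.append((start, end))
    return urls[start:end]


# Only the first window is fetched when the caller needs three results.
first_three = list(itertools.islice(iter_results(fake_fetch, len(urls)), 3))
print(first_three, calls)
```

With this shape the caller decides how many results to consume, and the estimated count only bounds the number of windows requested.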

Should we inform about index presence?

Would it be interesting to add properties to reader.File to inform the user about the presence or absence of xapian indexes?

I doubt users would be looking for X/fulltext/xapian and X/title/xapian by themselves…

finalize must be always called.

There are two issues here:

  • One in libzim, which assumes that the finalize method will be called before the creator is deleted.
  • test_libzim.py does not call the finalize method.

For the second point, the more pythonic way to solve this is to use a context manager to force the call to finalize in the __exit__ method.
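A minimal sketch of that idea, using a stand-in class (`FakeCreator` is hypothetical and is not the python-libzim wrapper) so that it runs without libzim installed. It only illustrates `__exit__` calling `finalize()` on the way out of the `with` block:

```python
class FakeCreator:
    """Stand-in for a ZIM creator; hypothetical, for illustration only."""

    def __init__(self, filename):
        self.filename = filename
        self.articles = []
        self.finalized = False

    def add_article(self, article):
        if self.finalized:
            raise RuntimeError("creator already finalized")
        self.articles.append(article)

    def finalize(self):
        # A real wrapper would flush and close the ZIM file here.
        self.finalized = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the with-block always finalizes, so callers cannot forget.
        self.finalize()
        return False  # never swallow exceptions raised inside the block


with FakeCreator("test.zim") as creator:
    creator.add_article("page 1")

print(creator.finalized)
```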

Use of get_filename() is broken

libzim allows the creation of an article from a file using getFilename(), which is replicated here as get_filename().

This is not directly usable here, as getSize() assumes that the user implements getData() (get_data() here), which renders the getFilename mode useless.

zim::size_type
ZimArticleWrapper::getSize() const
{
    return this->getData().size();
}

better docstrings

We discussed that in the old PR, but we should be able to get the constructor's signature from help(libzim.ZimCreator) or help(libzim.ZimCreator.__init__)

If there's no cython way to do it (no idea), maybe we could add that to the class docstring?

Also, I think the wording in the finalize() docstring is misleading. It gives the impression that articles are not written (thus kept in memory?) until finalize is called…

finalize(...)
    finalize and write added articles to the file.

    Raises
    ------
        RuntimeError
            If the ZimCreator was already finalized

Libzim for Termux app in android

I can't install libzim in the Termux app (aarch64).
I have already installed the prerequisites and dependencies.
Could you tell me the required versions of the dependencies, or how to install it manually from source?

Please help!

get_article shouldn't raise RuntimeError

Currently, ReadArticle.get_article() raises a RuntimeError should the URL be incorrect. I believe we should either:

  • Add a .has_article(url) method that returns a boolean indicating whether the article exists
  • Raise a custom, usable Exception like ArticleNotFound or something.
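Both options could look roughly like this. The sketch uses a dict-backed stand-in (`ToyReader` is hypothetical and is not the libzim reader), and `ArticleNotFound` is just the name floated above, not an existing exception:

```python
class ArticleNotFound(KeyError):
    """Raised when no article exists at the requested URL (hypothetical name)."""


class ToyReader:
    """Dict-backed stand-in for a ZIM reader; not the libzim API."""

    def __init__(self, articles):
        self._articles = dict(articles)

    def has_article(self, url):
        # Option 1: let callers test for existence without catching anything.
        return url in self._articles

    def get_article(self, url):
        # Option 2: raise a dedicated exception instead of a bare RuntimeError.
        try:
            return self._articles[url]
        except KeyError:
            raise ArticleNotFound(url) from None


reader = ToyReader({"A/index.html": "hello"})
print(reader.has_article("A/index.html"), reader.has_article("A/missing"))
```

Subclassing KeyError is one design choice among several; it lets existing `except KeyError` handlers keep working while still allowing a precise `except ArticleNotFound`.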

rename get_namespaces_count

File.get_namespaces_count has the following signature:

def get_namespaces_count(self, str ns) -> int:

It returns a single int representing the number of articles in the namespace ns.

I suggest we rename it to get_namespace_count() (singular).

ReadArticle.content should return bytes

Currently it returns a memoryview.

I don't think we'll use the memoryview for anything but getting bytes out of it, so I think we should return bytes. Should the need for accessing the memoryview arise, maybe we can provide another way to access it?

@mgautierfr what do you think? Is it worth returning a memoryview at the expense of convenience? Did you have use cases in mind for this?
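For reference, the trade-off in plain Python terms, independent of libzim: a memoryview is a zero-copy window onto a buffer, while bytes() makes an independent copy. The `payload` buffer below is only a stand-in for article content.

```python
payload = bytearray(b"<html><body>hello</body></html>")

view = memoryview(payload)   # zero-copy window onto the underlying buffer
as_bytes = bytes(view)       # one explicit copy into an immutable bytes object

head = view[:6]              # slicing a memoryview does not copy
payload[1] = ord("H")        # in-place edit of the underlying buffer

print(bytes(head))           # reflects the edit made through payload
print(as_bytes[:6])          # the earlier copy is unaffected
```

So returning bytes costs one copy per access but gives callers a familiar, immutable object, while a memoryview avoids the copy at the price of being tied to the lifetime of the underlying buffer.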

Writing part must be in a submodule `writer`

From the user point of view :

  • zim module contains things to read a zim
  • zim.writer module contains things to create a zim.

Class names must not include the Zim :

  • ZimArticle -> zim.writer.Article
  • ZimCreator -> zim.writer.Creator
  • ZimFileReader -> zim.File, zim.Reader or zim.FileReader (I'm not sure what is best)
  • zimFileArticle -> zim.Article
  • ...

Warnings are a bit annoying if libzim is installed system-wide

I'm working on the Debian packaging for python-libzim. It builds against libzim-dev, which is installed to /usr/include/zim. The build process finds it and builds properly (yay!), but it constantly prints annoying warnings like:

[!] Warning: Couldn't find zim/*.h in ./include!
    Hint: You can install them from source from https://github.com/openzim/libzim
          or download a prebuilt release's headers into ./include/zim/*.h
          (or set CFLAGS='-I<library_path>/include')
[!] Warning: Couldn't find libzim.so in ./lib or system library paths!
    Hint: You can install it from source from https://github.com/openzim/libzim
          or download a prebuilt libzim.so release into ./lib.
          (or set LDFLAGS='-L<library_path>/lib/[x86_64-linux-gnu]')

Could we suppress the warnings if CFLAGS/LDFLAGS are set in the environment?
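A possible shape for that check, as a sketch only: the function name `pending_warnings` and its arguments are hypothetical and do not come from the project's setup.py. The assumption is that a set CFLAGS silences the header warning and a set LDFLAGS silences the library warning.

```python
import os


def pending_warnings(headers_found, library_found, environ=None):
    """Return which warnings should still be printed.

    A warning is skipped when the corresponding flag variable is set,
    on the assumption that the packager has pointed the build elsewhere.
    """
    environ = os.environ if environ is None else environ
    warnings = []
    if not headers_found and not environ.get("CFLAGS"):
        warnings.append("headers")
    if not library_found and not environ.get("LDFLAGS"):
        warnings.append("library")
    return warnings


# Nothing found locally, but the packager exported both variables: stay quiet.
print(pending_warnings(False, False, {"CFLAGS": "-I/usr/include", "LDFLAGS": "-L/usr/lib"}))
```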

Setup cross-compiling CI/CD to make binary distributions for manylinux, macOS, and Windows available

Currently the release.yml GitHub Action relies on a binary build of libzim.so that is only available for x86_64-linux-gnu, and so python-libzim can only be made available on PyPI as a bdist (binary release) for that platform.

In order to support all the Linux distros, macOS, and Windows seamlessly with prebuilt bdists, the most modern recommended way that I could find for Cython packages (as of 2020-04) is something like this:

multi_release.yml:

on:
  release:
    types: [published]
    tags:
      - v*

manylinux-release-wheel:
  name: Build release wheels for manylinux2010
  runs-on: ubuntu-18.04
  steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: 3.6
    - name: Build manylinux2010 wheels
      run: |
        # bdist_wheel gets built inside manylinux docker container to maximize compatibility
        docker run -e manylinux2010 bash -c '
          # e.g. something like this, this could be a script instead though
          python3 -m pip install cython wheel setuptools
          python3 setup.py build_ext
          python3 setup.py bdist_wheel
        '

        # Then wheels are further cleaned up in a 2nd step to make sure they link correctly
        python3 -m pip install -U auditwheel
        for f in artifacts/*.whl; do
          auditwheel repair --plat manylinux2010_x86_64 $f
        done
        ls -al wheelhouse/

    - uses: actions/upload-artifact@v1
      with:
        name: ${{ runner.os }}-wheels
        path: wheelhouse

macos-release-wheel:
  name: Build release wheels for macOS
  runs-on: macos-latest
  strategy:
    matrix:
      python-version: ['3.6', '3.7', '3.8']
  steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}

    - name: Build macOS wheels
      if: github.event_name == 'release'
      run: |
        python3 --version
        python3 -m pip install cython wheel setuptools
        python3 setup.py build_ext
        python3 setup.py bdist_wheel
        
        # Do any needed wheel cleanup here (delocate-wheel is similar to auditwheel)
        for f in bdist/*.whl; do
          delocate-wheel -w wheelhouse $f
        done

    - uses: actions/upload-artifact@v1
      with:
        name: ${{ runner.os }}-wheels
        path: wheelhouse

windows-release-wheel:
  name: Build release wheels for Windows
  runs-on: windows-latest
  strategy:
    matrix:
      python-version: ['3.6', '3.7', '3.8']
  steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v1
      with:
        python-version: ${{ matrix.python-version }}
    
    - name: Build Windows wheels
      shell: bash
      if: github.event_name == 'release'
      run: |
        python -m pip install cython wheel setuptools
        python3 setup.py build_ext
        python3 setup.py bdist_wheel

        # Do any needed wheel cleanup here e.g. auditwheel equivalent for Windows

    - uses: actions/upload-artifact@v1
      with:
        name: ${{ runner.os }}-wheels
        path: artifacts

upload-wheels:
  name: Publish wheels to PyPi
  needs: [manylinux-release-wheel, macos-release-wheel, windows-release-wheel]
  runs-on: ubuntu-18.04
  steps:
    - uses: actions/download-artifact@v1
      with:
        name: Linux-wheels
        path: Linux-wheels
    - uses: actions/download-artifact@v1
      with:
        name: macOS-wheels
        path: macOS-wheels
    - uses: actions/download-artifact@v1
      with:
        name: Windows-wheels
        path: Windows-wheels
    - run: |
        set -e -x
        mkdir -p dist
        cp Linux-wheels/*.whl dist/
        cp macOS-wheels/*.whl dist/
        cp Windows-wheels/*.whl dist/
        ls -la dist/
        sha256sum dist/*.whl

    - uses: pypa/gh-action-pypi-publish@master
      if: github.event_name == 'release'
      with:
        user: __token__
        password: ${{ secrets.pypi_token }}

wildcard search or an iterator over all pages

More and more people are asking how to process Wikipedia. It seems to me that .zim files are the best way to go.

What is missing is a way to go through the whole ZIM file. What I would expect is something along the lines of the following:

for page in File("my.zim"):
    print(page.title)

Is it possible?
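One way such iteration could be layered on top of index-based access, sketched with a stand-in class: `ToyFile`, `Page` and `get_article_by_id` are hypothetical names used for illustration, and nothing here reads a real ZIM file.

```python
from collections import namedtuple

Page = namedtuple("Page", ["url", "title"])


class ToyFile:
    """List-backed stand-in for a ZIM file reader; not the libzim API."""

    def __init__(self, pages):
        self._pages = list(pages)

    def __len__(self):
        return len(self._pages)

    def get_article_by_id(self, index):
        # Hypothetical index-based accessor.
        return self._pages[index]

    def __iter__(self):
        # Lazily walk every entry by index, one at a time.
        for index in range(len(self)):
            yield self.get_article_by_id(index)


zim = ToyFile([Page("A/home", "Home"), Page("A/about", "About")])
for page in zim:
    print(page.title)
```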

don't require min_chunk_size on Creator

Currently, libzim.writer.Creator requires all its arguments to be passed:

def __init__(self, filename, main_page, index_language, min_chunk_size):

While it's understandable for filename and main_page, and debatable for index_language, min_chunk_size is definitely an advanced feature and should receive a default value.
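A signature sketch of the proposal. The default of 2048 is an assumption, borrowed from the value used in other examples in these issues, and the class below only stores its arguments; it is not the real libzim.writer.Creator.

```python
class Creator:
    """Signature sketch only; it stores its arguments and creates no file."""

    def __init__(self, filename, main_page, index_language, min_chunk_size=2048):
        self.filename = filename
        self.main_page = main_page
        self.index_language = index_language
        # Advanced tuning knob: most callers would never need to pass it.
        self.min_chunk_size = min_chunk_size


basic = Creator("test.zim", "A/index.html", "eng")
tuned = Creator("test.zim", "A/index.html", "eng", min_chunk_size=4096)
print(basic.min_chunk_size, tuned.min_chunk_size)
```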

Define Metadata

You chose to Pascalize meta key names. The documentation doesn't require it. It just happens to be the case (ahah) for currently defined ones but as stated there, it's extendable. @kelson42 ?
Related: it seems that get_metadata only returns those predefined keys while ZimCreator allows setting any. Confusing.
Yes, that's correct; maybe we can keep only the mandatory metadata there. Any metadata can be accessed by getting an article by its URL.

Yes, let's harmonize to only manage official ones this way and have the article mechanism for the rest, if @kelson42 and @mgautierfr agree.

finalize() segfaults if missing metadata

To reproduce:

zim_creator = ZimCreator(get_zim_name())
zim_creator.add_article(ZimTestArticle())
zim_creator.finalize()
Traceback (most recent call last):
  File "libzim/libzim.pyx", line 71, in libzim.ZimArticle._get_data
Segmentation fault

It also segfaults when all metadata but one are updated.

ZIM with no article leaves temp files behind

with libzim.writer.Creator("test_x05.zim", main_page="A/index.html", index_language="fra", min_chunk_size=2048) as zim:
    pass

T:7; A:3; RA:0; CA:3; UA:0; FA:0; IA:0; C:0; CC:0; UC:0; WC:0
T:7; Waiting for workers
T:7; ResolveRedirectIndexes
Resolve redirect
T:7; Set article indexes
set index
T:7; Resolve mimetype
T:7; create title index
T:7; 5 title index created
T:7; 2 clusters created
T:7; write zimfile :
T:7;  write mimetype list
T:7;  write directory entries
T:7;  write url prt list
T:7;  write title index
T:7;  write cluster offset list
T:7;  write header
T:7;  write checksum
T:7; rename tmpfile to final one.
T:7; finish

and on the filesystem:

.rw-r--r--   24k reg  staff 10 Jun 13:33  test_x05.idx
drwxr-xr-x     - reg  staff 10 Jun 13:33  test_x05.idx.tmp
.rw-r--r--   50k reg  staff 10 Jun 13:33  test_x05.zim
.rw-r--r--   24k reg  staff 10 Jun 13:33  test_x05_title.idx
drwxr-xr-x     - reg  staff 10 Jun 13:33  test_x05_title.idx.tmp

The ZIM file works (albeit without any actual article), but the xapian files are not removed.

Cannot create a ZIM without an index

At the moment, we cannot create a ZIM file without creating an index.

I'm not sure about the use cases for this (as we can opt out individually on all articles), but it's an option in mwoffliner, so I'm asking. libzim2 plans great improvements to indexes, but I don't think it makes them mandatory.

@mgautierfr @kelson42 ?

unknown mime type code 65535

I am trying to read the Simple English ZIM file and I get the following error:

$ python babelia-zim2wet.py 
Traceback (most recent call last):
  File "babelia-zim2wet.py", line 25, in <module>
    if article.mimetype != "text/html":
  File "libzim/wrapper.pyx", line 299, in libzim.wrapper.ReadArticle.mimetype.__get__
RuntimeError: unknown mime type code 65535

Versions:

% python --version
Python 3.8.5
% pip install libzim
Requirement already satisfied: libzim in /home/amirouche/.local/share/virtualenvs/arew-KWAEN1E-/lib/python3.8/site-packages (0.0.3.post0)

Exception inside contextmanager should cancel the zim creation

When using the context-manager, should a (non libzim) error occur, the exception is raised but the finalization is done on the Creator as if everything went well.

with libzim.writer.Creator(
    "test_x07.zim", main_page="A/index.html", index_language="eng", min_chunk_size=2048,
) as zfile:
    zfile.add_article(DumbArticle("index.html", "hello", ARTICLE_MIME, "bonjour"))
    raise Exception("outch")
    zfile.add_article(DumbArticle("page2.html", "hello2", ARTICLE_MIME, "bonjour2"))

T:0; A:4; RA:0; CA:4; UA:0; FA:0; IA:1; C:0; CC:0; UC:0; WC:1
T:0; Waiting for workers
T:0; ResolveRedirectIndexes
Resolve redirect
T:0; Set article indexes
set index
T:0; Resolve mimetype
T:0; create title index
T:0; 6 title index created
T:0; 2 clusters created
T:0; write zimfile :
T:0;  write mimetype list
T:0;  write directory entries
T:0;  write url prt list
T:0;  write title index
T:0;  write cluster offset list
T:0;  write header
T:0;  write checksum
T:0; rename tmpfile to final one.
T:0; finish
Traceback (most recent call last):
  File "./demo.py", line 55, in <module>
    raise Exception("outch")
Exception: outch

This results in a valid ZIM file on the filesystem but lacking the second article of course.

I think the expected behavior would be to cancel the ZIM creation and remove temporary files.

@mgautierfr @kelson42 ?
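The expected behavior could be sketched like this, using a stand-in class that writes to a temporary file and only renames it on success. `ToyCreator` and its `.tmp` naming are hypothetical; the real cleanup would have to happen inside libzim or the wrapper.

```python
import os
import tempfile


class ToyCreator:
    """Stand-in creator that cancels its output when the block raises."""

    def __init__(self, path):
        self.path = path
        self.tmp_path = path + ".tmp"
        self.articles = []

    def __enter__(self):
        # Work on a temporary file until the block completes.
        with open(self.tmp_path, "w"):
            pass
        return self

    def add_article(self, article):
        self.articles.append(article)

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None:
            # Something failed inside the block: discard the partial output.
            os.remove(self.tmp_path)
            return False  # let the original exception propagate
        # Normal exit: publish the finished file under its final name.
        os.replace(self.tmp_path, self.path)
        return False


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "test_x07.zim")
try:
    with ToyCreator(target) as zfile:
        zfile.add_article("index.html")
        raise Exception("outch")
except Exception:
    pass

leftovers = os.listdir(workdir)
print(leftovers)
```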

python-libzim 0.0.3 segfaults when used with an updated libzim 6.3.0

In Debian, python-libzim was initially built against libzim 6.2.2. Yesterday I updated libzim to 6.3.0, with no rebuild for python-libzim, and now the python-libzim test suite segfaults:

root@16589ef33bfe:/srv/python-libzim/tests# python3 -m pytest
=============================================== test session starts ===============================================
platform linux -- Python 3.8.6, pytest-4.6.11, py-1.9.0, pluggy-0.13.0
rootdir: /srv/python-libzim
collected 20 items                                                                                                

test_libzim.py Segmentation fault (core dumped)

https://gist.github.com/legoktm/ef86b76242e853d53826d4e885391c03 is the beginning of the backtrace. I can provide full reproduction steps for a docker container if that's needed.

If I rebuild python-libzim against libzim 6.3.0, then it's all fine. But I would expect that, because there was no ABI bump in libzim, updating it shouldn't cause any problems for python-libzim. Is that assumption wrong?

Creator doesn't set main_page properly

While Creator requires the main_page argument, it is not set properly on the output ZIM, resulting in the main_page…

with libzim.writer.Creator(
    "test_x01.zim",
    main_page="A/index.html",
    index_language="eng",
    min_chunk_size=2048,
) as zfile:
    zfile.add_article(DumbArticle("index.html", "hello", ARTICLE_MIME, "bonjour"))
    zfile.add_article(DumbArticle("allo.html", "allo", ARTICLE_MIME, "allo how low"))

zfile = libzim.reader.File("test_x01.zim")
print(zfile.main_page_url)
> A/allo.html
print(zfile.get_article("A/index.html"))
> ReadArticle(url=A/index.html, title=)
print(zfile.get_article(zfile.main_page_url))
> ReadArticle(url=A/allo.html, title=)

get_suggestions_results_count result is incorrect

The reader provides the suggestion feature through two methods:

  • get_suggestions_results_count(query) which returns the number of articles for a string
  • suggest(query, start=0, end=10) which returns a generator of URL strings for a query string and optional start/end limits (0 and 10 by default).

--

  • suggest() correctly returns only non-redirect articles.
  • get_suggestions_results_count returns the sum of both regular and redirect articles. This number can thus differ from what suggest() returns and is unusable.
def test_title(reader, title):
    nb = reader.get_suggestions_results_count(title)
    res = list(reader.suggest(title))
    print(title, "--", nb, len(res), res)


fpath = pathlib.Path("test.zim")

with Creator(fpath, "home", "fra") as creator:
    creator.add_article("home", title="Original", content="hello")
    creator.add_redirect("A/home2", "A/home", "Something")
    creator.add_redirect("A/home3", "A/home", "Something2")
    creator.add_redirect("A/home4", "A/home", "Else")
    creator.add_article("lalala", title="Lalala", content="hello again")

with libzim.reader.File(fpath) as reader:
    print("nb article", reader.article_count)
    test_title(reader, "Original")
    test_title(reader, "Something")
    test_title(reader, "Else")
    test_title(reader, "Lala")
nb article 10
Original -- 2 1 ['A/home']
Something -- 3 2 ['A/home2', 'A/home3']
Else -- 2 1 ['A/home4']
Lala -- 1 1 ['A/lalala']

python-libzim not available on dockerhub

The following command, found on the pypi.org page, fails:

% docker run --rm -it openzim:python-libzim
Unable to find image 'openzim:python-libzim' locally
docker: Error response from daemon: pull access denied for openzim, repository does not exist or may require 'docker login': denied: requested access to the resource is denied.
See 'docker run --help'.
%

Apparently the image is not public or not available.

Cython compilation error with python 3.9.0+ (on ubuntu 20.04)

% pip --version
pip 20.1.1 from /home/amirouche/.local/share/virtualenvs/tmp-kapugSIt/lib/python3.9/site-packages/pip (python 3.9)
% python --version
Python 3.9.0+
% pip install libzim
Collecting libzim
  Using cached libzim-0.0.3.post0.tar.gz (103 kB)
  Installing build dependencies ... error

stdout ends with:

      copying Cython/Utility/CppSupport.cpp -> build/lib.linux-x86_64-3.9/Cython/Utility
      warning: build_py: byte-compiling is disabled, skipping.
  
      running build_ext
      building 'Cython.Plex.Scanners' extension
      creating build/temp.linux-x86_64-3.9
      creating build/temp.linux-x86_64-3.9/tmp
      creating build/temp.linux-x86_64-3.9/tmp/pip-install-g6_5_8w6
      creating build/temp.linux-x86_64-3.9/tmp/pip-install-g6_5_8w6/cython
      creating build/temp.linux-x86_64-3.9/tmp/pip-install-g6_5_8w6/cython/Cython
      creating build/temp.linux-x86_64-3.9/tmp/pip-install-g6_5_8w6/cython/Cython/Plex
      x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/amirouche/.local/share/virtualenvs/tmp-kapugSIt/include -I/usr/include/python3.9 -c /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c -o build/temp.linux-x86_64-3.9/tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.o
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c: In function ‘__pyx_f_6Cython_4Plex_8Scanners_7Scanner_run_machine_inlined’:
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:3529:11: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       3529 |           __pyx_t_8 = (__pyx_v_data != Py_None)&&(__Pyx_PyUnicode_IS_TRUE(__pyx_v_data) != 0);
            |           ^~~~~~~~~
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:3529:11: warning: ‘PyUnicode_AsUnicode’ is deprecated [-Wdeprecated-declarations]
       3529 |           __pyx_t_8 = (__pyx_v_data != Py_None)&&(__Pyx_PyUnicode_IS_TRUE(__pyx_v_data) != 0);
            |           ^~~~~~~~~
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:580:45: note: declared here
        580 | Py_DEPRECATED(3.3) PyAPI_FUNC(Py_UNICODE *) PyUnicode_AsUnicode(
            |                                             ^~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:3529:11: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       3529 |           __pyx_t_8 = (__pyx_v_data != Py_None)&&(__Pyx_PyUnicode_IS_TRUE(__pyx_v_data) != 0);
            |           ^~~~~~~~~
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c: In function ‘__Pyx_modinit_type_init_code’:
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7334:45: error: ‘PyTypeObject’ {aka ‘struct _typeobject’} has no member named ‘tp_print’
       7334 |   __pyx_type_6Cython_4Plex_8Scanners_Scanner.tp_print = 0;
            |                                             ^
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c: In function ‘__Pyx_ParseOptionalKeywords’:
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7908:21: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7908 |                     (PyUnicode_GET_SIZE(**name) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                     ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7908:21: warning: ‘PyUnicode_AsUnicode’ is deprecated [-Wdeprecated-declarations]
       7908 |                     (PyUnicode_GET_SIZE(**name) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                     ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:580:45: note: declared here
        580 | Py_DEPRECATED(3.3) PyAPI_FUNC(Py_UNICODE *) PyUnicode_AsUnicode(
            |                                             ^~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7908:21: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7908 |                     (PyUnicode_GET_SIZE(**name) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                     ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7908:21: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7908 |                     (PyUnicode_GET_SIZE(**name) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                     ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7908:21: warning: ‘PyUnicode_AsUnicode’ is deprecated [-Wdeprecated-declarations]
       7908 |                     (PyUnicode_GET_SIZE(**name) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                     ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:580:45: note: declared here
        580 | Py_DEPRECATED(3.3) PyAPI_FUNC(Py_UNICODE *) PyUnicode_AsUnicode(
            |                                             ^~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7908:21: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7908 |                     (PyUnicode_GET_SIZE(**name) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                     ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7924:25: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7924 |                         (PyUnicode_GET_SIZE(**argname) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                         ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7924:25: warning: ‘PyUnicode_AsUnicode’ is deprecated [-Wdeprecated-declarations]
       7924 |                         (PyUnicode_GET_SIZE(**argname) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                         ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:580:45: note: declared here
        580 | Py_DEPRECATED(3.3) PyAPI_FUNC(Py_UNICODE *) PyUnicode_AsUnicode(
            |                                             ^~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7924:25: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7924 |                         (PyUnicode_GET_SIZE(**argname) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                         ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7924:25: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7924 |                         (PyUnicode_GET_SIZE(**argname) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                         ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7924:25: warning: ‘PyUnicode_AsUnicode’ is deprecated [-Wdeprecated-declarations]
       7924 |                         (PyUnicode_GET_SIZE(**argname) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                         ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:580:45: note: declared here
        580 | Py_DEPRECATED(3.3) PyAPI_FUNC(Py_UNICODE *) PyUnicode_AsUnicode(
            |                                             ^~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:7924:25: warning: ‘_PyUnicode_get_wstr_length’ is deprecated [-Wdeprecated-declarations]
       7924 |                         (PyUnicode_GET_SIZE(**argname) != PyUnicode_GET_SIZE(key)) ? 1 :
            |                         ^
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:446:26: note: declared here
        446 | static inline Py_ssize_t _PyUnicode_get_wstr_length(PyObject *op) {
            |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c: In function ‘__Pyx_PyUnicode_Substring’:
      /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:8399:9: warning: ‘PyUnicode_FromUnicode’ is deprecated [-Wdeprecated-declarations]
       8399 |         return PyUnicode_FromUnicode(NULL, 0);
            |         ^~~~~~
      In file included from /usr/include/python3.9/unicodeobject.h:1026,
                       from /usr/include/python3.9/Python.h:97,
                       from /tmp/pip-install-g6_5_8w6/cython/Cython/Plex/Scanners.c:19:
      /usr/include/python3.9/cpython/unicodeobject.h:551:42: note: declared here
        551 | Py_DEPRECATED(3.3) PyAPI_FUNC(PyObject*) PyUnicode_FromUnicode(
            |                                          ^~~~~~~~~~~~~~~~~~~~~
      error: command '/usr/bin/x86_64-linux-gnu-gcc' failed with exit code 1
      ----------------------------------------
  ERROR: Command errored out with exit status 1: /home/amirouche/.local/share/virtualenvs/tmp-kapugSIt/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-g6_5_8w6/cython/setup.py'"'"'; __file__='"'"'/tmp/pip-install-g6_5_8w6/cython/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-yd42iafn/install-record.txt --single-version-externally-managed --prefix /tmp/pip-build-env-gthfa0j3/overlay --compile --install-headers /home/amirouche/.local/share/virtualenvs/tmp-kapugSIt/include/site/python3.9/cython Check the logs for full command output.
  ----------------------------------------
ERROR: Command errored out with exit status 1: /home/amirouche/.local/share/virtualenvs/tmp-kapugSIt/bin/python /home/amirouche/.local/share/virtualenvs/tmp-kapugSIt/lib/python3.9/site-packages/pip install --ignore-installed --no-user --prefix /tmp/pip-build-env-gthfa0j3/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- 'setuptools >= 35.0.2' 'wheel >= 0.29.0' twine 'cython == 0.29.6' Check the logs for full command output.
% 

Iterate over all articles efficiently

Issue #300 and PR #301 added support for iterating over all articles in the order they are stored in the file, for performance reasons. How can this be used through the Python bindings?

Add getMainPage and a setter to the Creator wrapper

This is probably an upstream limitation, but requiring main_page when creating a ZimCreator seems odd. It can be set to blank at creation, yet it can't be changed afterwards?

I used the same construct as node-libzim, which fixes it at creation.
ZimCreator must implement a getMainPage function so that, when finishZimCreation() (C++) is called, fillHeader() can fill the header with its value.

With the current design I can overload the wrapper function and add a setter if needed.

I think it makes sense, yes.
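
The setter idea discussed above could look roughly like the following. This is only a sketch in plain Python: `CreatorSketch`, its `header` dict and `finalize()` are hypothetical stand-ins, not the actual python-libzim wrapper or the libzim C++ classes. It only illustrates why a setter works when the main page is read at finalization rather than at creation.

```python
class CreatorSketch:
    """Hypothetical stand-in for a creator wrapper (not the real API)."""

    def __init__(self, filename: str, main_page: str = "") -> None:
        self.filename = filename
        self._main_page = main_page  # may be blank at creation
        self.header = {}             # stand-in for the ZIM header

    @property
    def main_page(self) -> str:
        return self._main_page

    @main_page.setter
    def main_page(self, url: str) -> None:
        # Allowed at any time before finalize() is called.
        self._main_page = url

    def get_main_page(self) -> str:
        # Counterpart of the getMainPage hook read when the header is filled.
        return self._main_page

    def finalize(self) -> None:
        # The main page is looked up here, not in __init__.
        self.header["main_page"] = self.get_main_page()


creator = CreatorSketch("example.zim")   # main page left blank
creator.main_page = "A/index.html"       # set later through the setter
creator.finalize()
```

Because the value is only read in `finalize()`, changing it after construction is safe in this sketch; whether the real binding can do the same depends on when the underlying library reads it.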

Failed install on Python 3.7.6

When I try to install with this command, the build fails:
python -m pip install python-libzim-master

 Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing wheel metadata ... done
Building wheels for collected packages: libzim
  Building wheel for libzim (PEP 517) ... error
  ERROR: Command errored out with exit status 1:
 Complete output (34 lines):
  [!] Warning: Couldn't find zim/*.h in ./include!
      Hint: You can install them from source from https://github.com/openzim/libzim
            or download a prebuilt release's headers into ./include/zim/*.h
            (or set CFLAGS='-I<library_path>/include')
  [!] Warning: Couldn't find libzim.so in ./lib or system library paths!    Hint: You can install it from source from https://github.com/openzim/libzim
            or download a prebuilt libzim.so release into ./lib.
            (or set LDFLAGS='-L<library_path>/lib/[x86_64-linux-gnu]')
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win32-3.7
  creating build\lib.win32-3.7\libzim
  copying libzim\reader.py -> build\lib.win32-3.7\libzim
  copying libzim\writer.py -> build\lib.win32-3.7\libzim
  copying libzim\__init__.py -> build\lib.win32-3.7\libzim
  running egg_info
  writing libzim.egg-info\PKG-INFO
  writing dependency_links to libzim.egg-info\dependency_links.txt
  writing top-level names to libzim.egg-info\top_level.txt
  reading manifest file 'libzim.egg-info\SOURCES.txt'
  reading manifest template 'MANIFEST.in'
  warning: no previously-included files matching '__pycache__\*' found anywhere in distribution
  writing manifest file 'libzim.egg-info\SOURCES.txt'
  copying libzim\lib.cxx -> build\lib.win32-3.7\libzim
  copying libzim\lib.h -> build\lib.win32-3.7\libzim
  copying libzim\wrapper.cpp -> build\lib.win32-3.7\libzim
  copying libzim\wrapper.h -> build\lib.win32-3.7\libzim
  copying libzim\wrapper.pxd -> build\lib.win32-3.7\libzim
  copying libzim\wrapper.pyx -> build\lib.win32-3.7\libzim
  copying libzim\wrapper_api.h -> build\lib.win32-3.7\libzim
  running build_ext
  building 'libzim.wrapper' extension
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
  ----------------------------------------
  ERROR: Failed building wheel for libzim
Failed to build libzim
ERROR: Could not build wheels for libzim which use PEP 517 and cannot be installed directly

What am I doing wrong?

we should accept pathlib.Path

With the general move in the stdlib to accept pathlib.Path for all path-related operations, it seems a little odd and dated to have to pass a str to libzim.

I'd like to follow this move so that Creator and File accept both str and Path. The Python wrapper would check the type and convert it to str for the binding.

I should mention that functions usually return the same type as the one provided. I think File.filename is the only place where we output a path. We would accept that it still returns str (otherwise we'd have to store the type information in the wrapper).
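
A minimal sketch of the conversion step, assuming it lives in the Python wrapper. The helper name `to_binding_path` is hypothetical; `os.fspath` is the stdlib function that accepts both `str` and `os.PathLike` objects such as `pathlib.Path`.

```python
import os
from pathlib import Path


def to_binding_path(path) -> str:
    """Return `path` as a str, accepting either str or os.PathLike.

    Hypothetical helper: the binding itself would still only see str.
    """
    converted = os.fspath(path)  # raises TypeError for unsupported types
    if not isinstance(converted, str):
        # os.fspath also lets bytes paths through; this sketch rejects them.
        raise TypeError(f"expected str or Path, got {type(path).__name__}")
    return converted


# Both spellings give the binding the same str value.
from_str = to_binding_path("test.zim")
from_path = to_binding_path(Path("test.zim"))
```

As noted above, a property such as File.filename would keep returning str under this approach, since the original type is not stored.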

get_metadata() doesn't return a dict

get_metadata()'s docstring announces a returned dict with the file's metadata.

Instead, it expects one parameter, the metadata key, and returns a memoryview of the value for that key.

  • we probably want str values here (this covers nearly all use cases), as the raw content would still be available at r.get_article("M/Tags").content
  • should we expose a dict of all defined metadata somewhere? It would definitely be useful.
  • the docstring should be adjusted to match the chosen behavior
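
The two options in the list above (str values, and a dict of all metadata) can be sketched as follows. The sketch does not call the real reader: `get_raw` is a hypothetical callable standing in for the current key-based lookup that returns a memoryview, and UTF-8 encoding of the values is an assumption.

```python
from typing import Callable, Dict, Iterable


def metadata_as_str(raw: memoryview) -> str:
    """Decode one raw metadata value (assumed to be UTF-8) into a str."""
    return bytes(raw).decode("utf-8")


def collect_metadata(
    get_raw: Callable[[str], memoryview], keys: Iterable[str]
) -> Dict[str, str]:
    """Build a {key: str value} dict from a key-based raw lookup."""
    return {key: metadata_as_str(get_raw(key)) for key in keys}


# Stand-in store that mimics a lookup returning memoryview objects.
fake_store = {
    "Title": memoryview(b"Test ZIM"),
    "Language": memoryview(b"fra"),
}
metadata = collect_metadata(fake_store.__getitem__, ["Title", "Language"])
```

Binary metadata, if any exists in a given file, would not decode cleanly this way, which is one reason to keep the raw content reachable as the first bullet suggests.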

what is get_filename on writer Article?

What's the purpose of writer.Article.get_filename()? If it returns an empty string, everything works as expected, but if it returns anything else, I get an error at ZIM creation…

I understand that the ZIM file itself (the Creator) has a filename, but what about the Article? Isn't the url its ID?

T:15; A:4; RA:0; CA:4; UA:0; FA:1; IA:1; C:0; CC:0; UC:0; WC:0
T:15; Waiting for workers
terminate called after throwing an instance of 'std::runtime_error'
  what():  cannot open no-idea
Aborted

segfault on variable reuse

Reusing the same variable for a different Creator causes a segfault:

In [1]: import libzim.writer

In [2]: c = libzim.writer.Creator("/data/test03.zim", "/A/index.html", "fra", 2048)
/data/test03.zim

In [3]: c = libzim.writer.Creator("/data/test04.zim", "/A/index.html", "fra", 2048)
/data/test04.zim

In [4]: Segmentation fault
