iscc-cli - Command Line Tool

Caution

This implementation is currently not up to date and does NOT generate valid ISCCs.

A command line tool that creates ISCC Codes for digital media files based on the reference implementation.

Background

The International Standard Content Code is a proposal for an open standard for decentralized content identification. ISCC Codes are generated algorithmically from the content itself and offer many powerful features like content similarity clustering and partial integrity checks. If you want to learn more about the ISCC please check out https://iscc.codes.

This tool offers an easy way to generate ISCC codes from the command line. It supports content extraction via Apache Tika and uses the ISCC reference implementation.

Supported Media File Types

Text

doc, docx, epub, html, odt, pdf, rtf, txt, xml, ibooks, md, xls, mobi ...

Image

gif, jpg, png, tif, bmp, psd, eps ...

Note: EPS (postscript) support requires Ghostscript to be installed on your system and available on your PATH. (Make sure you can run gs from your command line.)

Audio

aif, mp3, ogg, wav ...

Note: Support for the Audio-ID is experimental and not yet part of the specification.

Video

3gp, 3g2, asf, avi, flv, gif, mpg, mp4, mkv, mov, ogv, webm, wmv ...

Note: Support for the Video-ID is experimental and not yet part of the specification.

Requirements

NOTE: Requires Java to be installed and available on your PATH!

iscc-cli is tested on Linux, Windows, and macOS with Python 3.6/3.7/3.8.

This tool depends on tika-python. Tika extracts metadata and content from media files before ISCC Codes are generated. On first execution, the iscc command line tool automatically downloads and launches the Java Tika Server in the background (this may take some time). Subsequent runs reuse the existing Tika instance. You can also pre-launch the Tika server explicitly with $ iscc init
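Since Java (for Tika) and, for EPS files, Ghostscript both have to be reachable on your PATH, a quick pre-flight check can save a confusing first run. The following is only a sketch and not part of iscc-cli: `missing_tools` is a hypothetical helper, and the executable names `java` and `gs` are taken from the notes above.

```python
import shutil


def missing_tools(names=("java", "gs")):
    """Return the executables from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Ghostscript is only needed for EPS input, so call `missing_tools(("java",))` if you do not process EPS files.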

Install

The ISCC command line tool is published with the package name iscc-cli on the Python Package Index and can be installed with pip:

$ pip3 install iscc-cli

Self-contained Windows binary executables are available for download at: https://github.com/iscc/iscc-cli/releases/

Usage

Getting Help

Show help overview by calling iscc without any arguments:

$ iscc
Usage: iscc [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  gen*   Generate ISCC Code for FILE.
  batch  Create ISCC Codes for all files in PATH.
  dump   Dump Tika extraction results for PATH (file or url path).
  info   Show information about environment.
  init   Inititalize and check environment.
  sim    Estimate Similarity of ISCC Codes A & B.
  test   Test conformance with latest reference data.
  web    Generate ISCC Code from URL.

Get help for a specific command by entering iscc <command>:

$ iscc gen
Usage: iscc gen [OPTIONS] FILE

  Generate ISCC Code for FILE.

Options:
  -g, --guess       Guess title (first line of text).
  -t, --title TEXT  Title for Meta-ID creation.
  -e, --extra TEXT  Extra text for Meta-ID creation.
  -v, --verbose     Enables verbose mode.
  -h, --help        Show this message and exit.

Generating ISCC Codes

For local files

The gen command generates an ISCC Code for a single file:

$ iscc gen tests/image/demo.jpg
ISCC:CC1GG3hSxtbWU-CYDfTq7Qc7Fre-CDYkLqqmQJaQk-CRAPu5NwQgAhv

The gen command is the default, so you can omit it and simply run $ iscc tests/image/demo.jpg

To get a more detailed result use the -v (--verbose) option:

$ iscc -v tests/image/demo.jpg
ISCC:CC1GG3hSxtbWU-CYDfTq7Qc7Fre-CDYkLqqmQJaQk-CRAPu5NwQgAhv
Norm Title: concentrated cat
Tophash:    7a8d0c513142c45f417e761355bf71f11ad61d783cd8958ffc0712d00224a4d0
Filepath:   tests/image/demo.jpg
GMT:        image

See iscc batch for help on how to generate ISCC codes for multiple files at once.
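If you call the tool from a script, the verbose output can be turned into a small data structure. The sketch below assumes the output has exactly the shape shown in the example above (one `ISCC:` line followed by `Key: value` lines); other versions may print different fields, and `parse_verbose_output` is a hypothetical helper, not something iscc-cli provides.

```python
def parse_verbose_output(text):
    """Split verbose iscc output into the ISCC code and a dict of fields."""
    iscc_code = None
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("ISCC:"):
            # The code itself follows the "ISCC:" prefix.
            iscc_code = line[len("ISCC:"):]
        else:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return iscc_code, fields


sample = """\
ISCC:CC1GG3hSxtbWU-CYDfTq7Qc7Fre-CDYkLqqmQJaQk-CRAPu5NwQgAhv
Norm Title: concentrated cat
Tophash:    7a8d0c513142c45f417e761355bf71f11ad61d783cd8958ffc0712d00224a4d0
Filepath:   tests/image/demo.jpg
GMT:        image
"""

code, fields = parse_verbose_output(sample)
```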

For web urls

The web command allows you to create ISCC codes from URLs:

$ iscc web https://iscc.foundation/news/images/lib-arch-ottawa.jpg
ISCC:CCbUCUSqQpyJo-CYaHPGcucqwe3-CDt4nQptEGP6M-CRestDoG7xZFy

Similarity of ISCC Codes

The sim command computes estimated similarity of two ISCC Codes:

$ iscc sim CCUcKwdQc1jUM CCjMmrCsKWu1D
Estimated Similarity of Meta-ID: 78.00 % (56 of 64 bits match)

You may also compare full four-component ISCC Codes.
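The "bits match" figure in the output above is a count of the bit positions in which two components agree. As a minimal sketch of that count, assuming both components have already been decoded to integers (decoding the text form of an ISCC is not shown, and how the tool derives the reported percentage from the count is not covered here):

```python
def matching_bits(a: int, b: int, nbits: int = 64) -> int:
    """Count the bit positions (out of `nbits`) where `a` and `b` are equal."""
    mask = (1 << nbits) - 1
    # XOR leaves a 1 in every position where the two values differ.
    differing = bin((a ^ b) & mask).count("1")
    return nbits - differing
```

`matching_bits` is a hypothetical helper name used for illustration only.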

Using from your Python code

While this package is not built to be used as a library, some of the high-level commands for generating ISCC Codes are exposed as plain Python functions:

from iscc_cli import lib
from pprint import pprint

pprint(lib.iscc_from_url("https://iscc.foundation/news/images/lib-arch-ottawa.jpg"))

{'gmt': 'image',
 'iscc': 'CCbUCUSqQpyJo-CYaHPGcucqwe3-CDt4nQptEGP6M-CRestDoG7xZFy',
 'norm_title': 'library and archives canada ottawa',
 'tophash': 'e264cc07209bfaecc291f97c7f8765229ce4c1d36ac6901c477e05b2422eea3e'}
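The `iscc` value in such a result is a full four-component code with the components joined by hyphens. A small sketch for splitting it into its parts follows; `split_iscc` is a hypothetical helper (not part of `iscc_cli.lib`), and it assumes the component order Meta-ID, Content-ID, Data-ID, Instance-ID.

```python
COMPONENT_NAMES = ("meta_id", "content_id", "data_id", "instance_id")


def split_iscc(code: str) -> dict:
    """Split a full four-component ISCC Code into named components."""
    code = code.strip()
    if code.upper().startswith("ISCC:"):
        code = code[len("ISCC:"):]
    parts = code.split("-")
    if len(parts) != len(COMPONENT_NAMES):
        raise ValueError(f"expected 4 components, got {len(parts)}")
    return dict(zip(COMPONENT_NAMES, parts))


components = split_iscc("CCbUCUSqQpyJo-CYaHPGcucqwe3-CDt4nQptEGP6M-CRestDoG7xZFy")
```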

Maintainers

@titusz

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

You may also want to join our developer chat on Telegram at https://t.me/iscc_dev.

Change Log

[0.9.12] - 2021-07-16

  • Update to custom mediatype detection (without Tika requirement)
  • Update dependencies

[0.9.11] - 2020-06-12

  • Update dependencies
  • Remove support for creating ISCC codes from youtube urls

[0.9.10] - 2020-05-19

  • Fixed issue with mime-type detection
  • Changed wording of similarity output
  • Added CSV-compatible output for batch command
  • Added debug option for batch command
  • Updated dependencies

[0.9.9] - 2020-05-18

  • Fixed issue with tika & macOS
  • Added macOS ci testing
  • Updated dependencies

[0.9.8] - 2020-05-13

  • Updated Content-ID-Audio for robustness against transcoding (breaking change)
  • Changed similarity calculation to match with web demo
  • Fixed bug in mime-type detection
  • Updated dependencies

[0.9.7] - 2020-05-01

  • Add support for flac and opus audio formats
  • Update dependencies

[0.9.6] - 2020-04-24

  • Support urls with dump command
  • Updated tika 1.24 and fpcalc 1.50
  • Use filename for meta-id as last resort
  • Switch to signed audio fingerprint (breaking change)
  • Bugfixes and stability improvements

[0.9.5] - 2020-03-02

  • Support mobi7
  • Support mobi print replica
  • Support mobi with web command

[0.9.4] - 2020-03-02

  • Add experimental support for mobi files

[0.9.3] - 2020-02-18

  • Add support for XHTML
  • Fix error on unsupported media types

[0.9.2] - 2020-01-30

  • Add support for bmp, psd, xls, xlsx
  • Add tika server live testing
  • Fix error with title guess on image files

[0.9.1] - 2020-01-05

  • Fix issue with APP_DIR creation

[0.9.0] - 2020-01-05

  • Add experimental support for Video-ID
  • Add special handling of YouTube URLs
  • Add support for more Media Types (trial and error)
  • Add support for Python 3.8
  • Remove support for Python 3.5

[0.8.2] - 2019-12-22

  • Add new test command for conformance testing
  • Add support for .md (Markdown) files
  • Update to ISCC v1.0.5
  • Update to Apache Tika 1.23
  • Fix issue with non-conformant Meta-ID

[0.8.1] - 2019-12-13

  • Add support for tif files
  • Add support for eps files
  • Set application directory to non-roaming path

[0.8.0] - 2019-11-23

  • Add new dump command (dumps extraction results)
  • Add support for iBooks files
  • Fix error with tika 1.22 dependency
  • Store tika server in non-volatile storage

[0.7.0] - 2019-09-12

  • Expose commands as python API
  • Fix title guessing bug

[0.6.0] - 2019-06-11

  • Added new web command (creates ISCC Codes for URLs)

[0.5.0] - 2019-06-06

  • Added experimental support for aif, mp3, ogg, wav
  • More verbose batch output
  • Fix batch output default Meta-ID

[0.4.0] - 2019-06-03

  • Added support for html, odt, txt, xml, gif
  • Added optional guessing of title (first line of text)
  • Added new info command
  • Fixed wrong detection of identical Instance-ID

[0.3.0] - 2019-06-01

  • Add sim command (similarity comparison of ISCC Codes)

[0.2.0] - 2019-05-31

  • Add support for doc, docx and rtf documents
  • Update to ISCC 1.0.4 (fixes whitespace bug)

[0.1.0] - 2019-05-31

  • Basic ISCC Code creation
  • Supported file types: jpg, png, pdf, epub

License

MIT © 2019-2021 Titusz Pan

iscc-cli's Issues

Add compare command

Command that takes 2 files as input and returns an ISCC based similarity result.

Feature suggestion: Export entries to .csv

Feature suggestion: add a command that runs a batch ISCC generation over the content of a folder and exports the detailed entries to a .csv file.

linux ubuntu cli issue

hey @titusz (and @vingle)

We just ran our first tests of Mova app on Ubuntu today, and we ran into this one.
We might be blocked on deploying to that platform till we can get it sorted. I am not sure how high priority that is for Nic, but would like to find out.

generating iscc for path /home/wesley/1_Projects/lighthouse-obsession.mp4 with title: lighthouse
stderr: Traceback (most recent call last):
File "iscc_cli/cli.py", line 97, in
File "click/core.py", line 829, in call

Error occurred in handler for 'iscc-request': Error: error during iscc generation
at Socket. (/tmp/.mount_mova-0CebFug/resources/app.asar/dist/iscc.js:94:24)
at Socket.emit (events.js:315:20)
at addChunk (internal/streams/readable.js:309:12)
at readableAddChunk (internal/streams/readable.js:284:9)
at Socket.Readable.push (internal/streams/readable.js:223:10)
at Pipe.onStreamRead (internal/stream_base_commons.js:188:23)
stderr: File "click/core.py", line 760, in main
File "click/_unicodefun.py", line 126, in _verify_python3_env
RuntimeError: Click will abort further execution because Python 3 was configured to use ASCII as encoding for the environment. Consult https://click.palletsprojects.com/python3/ for mitigation steps.

This system supports the C.UTF-8 locale which is recommended. You might be able to resolve your issue by exporting the following environment variables:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

[62677] Failed to execute script 'cli' due to unhandled exception!

iscc gen process exited with code 1
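When iscc is launched as a child process, as in the report above, the locale variables suggested by the error message can be set by the launcher instead of the user's shell. The sketch below shows the idea in Python; `utf8_env` is a hypothetical helper, and the same two variables can be set from any other launcher.

```python
import os


def utf8_env(base=None, locale="C.UTF-8"):
    """Return a copy of the environment with LC_ALL and LANG set to `locale`."""
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = locale
    env["LANG"] = locale
    return env


# Usage (not executed here, since it requires iscc to be installed):
#   import subprocess
#   subprocess.run(["iscc", "gen", path], env=utf8_env(), capture_output=True, text=True)
```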

Do not re-download tika server after reboots.

Tika server download and launch is managed by the 'tika' python package. Currently it re-downloads tika server jar after every system reboot. We should skip re-download if tika jar is already available.
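The check proposed here could look roughly like the sketch below. It is illustrative only: `needs_download` is a hypothetical helper, and where the jar is stored is decided by the tika package and is not shown.

```python
from pathlib import Path


def needs_download(jar_path) -> bool:
    """Return True if no usable (non-empty) file exists at `jar_path`."""
    path = Path(jar_path)
    return not (path.is_file() and path.stat().st_size > 0)
```

A caller would trigger the download only when `needs_download(...)` returns True, so an existing jar survives reboots as long as it is kept in a persistent directory.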

Large video files (over 2.14gb) fail with error "Unsupported media type"

I've been running some video tests and hitting the same 'unsupported media type' error for a variety of files - and eventually concluded:

  • the media type (QuickTime/MPEG4) is supported, other than ProRes
  • file sizes larger than somewhere between 2.14 GB and 2.43 GB fail with the same error message

The traceback is below the table.

File           Format             File size  Processing time  Outcome
134 min video  H264 low bitrate   3.21 GB    -                Unsupported media type
134 min video  H264 mid bitrate   4.96 GB    -                Unsupported media type
134 min video  H264 high bitrate  10.39 GB   -                Unsupported media type
134 min video  MPEG4/3GP          321.9 MB   1m 58s 758ms     passed
39 min video   H264 720p          2.43 GB    -                Unsupported media type
34 min video   H264 mid bitrate   2.6 GB     -                Unsupported media type
34 min video   H264 low bitrate   1.26 GB    -                passed
34 min video   H264 720p          2.14 GB    9m 7s 623ms      passed
15 min video   H264 low bitrate   359.4 MB   2m 22s 226ms     passed
15 min video   H264 mid bitrate   554.6 MB   3m 0s 107ms      passed
15 min video   H264 high bitrate  1.16 GB    5m 0s 478ms      passed
15 min video   MPEG4/3GP          36 MB      12s 211ms        passed
15 min video   ProRes             9.13 GB    -                Unsupported media type
15 min video   QuickTime DV Pal   3.41 GB    -                Unsupported media type
15 min video   QuickTime DV NTSC  3.41 GB    -                Unsupported media type
7 min video    QuickTime DV Pal   1.68 GB    5m 2s 226ms      passed
7 min video    QuickTime DV NTSC  1.68 GB    5m 9s 243ms      passed
7 min video    H264 high bitrate  571.8 MB   2m 31s 938ms     passed
7 min video    ProRes             4.39 GB    -                Unsupported media type
1 min video    ProRes             596 MB     -                Unsupported media type

Traceback

  File "/Users/me/.pyenv/versions/3.8.10/bin/iscc", line 8, in <module>
    sys.exit(cli())
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/iscc_cli/commands/gen.py", line 49, in gen
    title = get_title(tika_result, guess=guess, uri=file.name)
  File "/Users/me/.pyenv/versions/3.8.10/lib/python3.8/site-packages/iscc_cli/utils.py", line 92, in get_title
    mime_type = clean_mime(meta.get("Content-Type"))
AttributeError: 'NoneType' object has no attribute 'get'
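The last frame shows `.get` being called on `None`, which means no metadata mapping was available for these files. A defensive lookup along the following lines would let the caller report a clear error instead of an AttributeError. This is only a sketch, not the project's fix, and `content_type_or_none` is a hypothetical helper.

```python
def content_type_or_none(meta):
    """Return the Content-Type from a metadata mapping, or None if unavailable."""
    if not meta:
        # Covers both a missing mapping (None) and an empty one.
        return None
    return meta.get("Content-Type")
```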
