antenati's People

Contributors

dependabot[bot], gcerretani

antenati's Issues

Antenati update overnight changed the URLs

There was an Antenati update last night. The new URLs are in this format:
https://www.antenati.san.beniculturali.it/ark:/12657/an_ua36075040?lang=en

It looks like you can still extract the numerical code after the "an_ua" and use it in the old URL format, like:
https://www.antenati.san.beniculturali.it/detail-registry/?s_id=36075040
but the IIIF manifest appears to be unavailable.
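
The mapping described above can be sketched in Python. This is only an illustration: the function name `ark_to_registry_url` and the regular expression are assumptions, and it relies on the numeric code always following "an_ua" as in the example links.

```python
import re
from typing import Optional

# Matches the numeric code that follows "an_ua" in the new ark-style URLs.
# The pattern is an assumption based on the example links in this issue.
_ARK_ID = re.compile(r"/ark:/\d+/an_ua(\d+)")


def ark_to_registry_url(url: str) -> Optional[str]:
    """Return the old-style detail-registry URL, or None if no code is found."""
    match = _ARK_ID.search(url)
    if match is None:
        return None
    return (
        "https://www.antenati.san.beniculturali.it/detail-registry/"
        f"?s_id={match.group(1)}"
    )
```

Calling it on the new-format link above should yield the old-format link with `s_id=36075040`.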

The IIIF manifest for the link above is at the below URL, but appears to be secured:
https://dam-antenati.san.beniculturali.it/antenati/containers/046by1e/manifest

Using Chrome, you can find the 7-character identifier code from the manifest for each page, and use the existing content link URL, replacing the 7-character code, to download the page:
https://iiif-antenati.san.beniculturali.it/iiif/2/xxxxxxx/full/full/0/default.jpg
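
As a rough sketch, substituting the code into that content link could look like this in Python. The helper name `image_url` is hypothetical, and the length check only reflects the "7-character" description given here.

```python
# URL template copied from the content link above; "{code}" replaces "xxxxxxx".
IMAGE_URL_TEMPLATE = (
    "https://iiif-antenati.san.beniculturali.it/iiif/2/{code}/full/full/0/default.jpg"
)


def image_url(code: str) -> str:
    """Build the full-size image URL for one page identifier."""
    # Guard based on the 7-character description in this issue (an assumption).
    if len(code) != 7:
        raise ValueError(f"expected a 7-character code, got {code!r}")
    return IMAGE_URL_TEMPLATE.format(code=code)
```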

But the script will no longer work without an update. Since it looks like the manifest may not be available, maybe there's a way to programmatically inspect the page, pull all the codes that would be in the manifest, and download them.

I can also give this info:
This is one of the older links to an 1860 birth record book:
https://www.antenati.san.beniculturali.it/detail-registry/?s_id=1092642&lang=en
It translates to this new link:
https://www.antenati.san.beniculturali.it/ark:/12657/an_ua1092642/5dgG4e3

EDIT:
If I use the Chrome plug-in "Save all resources", it will save all the files loaded, INCLUDING the complete IIIF manifest. In the example page:
https://www.antenati.san.beniculturali.it/ark:/12657/an_ua36075040?lang=en
Where the manifest is at:
https://dam-antenati.san.beniculturali.it/antenati/containers/046by1e/manifest
The plug-in will save the file manifest.html, which is the complete IIIF manifest. I can't find the correct URL to load the manifest myself, though. Adding ".html" to the manifest link does not work, even though that's the path given by the plug-in.
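
If the saved file really is a JSON manifest in the IIIF Presentation API 2.x layout (an assumption, not verified against this site), the page codes could be pulled out roughly as below. The function name `image_ids_from_manifest` is hypothetical.

```python
from typing import List


def image_ids_from_manifest(manifest: dict) -> List[str]:
    """Collect the last path segment of each image service "@id".

    Assumes the IIIF Presentation 2.x layout:
    sequences -> canvases -> images -> resource -> service -> "@id".
    """
    ids: List[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service", {})
                # Some manifests wrap the service in a list; take the first entry.
                if isinstance(service, list):
                    service = service[0] if service else {}
                if not isinstance(service, dict):
                    continue
                service_id = service.get("@id", "")
                if service_id:
                    ids.append(service_id.rstrip("/").rsplit("/", 1)[-1])
    return ids
```

The saved file could be loaded with `json.load()` and passed to this function; each returned code would then go into the content link URL.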

Sometimes the software suddenly fails (Mac OS)

git fetch origin --prune
git pull
python3 antenati.py "https://www.antenati.san.beniculturali.it/detail-registry/?s_id=19135478"
Traceback (most recent call last):
  File "/Users/kordan/Desktop/ALBERI/antenati/antenati.py", line 201, in <module>
    main()
  File "/Users/kordan/Desktop/ALBERI/antenati/antenati.py", line 186, in main
    downloader = AntenatiDownloader(args.url)
  File "/Users/kordan/Desktop/ALBERI/antenati/antenati.py", line 31, in __init__
    self.manifest = self.__get_iiif_manifest(self.archive_url)
  File "/Users/kordan/Desktop/ALBERI/antenati/antenati.py", line 63, in __get_iiif_manifest
    raise RuntimeError(f'{url}: HTTP error {http_reply.status}')
RuntimeError: https://www.antenati.san.beniculturali.it/detail-registry/?s_id=19135478: HTTP error 403

invalid syntax on line 41

Hi.

I'm running the script on macOS 10.13 with the latest python3.

When running a command like:

python3 antenati.py https://www.antenati.san.beniculturali.it/detail-registry/?s_id=35390357

I always get a syntax error like the following:

File "antenati.py", line 41
    raise RuntimeError(f'Cannot get archive ID from {url}')
                                                         ^
SyntaxError: invalid syntax

Running the same script on macOS 11.4 works perfectly fine, so I can't figure out what's happening...
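
The caret points at the end of an f-string, and f-strings only exist in Python 3.6 and later, so one plausible cause is that `python3` on the 10.13 machine is older than 3.6. A minimal check, written without f-strings so that an older interpreter can still parse it, might look like this (the helper name `supports_fstrings` is made up for illustration):

```python
import sys

MIN_VERSION = (3, 6)  # f-strings were introduced in Python 3.6


def supports_fstrings(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings."""
    return tuple(version_info[:2]) >= MIN_VERSION


if not supports_fstrings():
    # %-formatting is used on purpose: this line must parse on old Pythons too.
    sys.exit("Python %d.%d or newer is required, found %s"
             % (MIN_VERSION + (sys.version.split()[0],)))
```

A check like this only helps from a separate file, because the SyntaxError is raised while antenati.py is being compiled, before any of its code runs. Running `python3 --version` on both machines gives the same information.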

Feature Request: Pages

I'd like to request a feature where you could specify the start and end page numbers, and have it just download those images. Would be very helpful to just grab the last 8 pages of an index, for example.
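
One way such a page range could work, sketched in Python under the assumption that the script holds the pages in an ordered list. The function name `select_pages` and its `first`/`last` parameters are illustrative, with page numbers 1-indexed and inclusive.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def select_pages(pages: Sequence[T],
                 first: Optional[int] = None,
                 last: Optional[int] = None) -> List[T]:
    """Return pages first..last (1-indexed, inclusive); None means open-ended."""
    start = 1 if first is None else first
    if start < 1 or (last is not None and last < start):
        raise ValueError("page range must satisfy 1 <= first <= last")
    end = len(pages) if last is None else last
    # Slicing clamps automatically if `end` exceeds the number of pages.
    return list(pages[start - 1:end])
```

Grabbing the last 8 pages of an index would then be `select_pages(pages, first=len(pages) - 7)`.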

Configuring installation scheme with distutils config files is deprecated

On macOS:
cd ~/Desktop
git clone https://github.com/gcerretani/antenati.git antenati
cd antenati
pip3 install -r requirements.txt
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at Homebrew/homebrew-core#76621

which python
/usr/bin/python

So I'm using the Python preinstalled by Apple.
Is this a problem with your software or with my Mac?

Script fails where multiple years share same gallery ID

Some state archives do not create separate galleries for each year and record type. E.g. Modugno (BA) Stato Civile Napoleonico uses one gallery number for all births, one for all deaths, etc. Image numbers are not reused: e.g. 1814 deaths might contain images 1-100 and 1815 deaths might contain images 101-200.

When the script encounters an already existing subdirectory created in a previous run, an error is thrown and the script terminates:

~/Documents/Antenati/Modugno $ antenati.py http://dl.antenati.san.beniculturali.it/v/Archivio+di+Stato+di+Bari/Stato+civile+napoleonico/Modugno/Morti/1815/005619901_02177.jpg.html
Traceback (most recent call last):
  File "/usr/bin/antenati.py", line 74, in <module>
    main()
  File "/usr/bin/antenati.py", line 52, in main
    os.mkdir(splitting[13])
OSError: [Errno 17] File exists: '005619901'

Maybe the test for a duplicate name shouldn't be in the creation of the subdirectory but rather in looking to see if a downloaded file is going to overwrite a file in the target directory with the same name?
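
A minimal sketch of that idea, assuming Python 3 (the script in the traceback may be older) and a hypothetical helper `save_image`: the directory is reused if it exists, and the duplicate check moves to the individual file.

```python
import os


def save_image(directory: str, filename: str, data: bytes) -> bool:
    """Write data to directory/filename; return False if the file already exists."""
    # exist_ok=True means an existing directory is reused instead of raising.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    try:
        # Mode "xb" raises FileExistsError rather than overwriting a file.
        with open(path, "xb") as handle:
            handle.write(data)
    except FileExistsError:
        return False
    return True
```

With this shape, a second run for 1815 would write images 101-200 next to the existing 1-100, and only a true filename clash would be skipped.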

In the meantime I got around it by renaming the subdirectory after a run is finished.

HTTP error 403

Hello, I've been running the antenati script for a long time. I haven't used it in a while, and now I always get a runtime error:

python3 antenati.py https://antenati.cultura.gov.it/ark:/12657/an_ua35390363/5BxaADy
Traceback (most recent call last):
  File "/Users/jetmcquack/Documents/Genealogia/tool antenati/antenati.py", line 231, in <module>
    main()
  File "/Users/jetmcquack/Documents/Genealogia/tool antenati/antenati.py", line 215, in main
    downloader = AntenatiDownloader(args.url, args.first, args.last)
  File "/Users/jetmcquack/Documents/Genealogia/tool antenati/antenati.py", line 42, in __init__
    self.manifest = self.__get_iiif_manifest(self.url)
  File "/Users/jetmcquack/Documents/Genealogia/tool antenati/antenati.py", line 104, in __get_iiif_manifest
    raise RuntimeError(f'{url}: HTTP error {http_reply.status}')
RuntimeError: https://antenati.cultura.gov.it/ark:/12657/an_ua35390363/5BxaADy: HTTP error 403

Am I doing something wrong, or is there an issue?
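
HTTP 403 means the server refused the request. Some servers refuse clients that do not send browser-like headers, so one thing to try is sending an explicit User-Agent. The sketch below uses the standard library's `urllib` rather than whatever HTTP client the script uses, the header value is only a placeholder, and whether this resolves the error for this site is an assumption.

```python
import urllib.error
import urllib.request


def build_request(url: str) -> urllib.request.Request:
    """Create a request carrying an explicit User-Agent header."""
    # Placeholder value; the header the server actually expects is unknown.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; antenati-example)"}
    return urllib.request.Request(url, headers=headers)


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download url, turning HTTP errors into a readable RuntimeError."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as reply:
            return reply.read()
    except urllib.error.HTTPError as error:
        raise RuntimeError(f"{url}: HTTP error {error.code} ({error.reason})") from error
```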

Need option to use an existing directory

If an archive for a particular year (example: Modugno births 1866) has multiple subfolders (in this case Parte 1 and Parte 2), when executing the script for the first image in Parte 1 if no folder exists, a folder is created on the local drive and the images are fetched, adding "Parte 1" before the folio number and image number in the filename. When trying to execute the script for the first image in Parte 2, the script tells you the folder already exists and exits.

I added some logic for the case where the folder already exists: it uses click.confirm() to ask the user whether to continue, with an abort option. It seems to work, although the abort is less than elegant (I admit to being a hack at programming). I've attached a zip, antenati2.py.zip, with the updated code. It may require installing click through pip.

Here's the added code (also the addition of "import click")

    if os.path.exists(foldername):
        # Ask before reusing the directory; answering "n" raises click.Abort.
        click.confirm("Directory " + foldername + " already exists.Do you want to copy images to this directory?", default=True, abort=True)
    else:
        # First run for this archive: create the directory.
        os.mkdir(foldername)

    os.chdir(foldername)

Below is the output with the option enabled (yes or enter is the default and will continue with images pulled into the existing directory).

Directory Modugno_Morti_1866 already exists.Do you want to copy images to this directory? [Y/n]: n
Traceback (most recent call last):
  File "/usr/bin/antenati2.py", line 86, in <module>
    main()
  File "/usr/bin/antenati2.py", line 60, in main
    click.confirm("Directory " + foldername + " already exists.Do you want to copy images to this directory?", default=True, abort=True)
  File "~/.local/lib/python2.7/site-packages/click/termui.py", line 181, in confirm
    raise Abort()
click.exceptions.Abort
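
The traceback appears because `abort=True` raises an exception that nothing catches. Without `abort=True`, click.confirm() simply returns the answer as a boolean, so the script can exit on its own terms. The sketch below shows the same flow using only the standard library instead of click. It assumes Python 3.6+, and the names `confirm_reuse` and `prepare_folder` are illustrative.

```python
import os
import sys
from typing import Callable


def confirm_reuse(foldername: str, ask: Callable[[str], str] = input) -> bool:
    """Ask whether to reuse an existing directory; Enter counts as yes."""
    prompt = f"Directory {foldername} already exists. Copy images into it? [Y/n]: "
    answer = ask(prompt).strip().lower()
    return answer in ("", "y", "yes")


def prepare_folder(foldername: str, ask: Callable[[str], str] = input) -> None:
    """Create the folder, or ask before reusing it; exit cleanly on refusal."""
    if os.path.isdir(foldername):
        if not confirm_reuse(foldername, ask):
            # sys.exit prints the message and ends the run without a traceback.
            sys.exit("Aborted: existing directory left untouched.")
    else:
        os.mkdir(foldername)
```

Passing the prompt function as a parameter is only there to make the behavior easy to test without typing answers by hand.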
