
Comments (3)

allejok96 commented on August 14, 2024

This one is a bit more tricky... There's an API for JW broadcasting, and there's an API for downloading publications. But I haven't seen any API for articles and pages on the website, and I wouldn't think there is any either, because that would be overkill.

That would mean we need a web page scraper. And that would mean it could break whenever there's an update to the layout etc. of the web page.

I know there's interest in scraping jw.org, not only for downloading a bunch of audio, but also for things like a jw.org news client for Kodi etc... It would be nice, but it's a bit of a project on its own.

I'll take a look at how the audio recordings are handled, but chances are all solutions are too fragile.

from jw-scripts.

allejok96 commented on August 14, 2024

May I ask why you need this, and how Python-savvy you are?


allejok96 commented on August 14, 2024

Yeah, if you can get hold of the document ID there is an API to download the MP3s... But the catch is getting the ID... I'm giving you an unorthodox quick fix here, and it only works for web articles. Tweak it to suit your needs.

#!/usr/bin/env python3
# Run the program with a jw.org URL as an argument to
# download all recordings that are referenced in that page
import json
import re
import sys
import urllib.error
import urllib.request

lang = 'E'
api_url = 'https://apps.jw.org/GETPUBMEDIALINKS?output=json&alllangs=0&fileformat=MP3&langwritten=' + lang + '&txtCMSLang=' + lang + '&docid='
data = urllib.request.urlopen(sys.argv[1]).read().decode('utf-8')
# Document IDs are embedded in the page as data-page-id="mid<ID>"
matches = re.finditer(r'data-page-id="mid([^"]*)"', data)
ids = set(x.group(1) for x in matches)  # set() removes duplicates

for i in ids:
    try:
        print('requesting data about', i)
        response = urllib.request.urlopen(api_url + i)
    except urllib.error.URLError:
        # Skip IDs the API cannot serve (HTTPError is a subclass of URLError)
        continue

    tree = json.loads(response.read().decode('utf-8'))
    file_url = tree['files'][lang]['MP3'][0]['file']['url']  # Assuming there's only one MP3
    file_title = tree['files'][lang]['MP3'][0]['title']
    file_name = re.sub(r'[<>:"|?*/\\\0]', '', file_title) + '.mp3'  # NTFS safe
    print('downloading', file_title)
    urllib.request.urlretrieve(file_url, filename=file_name)
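
If you want to tweak the script, it may help to split the steps into small functions that can be tried without touching the network. The following is only a rough sketch under assumptions: the function names (extract_doc_ids, build_api_url, iter_mp3s, safe_filename) are made up for illustration, and the JSON layout is assumed to match what the script above reads (files, then language, then MP3, then title and file/url). It also loops over every MP3 entry instead of assuming there is only one.

```python
import re
from urllib.parse import urlencode

# Endpoint copied from the script above
API_BASE = 'https://apps.jw.org/GETPUBMEDIALINKS'


def extract_doc_ids(html):
    """Return the set of IDs found in data-page-id="mid<ID>" attributes."""
    return set(re.findall(r'data-page-id="mid([^"]*)"', html))


def build_api_url(doc_id, lang='E'):
    """Build the request URL with the same query parameters the script uses."""
    query = urlencode({
        'output': 'json',
        'alllangs': 0,
        'fileformat': 'MP3',
        'langwritten': lang,
        'txtCMSLang': lang,
        'docid': doc_id,
    })
    return API_BASE + '?' + query


def iter_mp3s(tree, lang='E'):
    """Yield (title, url) for every MP3 entry (assumed JSON layout)."""
    for entry in tree.get('files', {}).get(lang, {}).get('MP3', []):
        yield entry['title'], entry['file']['url']


def safe_filename(title):
    """Strip characters that NTFS does not allow in file names."""
    return re.sub(r'[<>:"|?*/\\\0]', '', title) + '.mp3'
```

In the download loop you would then call iter_mp3s(tree, lang) and pass each URL, together with safe_filename(title), to urllib.request.urlretrieve. If the page layout changes and extract_doc_ids returns an empty set, that is the first place to look.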
