iso-ics-codes-scripts's Introduction

ISO ICS Codes Scripts

Scripts to fetch ISO ICS codes from https://www.iso.org/standards-catalogue/browse-by-ics.html into a static data file.

Usage:

ruby ics_scrapper.rb

The script uses 3 threads by default, but you can pass a parameter to set the number of threads, from 1 to 3:

ruby ics_scrapper.rb 2
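
As a rough illustration of the argument handling described above, the sketch below reads the first command-line argument and keeps it within 1 to 3. The names `MAX_THREADS` and `thread_count` are hypothetical, and falling back to the default on unusable input is an assumption; the real ics_scrapper.rb may handle its argument differently.

```ruby
# Hypothetical helper, not taken from ics_scrapper.rb.
MAX_THREADS = 3

# Returns the number of worker threads for the given argument list.
# Falls back to the maximum (the documented default) when the first
# argument is missing or is not an integer.
def thread_count(argv)
  requested = Integer(argv.first.to_s, exception: false)
  return MAX_THREADS if requested.nil?

  # Keep the value inside the documented range of 1 to 3.
  requested.clamp(1, MAX_THREADS)
end
```

In a script this would be called as `thread_count(ARGV)`.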

Progress information is printed while data is being fetched:

 Parse 872 of 1381 Queue: 589 Threads: 3

Data files are stored in the ./ics directory.

You can get the fetched data files from the https://github.com/metanorma/iso-ics-codes repo.

iso-ics-codes-scripts's People

Contributors

andrew2net, ronaldtse

iso-ics-codes-scripts's Issues

Make code multi-threaded

Because scraping takes time, we should make this multi-threaded and display progress. The threading model should use the producer/consumer model.

The initial scrape of the top level page will indicate how many fields there are. Put these into a central Set/Array that is shared amongst threads.

Once we have this, we can start multiple threads that consume from this set and push additional fetch URLs to the shared array.

Each successful fetch should immediately serialize the JSON to disk to ensure that partial progress is not lost.

The command line to start this process should take an additional option to determine number of threads to start.
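
The requirements above could be sketched roughly as follows, using Ruby's built-in `Queue` as the shared work list. This is a minimal sketch, not the repository's implementation: `scrape` and its `fetch` parameter are hypothetical names, and `fetch` stands in for the real HTTP request and HTML parsing. It takes a code and returns the entry hash plus the codes of any child pages.

```ruby
require "json"
require "fileutils"

# Hypothetical worker pool. Returns the number of entries written.
def scrape(seed_codes, thread_count:, out_dir:, fetch:)
  FileUtils.mkdir_p(out_dir)
  queue = Queue.new
  seed_codes.each { |code| queue << code }

  lock = Mutex.new
  pending = seed_codes.size # codes that are queued or being processed
  done = 0
  queue.close if pending.zero? # nothing to do, so let workers exit at once

  workers = Array.new(thread_count) do
    Thread.new do
      # pop returns nil once the queue is closed and empty, ending the loop.
      while (code = queue.pop)
        entry, children = fetch.call(code)
        # Write each entry immediately so partial progress is not lost.
        File.write(File.join(out_dir, "#{code}.json"), JSON.pretty_generate(entry))

        lock.synchronize do
          children.each { |child| queue << child }
          pending += children.size - 1
          done += 1
          queue.close if pending.zero? # no work left anywhere
        end
      end
    end
  end
  workers.each(&:join)
  done
end
```

Error handling and retries are left out: if `fetch` raises, that worker stops and the pending count never reaches zero, so a real script would need to rescue and requeue or skip the failed code.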

Migrate this repository to github.com/metanorma

The repository github.com/metanorma has been set up to host all Metanorma related work. This is one such piece of work. While the migration will be slow (there is a lot to do, and a lot of interdependencies), we need to schedule migration for all related repositories.

Migration to Metanorma:

  • Repost the repository to http://github.com/metanorma/X
  • If this is a Ruby gem, change the address of the repository in the *.gemspec
  • Delete all files but the Readme in the riboseinc repository
  • Replace the Readme in the riboseinc repository with the message "Repository migrated to http://github.com/metanorma/X"
  • Update the links to the repository in any documentation you maintain
  • Notify @opoudjis and @strogonoff when you have migrated the repository. They will then update the links to the repository in any documentation they maintain.

Structure ICS codes from iso.org as JSON-LD data set

ISO provides the page https://www.iso.org/standards-catalogue/browse-by-ics.html, from which all the ICS codes and their descriptions can be retrieved.

Intro to ICS

ICS is a hierarchical classification which consists of three levels:

  1. Field
  2. Group
  3. Sub-group

A Field has multiple Groups. A Group may have zero or more Sub-groups.

The general structure of an "ICS entry" (every Field, Group and Sub-group is an entry):

  • type: field|group|subgroup
  • code: integer (2-digit field, 3-digit group, 2-digit subgroup)
  • description: string
  • notes: array of Notes (only in groups and subgroups)
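
Since the digit counts differ per level, the type of an entry can be derived from its dotted code. The sketch below assumes codes are written as they appear on the pages ("35", "35.080", "35.240.95"); `ICS_CODE` and `entry_type` are hypothetical names.

```ruby
# Dotted ICS code: 2-digit field, optional 3-digit group, optional 2-digit subgroup.
ICS_CODE = /\A(?<field>\d{2})(?:\.(?<group>\d{3})(?:\.(?<subgroup>\d{2}))?)?\z/

# Returns "field", "group" or "subgroup" for a dotted ICS code string.
def entry_type(code)
  m = ICS_CODE.match(code)
  raise ArgumentError, "not an ICS code: #{code.inspect}" unless m

  return "subgroup" if m[:subgroup]
  return "group" if m[:group]

  "field"
end
```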

A Note can be a plain string or a string with a link, e.g.:

  • plain string: Including software development, documentation and use. In JSON we can represent it with { "text": "Including software development, documentation and use"}
  • string with link to another ICS code: Internet applications, see 35.240.95. In JSON we can represent it with { "text": "Internet applications, see {ics-code}", "ics-code": "35.240.95"}
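
One possible way to produce the two note shapes above is sketched below. It assumes a linked note always ends in "see" followed by a single dotted ICS code, as in the examples; notes that reference several codes would need more work. `SEE_REFERENCE` and `parse_note` are hypothetical names.

```ruby
# Matches a trailing "see <code>" reference with a dotted ICS code.
SEE_REFERENCE = /\bsee (\d{2}(?:\.\d{3}(?:\.\d{2})?)?)\z/

# Converts a raw note string into the hash form used in the JSON files.
def parse_note(raw)
  text = raw.strip
  match = SEE_REFERENCE.match(text)
  return { "text" => text } unless match

  # Replace the code with a placeholder and keep the code in its own key.
  { "text" => "#{match.pre_match}see {ics-code}", "ics-code" => match[1] }
end
```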

Retrieving ICS Data

From the top level page (https://www.iso.org/standards-catalogue/browse-by-ics.html) you can scrape all Fields and their links:

e.g.,

ICS	Field
01	Generalities. Terminology. Standardization. Documentation
03	Services. Company organization, management and quality. Administration. Transport. Sociology
...

Each ICS entry has its own page (which redirects): https://isoics.org/ics/[field].html. For example, the ICS code 01 links to the ICS Field page https://isoics.org/ics/01.html.

Each ICS Field page (e.g., 35) contains information about all its Groups
(https://isoics.org/ics/35.html):

ICS	Field
35.020	Information technology (IT) in general 
    Including general aspects of IT equipment
35.030	IT Security 
    Including encryption
35.040	Information coding 
    Including coding of audio, picture, multimedia and hypermedia information, bar coding, etc. 
    IT Security, see 35.030
35.060	Languages used in information technology
35.080	Software 
    Including software development, documentation and use 
    Internet applications, see 35.240.95

Each ICS Group contains a code (35.080), a description (Software), and zero or more notes (Including software development, documentation and use, Internet applications, see 35.240.95).
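
To make the code/description/notes structure concrete, the sketch below parses a listing that has already been flattened to text in the layout shown above: a tab between code and description, and indented lines for notes. This layout is an assumption for illustration only. The real pages are HTML, so an actual scraper would walk the markup instead. `parse_listing` is a hypothetical name.

```ruby
# Parses "<code>\t<description>" lines, each followed by zero or more
# indented note lines, into an array of entry hashes.
def parse_listing(text)
  entries = []
  text.each_line(chomp: true) do |line|
    next if line.strip.empty?

    if (m = line.match(/\A(\d{2}(?:\.\d{3}(?:\.\d{2})?)?)\t(.+)\z/))
      # A new entry starts with an ICS code in the first column.
      entries << { "code" => m[1], "description" => m[2].strip, "notes" => [] }
    elsif line.start_with?(" ", "\t") && entries.any?
      # Indented lines are notes belonging to the most recent entry.
      entries.last["notes"] << line.strip
    end
    # Anything else (such as the header row) is ignored.
  end
  entries
end
```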

Each ICS Group also has a link, https://isoics.org/ics/[field].[group].html, e.g., https://isoics.org/ics/33.040.html

On the ICS Group page, you can find all information about its subgroups (or about the Group itself if it has no subgroups), e.g., https://isoics.org/ics/33.040.html:

ICS	Field
33.040.01	Telecommunication systems in general
33.040.20	Transmission systems 
    Including synchronization, cable systems, integrated cabling, pathways and multiplexing
33.040.30	Switching and signalling systems 
    Including telecommunication call charging and billing aspects
33.040.35	Telephone networks 
    Including Public Switched Telephone Networks (PSTN), Private Telecommunication Networks (PTN) and Private Integrated Service Networks (PISN)
33.040.40	Data communication networks 
    Including Packet Switched Public Data Networks (PSPDN) and Ethernet 
    Integrated Services Digital Network (ISDN), see 33.080 
    Networking, see 35.110 
    IT terminal and other peripheral equipment, see 35.180

Here you can see that the "33.040.40" code has:

  • description: "Data communication networks"
  • notes:
[
    "Including Packet Switched Public Data Networks (PSPDN) and Ethernet",
    "Integrated Services Digital Network (ISDN), see 33.080",
    "Networking, see 35.110",
    "IT terminal and other peripheral equipment, see 35.180"
]

In conclusion, we can scrape all this data by visiting only three levels of pages:

  • Root page https://www.iso.org/standards-catalogue/browse-by-ics.html
  • Field page https://isoics.org/ics/[field].html
  • Group page https://isoics.org/ics/[field].[group].html
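
The three page levels map onto URLs as in the following sketch, which uses the URL patterns listed above. `ROOT_URL` and `listing_url` are hypothetical names; sub-group codes are rejected because, per the conclusion above, their data is already present on the Group page.

```ruby
ROOT_URL = "https://www.iso.org/standards-catalogue/browse-by-ics.html"

# Returns the page that lists the children of `code`: the root page for nil,
# a Field page for "35", a Group page for "35.080".
def listing_url(code = nil)
  return ROOT_URL if code.nil?
  unless code.match?(/\A\d{2}(\.\d{3})?\z/)
    raise ArgumentError, "no listing page for #{code.inspect}"
  end

  "https://isoics.org/ics/#{code}.html"
end
```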

The resulting format should be put in JSON for each entry.

Example sub-group (33.040.40.json):

{
  "code": "33.040.40", 
  "description": "Data communication networks",
  "descriptionFull": "Telecommunication systems. Telecommunications. Audio and video engineering. Data communication networks",
  "notes": [
    { "text": "Including Packet Switched Public Data Networks (PSPDN) and Ethernet" },
    { "text": "Integrated Services Digital Network (ISDN), see {ics-code}", "ics-code": "33.080" },
    { "text": "Networking, see {ics-code}", "ics-code": "35.110" },
    { "text": "IT terminal and other peripheral equipment, see {ics-code}", "ics-code": "35.180" }
  ]
}

Applying JSON-LD

https://isoics.org/jsonld/field.jsonld

{
  "@context":
  {
    "code": "https://isoics.org/ics/ns#code",
    "fieldcode": "https://isoics.org/ics/ns#fieldcode",
    "description": "https://isoics.org/ics/ns#description",
    "descriptionFull": "https://isoics.org/ics/ns#descriptionFull"
  }
}

https://isoics.org/ics/group.jsonld

{
  "@context":
  {
    "code": "https://isoics.org/ics/ns#code",
    "fieldcode": "https://isoics.org/ics/ns#fieldcode",
    "groupcode": "https://isoics.org/ics/ns#groupcode",
    "description": "https://isoics.org/ics/ns#description",
    "descriptionFull": "https://isoics.org/ics/ns#descriptionFull",
    "notes": {
      "@id": "https://isoics.org/ics/ns#notes",
      "@container": "@list"
    }
  }
}

https://isoics.org/ics/subgroup.jsonld

{
  "@context":
  {
    "code": "https://isoics.org/ics/ns#code",
    "fieldcode": "https://isoics.org/ics/ns#fieldcode",
    "groupcode": "https://isoics.org/ics/ns#groupcode",
    "subgroupcode": "https://isoics.org/ics/ns#subgroupcode",
    "description": "https://isoics.org/ics/ns#description",
    "descriptionFull": "https://isoics.org/ics/ns#descriptionFull",
    "notes": {
      "@id": "https://isoics.org/ics/ns#notes",
      "@container": "@list"
    }
  }
}

https://isoics.org/ics/note.jsonld

{
  "@context":
  {
    "text": "https://isoics.org/ics/ns#noteText",
    "ics-code": "https://isoics.org/ics/ns#code"
  }
}

Example sub-group (33.040.40.json):

{
  "@context": "https://isoics.org/ics/ns/subgroup.jsonld",
  "fieldcode": "33", 
  "groupcode": "040", 
  "subgroupcode": "40", 
  "code": "33.040.40", 
  "description": "Data communication networks",
  "descriptionFull": "Telecommunication systems. Telecommunications. Audio and video engineering. Data communication networks",
  "notes": [
    { "text": "Including Packet Switched Public Data Networks (PSPDN) and Ethernet" },
    { "text": "Integrated Services Digital Network (ISDN), see {ics-code}", "ics-code": "33.080" },
    { "text": "Networking, see {ics-code}", "ics-code": "35.110" },
    { "text": "IT terminal and other peripheral equipment, see {ics-code}", "ics-code": "35.180" }
  ]
}
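
Going from the plain JSON entry to the JSON-LD form mostly means splitting the dotted code into its parts and attaching a context. A minimal sketch, assuming the entry is already a hash with a "code" key: `to_jsonld` is a hypothetical name, and the context URL is passed in by the caller because the URLs shown above are not consistent about where the context documents live.

```ruby
require "json"

# Adds "@context" and the per-level code keys in front of the entry's own keys.
def to_jsonld(entry, context_url)
  field, group, subgroup = entry.fetch("code").split(".")
  # compact drops groupcode/subgroupcode for fields and groups.
  parts = { "fieldcode" => field, "groupcode" => group, "subgroupcode" => subgroup }.compact
  { "@context" => context_url }.merge(parts).merge(entry)
end

entry = { "code" => "33.040.40", "description" => "Data communication networks" }
puts JSON.pretty_generate(to_jsonld(entry, "https://isoics.org/ics/ns/subgroup.jsonld"))
```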

UPDATED: added descriptionFull key
