alucic2 / cluster_htrc

Identifying the boundaries of main content of fiction and non-fiction works in the HathiTrust Extracted Features dataset.

License: Apache License 2.0

Jupyter Notebook 100.00%
clustering-analysis clustering-algorithm smoothing-methods scanned-documents digital-libraries extracting-features detecting-paratext-boundaries

cluster_htrc's Introduction

Identification of main content in the works included in the HathiTrust Extracted Features dataset

Code for clustering the digitized pages of works using the features available in the HathiTrust Extracted Features dataset v.2.0, with the aim of separating the main content of a work from its paratextual elements.

Reference: A. Lucic, R. Burke and J. Shanahan, "Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents," 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019, pp. 53-56, doi: 10.1109/JCDL.2019.00018. The conference paper is available through the DOI above.

Running the code

The Python libraries required to run the code are listed in the requirements.txt file. The code depends on the htrc-feature-reader Python library, which can be installed with either the pip or the conda package manager: pip install htrc-feature-reader or conda install -c htrc htrc-feature-reader

Motivation for the development of this method

This work was developed as part of the Reading Chicago Reading project at DePaul University in 2018. The HathiTrust Research Center Advanced Collaborative Support grant that the project received allowed us to explore a set of in-copyright and out-of-copyright fiction and non-fiction works that were included in the Extracted Features dataset and were related to our analysis of the One Book One Chicago program. To limit the extraction of text features to the main content of a work, we needed to establish where the main content begins and ends in the digitized pages. If paratext elements such as the Table of Contents, Epilogue, Bibliography, or Critical Introduction are not excluded before text measures are extracted from fiction and non-fiction works, they can skew the metrics obtained from the work (e.g. the count of locations or personal names). Paratext boundaries are not a metadata element that consistently accompanies the digital files included in digital libraries. Even when such information exists in the accompanying metadata files, it needs to be verified.

Modeling paratext as the outlier of main work

The conclusion of the work was that paratext elements lend themselves to being modeled as outliers of the main work. As the amount of paratext in a volume increases, however, it becomes harder to establish where the main content begins and ends.
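To make the idea of treating paratext pages as outliers concrete, here is a minimal, self-contained sketch. It is not the notebook code from this repository or the exact method of the paper; it only illustrates the general pattern of clustering pages, smoothing the labels, and reading off boundaries. It assumes a single illustrative feature (tokens per page), assumes that the cluster with the higher mean token count is the main content, and uses made-up page counts. The function names (two_means_1d, smooth_labels, main_content_bounds) are hypothetical.

```python
def two_means_1d(values, iterations=20):
    """Split 1-D values into two clusters; label 1 marks the higher-mean cluster.

    Assumption for this sketch: the higher-mean cluster is the main content.
    If all values are equal, every page is labeled 0.
    """
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each page to the nearer of the two cluster centers.
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if not low or not high:
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels


def smooth_labels(labels, window=5):
    """Majority vote over a sliding window, so isolated pages do not split a run."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(1 if 2 * sum(neighbourhood) > len(neighbourhood) else 0)
    return smoothed


def main_content_bounds(labels):
    """Return (first_page, last_page) of the longest run of 1s, or None."""
    best, start = None, None
    for i, lab in enumerate(labels + [0]):  # trailing 0 closes a final run
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best


# Made-up tokens-per-page counts: sparse front matter, dense main text with
# one nearly blank page (index 8), then sparse back matter.
token_counts = [12, 40, 95, 60, 30,
                310, 325, 298, 20, 315, 330, 305, 312, 299, 320,
                80, 70, 45, 15]

raw = two_means_1d(token_counts)
smoothed = smooth_labels(raw, window=5)
bounds = main_content_bounds(smoothed)
print(bounds)  # zero-based indices of the first and last main-content page
```

In this toy volume, the nearly blank page inside the main text is first labeled as paratext and then absorbed back into the main content by the smoothing step. The sketch also hints at why volumes with a lot of paratext are harder: once the sparse pages form a large share of the volume, a two-cluster split and a fixed smoothing window no longer separate the runs cleanly.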

Acknowledgment

We thank the HathiTrust Research Center for the Advanced Collaborative Support grant and for the use of the HathiTrust Research Data Capsule.

Future work

We plan to continue developing this method to establish the upper bound on the accuracy with which paratext elements can be identified and excluded from digital files. We also plan to explore the degree to which different paratext elements lend themselves to being identified in a work using automated methods.
