
dx's Introduction

dex ⠶ dx

dx is a mathematical indexer for the American Mathematical Society Graduate Studies in Mathematics catalogue dataset (a subproject of dex)

Usage

To get started, follow the instructions for downloading the dataset yourself (or wait for me to package it for distribution), then use the pre-prepared dataset loader module:

from dx.dataset import series_df, abstracts, readerships, reviews, titles, tocs
  • series_df is a pandas DataFrame containing the metadata for multiple book series from AMS (at time of writing 775 books)
  • abstracts, readerships, reviews, titles, tocs are Python lists extracted from this DataFrame [provided for convenience when using this as a dataset].
    • Additionally dx.dataset has dictionaries whose keys are the AMS bookstore's 10 topics (see topics.csv)
      • series_by_subject, abstracts_by_subject, readerships_by_subject, reviews_by_subject, titles_by_subject, tocs_by_subject
>>> titles[0]
'The General Topology of Dynamical Systems'
>>> abstracts[0]
"Topology, the foundation of modern analysis, arose historically as a way to organize ideas like
compactness and connectedness which had emerged from analysis. Similarly, recent work in dynamical
systems theory has both highlighted certain topics in the pre-existing subject of topological
dynamics (such as the construction of Lyapunov functions and various notions of stability) and also
generated new concepts and results (such as attractors, chain recurrence, and basic sets). This book
collects these results, both old and new, and organizes them into a natural foundation for all
aspects of dynamical systems theory. No existing book is comparable in content or scope. Requiring
background in point-set topology and some degree of “mathematical sophistication”, Akin's book
serves as an excellent textbook for a graduate course in dynamical systems theory. In addition,
Akin's reorganization of previously scattered results makes this book of interest to mathematicians
and other researchers who use dynamical systems in their work."
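The by-subject dictionaries can be pictured as a plain grouping of records by topic. This is a minimal sketch, not the package's own code: the field names `subject` and `title`, the subject labels, and the two placeholder titles are all made up for illustration and do not reflect the actual schema of series_df or the entries of topics.csv.

```python
from collections import defaultdict

# Toy records: field names and subject labels are hypothetical placeholders
records = [
    {"subject": "subject_a", "title": "The General Topology of Dynamical Systems"},
    {"subject": "subject_b", "title": "Placeholder title 1"},
    {"subject": "subject_a", "title": "Placeholder title 2"},
]


def group_by_subject(rows, field):
    """Map each subject to the list of values of `field` for its rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["subject"]].append(row[field])
    return dict(grouped)


titles_by_subject_sketch = group_by_subject(records, "title")
print(titles_by_subject_sketch["subject_a"])
```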

To model the titles with LDA:

from dx.lda.plot_lda_topics import plot_lda
plot_lda()

Topic modelling

So far I've taken a few different approaches to topic modelling (all using Latent Dirichlet Allocation), and the two sources of text are:

  • abstracts (AKA 'blurb')
  • table of contents (chapter and subchapter headings)

TODO: combine both of these for each book into a single corpus and model that.

Each of the following involves a grid search over the value to use for max_df (i.e. the document-frequency threshold above which the most common words are excluded in preprocessing), after which the results can be explored by reviewing the output images.

  • dx.lda.plot_abstracts: Model all text in the abstract for each book
    • This is more effective at removing all 'stopwords'
  • dx.lda.plot_abstracts_by_subject: Model all text in the abstract for each book, one subject area at a time
    • This is more insightful as to the variation within a particular sub-field
  • dx.lda.plot_tocs: Model all chapter/section titles for each book
  • dx.lda.plot_tocs_by_subject: Model all chapter/section titles for each book, one subject area at a time
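To make the effect of max_df concrete, here is a small pure-Python sketch. It assumes max_df has the meaning it has in scikit-learn's CountVectorizer when given as a float (the maximum proportion of documents a term may appear in). Whether dx uses that vectorizer is my assumption, and the toy documents are invented.

```python
from collections import Counter


def filter_by_max_df(docs, max_df):
    """Drop terms whose document frequency exceeds `max_df` (a proportion in (0, 1])."""
    tokenised = [doc.lower().split() for doc in docs]
    # Count the number of documents each term appears in (not total occurrences)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    keep = {term for term, df in doc_freq.items() if df / n_docs <= max_df}
    return [[t for t in tokens if t in keep] for tokens in tokenised]


# Invented toy corpus, not drawn from the dataset
toy_docs = [
    "theory of dynamical systems",
    "theory of topological groups",
    "theory of measure and integration",
    "introduction to knots",
]

# A tiny "grid" over candidate thresholds, reporting the surviving vocabulary size
for max_df in (0.5, 0.75, 1.0):
    vocab = {t for tokens in filter_by_max_df(toy_docs, max_df) for t in tokens}
    print(max_df, len(vocab))
```

Lower thresholds strip the words shared by most documents (here "theory" and "of"), which is why the choice of value changes which topics the model can find.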

Doing this is computationally expensive, so multiprocessing is used to run on all available cores (permitting parallel calculation of each LDA model with a dedicated process at 100% per core).
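The one-process-per-model pattern can be sketched with the standard library alone. This is not the code in dx.lda: `fit_one` is a hypothetical stand-in that returns a placeholder score instead of fitting a model, and `grid_search` and `best` are names invented for this example.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def fit_one(max_df):
    """Stand-in for fitting one LDA model; returns (max_df, placeholder score)."""
    # Hypothetical placeholder work: a real worker would vectorise and fit here
    score = round(1.0 - abs(max_df - 0.7), 3)
    return max_df, score


def grid_search(values, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Evaluate `fit_one` for every candidate value using a pool of workers."""
    max_workers = max_workers or os.cpu_count()
    with executor_cls(max_workers=max_workers) as pool:
        # map preserves the input order of `values`
        return list(pool.map(fit_one, values))


def best(results):
    """Pick the (max_df, score) pair with the highest score."""
    return max(results, key=lambda pair: pair[1])
```

From a script this would be called under an `if __name__ == "__main__":` guard, for example `best(grid_search([0.5, 0.6, 0.7, 0.8, 0.9]))`. Passing `executor_cls=ThreadPoolExecutor` is handy for a quick check of the control flow, though threads would not give the per-core speed-up described above for CPU-bound model fitting.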

Limitations

The topic modelling was initially limited by the dataset size: I've seen references to 600 documents being the estimated minimum viable size of a newsgroup dataset for LDA, while a corpus of shorter documents (e.g. tweets) would need on the order of 5,000 to 10,000.

This limitation motivated the expansion of this project to the entire AMS catalogue beyond just the GSM series, so far reaching around 2,000 titles (see detailed inventory below).

Book series included

This was initially intended to cover the GSM (Graduate Studies in Mathematics) book series, one of my favourite mathematical book series. The catalogue scraped here has expanded to cover other series from the AMS:

  • gsm: Graduate Studies in Mathematics (212 titles)

    "The volumes in this series are specifically designed as graduate studies texts, but are also suitable for recommended and/or supplemental course reading. With appeal to both students and professors, these texts make ideal independent study resources. The breadth and depth of the series coverage make it an ideal acquisition for all academic libraries that support mathematics programs."

  • chel: AMS Chelsea Publishing (220 titles)

    "some of the most important classics that were once out of print available to new generations of mathematicians and graduate students"

  • conm: Contemporary Mathematics (770 titles)

    "high-quality, refereed proceedings written by recognized experts in their fields maintains high scientific standards. Volumes draw from worldwide conferences and symposia sponsored by the American Mathematical Society and other organizations"

  • stml: Student Mathematical Library (91 titles)

    "The AMS undergraduate series, the Student Mathematical Library, is for books that will spark students' interests in modern mathematics and increase their appreciation for research. Books published in the series emphasize original topics and approaches. The step from mathematical coursework to mathematical research is one of the most important developments in a mathematician's career. To make the transition successfully, the student must be motivated and interested in doing mathematics rather than merely learning it."

  • surv: Mathematical Surveys and Monographs (264 titles)

    "detailed expositions in current research fields... survey of the subject along with a brief introduction to recent developments and unsolved problems"

  • amstext: AMS Pure and Applied Undergraduate Texts (49 titles)

    "intended for undergraduate post-calculus courses and, in some cases, will provide applications in engineering and applied mathematics. The books are characterized by excellent exposition and maintain the highest standards of scholarship. This series was founded by the highly respected mathematician and educator, Paul J. Sally, Jr"

  • amsip: AMS/IP Studies in Advanced Mathematics (59 titles)

    "jointly published by the AMS and International Press, includes monographs, lecture notes, collections, and conference proceedings on current topics of importance in advanced mathematics. Harvard University Professor of Mathematics Shing-Tung Yau is Editor-in-Chief for the series"

  • cworks: Collected Works (50 titles)

    "presents the substantial body of work of many outstanding mathematicians. Some collections include the complete works of an individual, while others feature selected papers. Readers can follow the major ideas and themes that developed over the course of a given mathematician's career."

  • crmp: CRM Proceedings & Lecture Notes (56 titles)

    "encompasses conference proceedings and lecture notes from important research conferences held at the Centre de Recherches Mathématiques at the Université de Montréal. This series is co-published by the AMS and the Centre de Recherches Mathématiques"

  • dimacs: DIMACS: Series in Discrete Mathematics and Theoretical Computer Science (76 titles)

    "includes conference and workshop proceedings and volumes on education in discrete mathematics and theoretical computer science. Volumes are derived from programs at Rutgers University's Center for Discrete Mathematics and Theoretical Computer Science and also sponsored by Princeton University, AT&T Labs Research, Bell Labs (Lucent Technologies), Cancer Institute of New Jersey (CINJ), NEC Research Institute, and Telcordia Technologies."

  • hmath: History of Mathematics (45 titles)

    "compelling historical perspectives on the individuals and communities that have profoundly influenced mathematics' development. Each book constitutes a valuable addition to an historical or mathematical book collection. Volumes 4 through 39 were co-published with the London Mathematical Society. From volume 40 on, these volumes are published by the AMS."

  • text: AMS/MAA Textbooks (56 titles)

    "cover all levels of the undergraduate curriculum with a focus on textbooks for upper-division students. They are written by college and university faculty and are carefully reviewed by an editorial board of teaching faculty"

With only a few omitted due to technicalities, the total dataset size is currently 1935 of 1963 titles (so it approaches the 2,000-title mark, a reasonable size, and is about an order of magnitude larger than the initial dataset!).

Reparsing

If you make a change to the parser, run dx.dataset.reparse_all_series() to check it's working as expected, and dx.dataset.reparse_all_series(overwrite_pickles=True) to overwrite the stored pickles. Alternatively, back up your pickles first (a simple shell script, back_up_pickles.sh, is included for this) and then overwrite them directly.

  • Batched multiprocessing on all cores is used to speed up reparsing, as for LDA computation (see above)
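For anyone who prefers to take the backup from Python before overwriting, the following is a rough sketch. It does not reproduce back_up_pickles.sh: the function name, the directory arguments, and the timestamped folder naming are all assumptions made for illustration. It simply copies every `.p` file into a fresh backup folder.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up_pickles(pickle_dir, backup_root, suffix=".p"):
    """Copy every pickle in `pickle_dir` into a new timestamped folder under `backup_root`."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    dest = Path(backup_root) / f"pickles_{stamp}"
    # Fail loudly rather than silently mixing two backups in one folder
    dest.mkdir(parents=True, exist_ok=False)
    copied = []
    for src in sorted(Path(pickle_dir).glob(f"*{suffix}")):
        shutil.copy2(src, dest / src.name)  # copy2 also preserves timestamps
        copied.append(dest / src.name)
    return copied
```

The originals are left in place, so the reparse can then be run with overwrite_pickles=True while the copies remain available for comparison.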

Extension to MSC

The AMS website includes topics from the Mathematics Subject Classification (MSC), which would be interesting either to validate or to explore through the topic models (i.e. cross-reference the latent topics defined by LDA with the MSC labels).
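One simple way to cross-reference latent topics with MSC labels is a contingency count of the two. The sketch below assumes each book has already been assigned a dominant topic index and a single two-digit MSC code. The pairings are invented toy data, not results from the dataset, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Toy input: (dominant LDA topic index, two-digit MSC code) per book.
# The pairings are made up for illustration, not taken from the dataset.
book_labels = [
    (0, "54"), (0, "54"), (0, "37"),
    (1, "37"), (1, "37"),
    (2, "11"),
]


def cross_tabulate(pairs):
    """Count how often each MSC code occurs within each latent topic."""
    table = defaultdict(Counter)
    for topic, msc in pairs:
        table[topic][msc] += 1
    return table


def majority_label(table):
    """For each topic, report its most common MSC code and that code's share."""
    summary = {}
    for topic, counts in table.items():
        msc, n = counts.most_common(1)[0]
        summary[topic] = (msc, n / sum(counts.values()))
    return summary


summary = majority_label(cross_tabulate(book_labels))
for topic, (msc, share) in sorted(summary.items()):
    print(f"topic {topic}: MSC {msc} ({share:.0%} of books)")
```

A topic dominated by one code suggests the model has recovered something close to an MSC area, while a topic spread evenly over several codes may be capturing a theme that cuts across the classification.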

Extension to subject indexes

Additionally, I'd really like to see the indexes added as the 'documents' for topic modelling (simply removing the page numbers and collapsing the list into a single string would suffice).

This would probably require further preprocessing (but in many cases it's available from images and this can be OCR'd reasonably well with tesseract). That might come more under "labour of love" than I'm currently willing to do!
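The "remove the page numbers and collapse into a single string" step might look like the following sketch. It assumes each index entry sits on its own line with its page references trailing after a comma, which real OCR output may well not satisfy. The page numbers in the toy index are invented.

```python
import re

# Matches a trailing run of page references such as ", 12, 45-47" or ", 30–35, 88"
PAGE_REFS = re.compile(
    r"[,\s]+\d+(?:\s*[-–]\s*\d+)?(?:\s*,\s*\d+(?:\s*[-–]\s*\d+)?)*\s*$"
)


def index_to_document(index_lines):
    """Strip trailing page references from each entry and join entries into one string."""
    entries = []
    for line in index_lines:
        entry = PAGE_REFS.sub("", line).strip()
        if entry:  # skip blank lines
            entries.append(entry)
    return " ".join(entries)


# Toy index: the terms echo the abstract quoted earlier, the page numbers are made up
toy_index = [
    "attractor, 42, 57-60",
    "chain recurrence, 12",
    "Lyapunov function, 30–35, 88",
    "",
]
print(index_to_document(toy_index))
```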

dx's Issues

Deduplicate AMS subpackages

For simplicity, and to avoid a rewrite getting in the way of crawling the extended dataset, I just duplicated the gsm subpackage as chel, surv, conm, stml.

This obviously is not very DRY, and all the modules which are unchanged between these duplicates should be moved into a single shared subpackage which they all then access.

Note that this won't apply to all files: for instance, each series will need distinct handling of the product code letters (some series have 'S', some have 'R', some have 'H').

This variation is initially handled by the JavaScript regex in product_code_check.js, but is then handled once again in Python.

Another source of variation is the default pickle file. Looking back, it seems I ran the crawl function in crawl.py manually and then saved the pickle manually too. This should probably be changed to something programmatic now, e.g. an f-string with the series title, the number of books, and the .p extension.
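A programmatic default could be as small as the sketch below. The exact naming pattern (series code, book count, then `_books.p`) and the function name are my own assumptions for illustration, not the convention the package currently uses.

```python
def default_pickle_name(series_code, n_books):
    """Build a pickle filename from the series code and the number of books."""
    # Hypothetical pattern: <series>_<count>_books.p
    return f"{series_code}_{n_books}_books.p"


print(default_pickle_name("gsm", 212))
```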

Everything besides these files can go in a shared subpackage - but I’m not in a rush to do this refactor just yet, I’ll enjoy doing it with my full attention at a later date, and from there possibly even expand to covering many if not all of the other series in the AMS catalogue (at which point such rampant duplication would be an outright barrier to continuation and package readability, never mind size).
