White Whale Readme

Overview

White Whale is a tool for comparing the works of an author. The repetition of ideas, language, and allusions has struck me as I have made my way through Herman Melville's complete works. You can draw lines between the works not only on topics like sailing, searching, and travel, but also through the classical allusions to Ovid and Homer in White Jacket and Mardi, or the Biblical allusions in Moby-Dick and Clarel (which unfortunately could not be included in the search; see below). I thought it would be helpful for new and seasoned academics and admirers of literature to be able to cross-reference quickly between and within works to aid their scholarship and understanding of these texts.

For more about White Whale and the authors included in the search, visit the about page.

Technical Brief

Originally, White Whale's search was built on the archive.org and Open Library APIs. Unfortunately, this led to slow searches, and the APIs can be quite difficult to work with, so I have since written a search algorithm from scratch in TypeScript and use it in place of the APIs.

The front-end is built with Next.js and the Bulma.io CSS framework. The data is stored in a NoSQL database with Macrometa.

Archive.org APIs

The API documentation on Open Library is a bit opaque and out of date. I originally intended to use a combination of their search and search inside APIs to let users search for an author they were interested in and then search within all of that author's books. I quickly realized this wasn't possible, as there are many editions of a given book and, furthermore, the editions are not always labeled correctly or well. That left me with hand-picking volumes and using the search inside API.

The documentation gives the parameters and an example call, but doesn't explain what many of the parameters are or where one would find them. The parameters needed in addition to the search query are:

  • hostname
  • item_id
  • doc
  • path

All of these parameters are available with the item_id by querying archive.org/metadata/item_id and parsing the response.
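
As a rough sketch of that lookup, the function below pulls the search inside parameters out of an already-parsed metadata response. The field names `server` and `dir`, and the idea that they map to hostname and path, are assumptions made for illustration and should be checked against a real response. `doc` simply defaults to the item_id here.

```typescript
// Hypothetical shape of the parts of the metadata response used here.
// The `server` and `dir` field names are assumptions; verify them against a live response.
interface ItemMetadata {
  server?: string;
  dir?: string;
}

interface SearchInsideParams {
  hostname: string;
  item_id: string;
  doc: string;
  path: string;
}

// URL to query for an item's metadata (pattern taken from the README text).
function metadataUrl(itemId: string): string {
  return `https://archive.org/metadata/${encodeURIComponent(itemId)}`;
}

// Map a parsed metadata response onto the parameters the search inside API needs.
function toSearchInsideParams(itemId: string, meta: ItemMetadata): SearchInsideParams {
  if (!meta.server || !meta.dir) {
    throw new Error(`Metadata for ${itemId} is missing server or dir`);
  }
  // doc defaults to the item_id; it may differ for some scanned books.
  return { hostname: meta.server, item_id: itemId, doc: itemId, path: meta.dir };
}
```

Keeping the parsing separate from the network request makes it easy to test against saved responses.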

The documentation fails to note a few things about this query, however. It does note that the doc parameter is often, but not always, the same as the item_id. doc is not a field in the metadata, but I discovered through trial and error that books that were scanned and processed with DjVu, and that have a DjVu XML file listed in their metadata, have a separate doc value. This doc value is derived from the name of the DjVu XML file by removing the file format information.
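
A minimal sketch of that derivation follows. It assumes the metadata lists files as objects with a `name` field and that the DjVu XML file name ends in `_djvu.xml`; both are assumptions for illustration, and the exact suffix to strip should be confirmed against real items.

```typescript
// Hypothetical entry in the metadata file list; only the name is used.
interface ItemFile {
  name: string;
}

// Assumed naming convention for the DjVu XML file.
const DJVU_XML_SUFFIX = "_djvu.xml";

// Return the doc value: the DjVu XML file name without its format suffix,
// or the item_id when no such file is listed.
function deriveDoc(itemId: string, files: ItemFile[]): string {
  const djvuXml = files.find((f) => f.name.toLowerCase().endsWith(DJVU_XML_SUFFIX));
  if (!djvuXml) {
    return itemId;
  }
  return djvuXml.name.slice(0, djvuXml.name.length - DJVU_XML_SUFFIX.length);
}
```

Falling back to the item_id matches the documented case where doc and item_id are the same.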

That was the first hurdle I came across in working with the API. The second came from wanting to link directly to the page in a given book where a search match appears. What I assumed to be a fairly simple task became one of the most time-consuming elements of the project, because the page numbers given by the search inside API do not line up with the page numbers used in the URL structure archive.org uses to navigate between the pages of a book.

I could not find documentation on this issue anywhere, but it turns out that the search inside API gives you the leaf number that a match appears on. The leaf number counts the scanned images in the book, but the URL structure uses a separate page number based on the page numbers actually printed in the book. Therefore, things like the title page, copyright page, foreword, introduction, and any other page without a number cannot be reached with a page number. Instead they are referenced by the string nx, where x is the leaf number minus one (because the first leaf of a book doesn't have any reference number). Note that some books do have page numbers that for some reason aren't included in the metadata, so every page of those books is referenced by nx.
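
The rule can be sketched as a small helper. The function names and the `archive.org/details/item_id/page/...` link pattern are assumptions for illustration; only the "n plus leaf minus one" rule comes from the description above.

```typescript
// Return the page reference for a leaf: the printed page number when one is
// known, otherwise "n" followed by the leaf number minus one.
function pageRef(leaf: number, printedPage?: string): string {
  if (printedPage !== undefined && printedPage !== "") {
    return printedPage;
  }
  return `n${leaf - 1}`;
}

// Build a direct link to a page. The URL pattern is an assumption.
function pageUrl(itemId: string, leaf: number, printedPage?: string): string {
  return `https://archive.org/details/${itemId}/page/${pageRef(leaf, printedPage)}`;
}
```

For a book whose metadata has no page numbers at all, `printedPage` is always omitted and every link uses the nx form.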

You are able to query this page number information by using the following API call:

https://hostname/BookReader/BookReaderJSIA.php?id=item_id&itemPath=path&server=hostname&format=json&subPrefix=doc
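
A small helper can assemble that call from the four values. This is only a sketch: the interface and function names are hypothetical, and whether the endpoint expects the path to be percent-encoded is an assumption.

```typescript
// Hypothetical container for the values gathered from the metadata.
interface BookReaderParams {
  hostname: string;
  item_id: string;
  path: string;
  doc: string;
}

// Build the BookReaderJSIA.php URL, percent-encoding each query value.
function bookReaderUrl(p: BookReaderParams): string {
  const pairs: Array<[string, string]> = [
    ["id", p.item_id],
    ["itemPath", p.path],
    ["server", p.hostname],
    ["format", "json"],
    ["subPrefix", p.doc],
  ];
  const query = pairs
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `https://${p.hostname}/BookReader/BookReaderJSIA.php?${query}`;
}
```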

What it returns is not easy to navigate or index into, so I opted to use the information given to build a simple JSON file that maps leaf numbers to their page numbers. One other thing to watch out for, which I still do not understand, is that some books skip leaf numbers, which means you cannot assume that leaf numbers match the number of scanned images in the book.
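
One way to sketch such a file is a plain object keyed by leaf number rather than an array indexed by position, so that skipped leaf numbers do not shift the entries. The `LeafEntry` shape is hypothetical and does not reflect the actual response format.

```typescript
// Hypothetical simplified entry extracted from the page number response.
interface LeafEntry {
  leaf: number;
  page: string | null; // printed page number, or null when the page has none
}

type LeafPageMap = Record<string, string>;

// Build a leaf-to-page lookup keyed by leaf number, so gaps in the
// leaf sequence are harmless.
function buildLeafPageMap(entries: LeafEntry[]): LeafPageMap {
  const map: LeafPageMap = {};
  for (const { leaf, page } of entries) {
    map[String(leaf)] = page !== null ? page : `n${leaf - 1}`;
  }
  return map;
}

// Look up a leaf; returns undefined for leaf numbers the book skips.
function lookupPage(map: LeafPageMap, leaf: number): string | undefined {
  return map[String(leaf)];
}
```

The result can be written out with `JSON.stringify(map)` and read back without any extra parsing.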

Other Things I Learned

This was a learning project for me. I had never used Node.js, Next.js, or Bulma before. In addition, I had never built anything that relied on APIs and had never built an API before. I learned a lot about asynchronous JavaScript, parsing and building JSON objects, and writing files. I originally chose not to use a database and to rely on JSON files instead, both as a learning experience and because the data storage needs were relatively simple.

I initially developed the program to store data as local JSON files, without realizing that in production Next.js does not support writing files to the server. In order to keep the same basic data structures, I switched to a NoSQL database for storage.

This is the first major project where I have used a testing library to build tests as I went along. There are still some inconsistencies in the code that I would like to clean up when I have the time. For example, some components are React classes while others are functional components.
