cu-mkp / fieldnotes-restructuring

fieldnotes-restructuring's Introduction

fieldnotes

Various tasks and records regarding the Making and Knowing Project's fieldnotes produced as part of Project courses and activities of hands-on skillbuilding and historical reconstructions.

Works in tandem with https://github.com/cu-mkp/fieldnotes-content, where the HTML content pages for M&K fieldnotes are housed.

The Project strives to create openly accessible, sustainable long-term online versions of these fieldnotes at http://fieldnotes.makingandknowing.org/.

NOTE: the current landing page is an incomplete index.

From 2014 to 2017, the Making and Knowing Project recorded fieldnotes using a Columbia University wikispace. As of Fall 2018, this service was no longer supported, and all data was exported into an S3 bucket in order to preserve all notes. However, the export did not preserve the navigation of the wikispace (the notes were simply 'pages' and were not actually structured in any sort of hierarchy).

From Fall 2018 onwards, fieldnotes were kept as Google Docs (and other files in the Google suite) within the Project Google Drive.

Goals:

create user-friendly and complete index of pages
ensure pages are as complete as possible (internal links, images) and maintain page html in most sustainable way possible

Summary of tasks and progress

For fieldnotes originally from the wiki:

Within the wiki, the pages were organized as follows:
- Semester + Year
  - Student Names
    - Activities
This hierarchy was not preserved but is the desired one in order to organize and understand all the pages archived in the original export to s3.
organize_field_notes.py represents an attempt to recreate the original structure
- Starting from the wiki's original side navigation bar as the highest level, it attempts to recursively follow links down from Semester+Year -> Fieldnotes -> Student Names -> Activities wherever possible
- Creating a tree structure out of these links yields something close to the original structure of the wikispace
- Unfortunately, many links are broken, missing, incorrect, or do not follow the typical structure
- In order to document and fix these errors, they are compiled in various CSVs which can be used to manually correct the links where it is obvious where they should point
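The recursive link-following idea described above can be sketched roughly as follows. This is not the actual organize_field_notes.py; it is a minimal illustration that assumes the exported pages sit in one flat local directory and link to each other by (possibly percent-encoded) file name. The names `LinkCollector` and `build_tree` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_tree(page, root, missing, seen=None):
    """Recursively follow local links from `page`, returning a nested dict.

    Links whose target file does not exist are appended to `missing` as
    (source page, target) pairs instead of being followed. Each page is
    visited once, so a page linked from two parents appears only under
    the first parent that reaches it.
    """
    seen = set() if seen is None else seen
    seen.add(page)
    parser = LinkCollector()
    parser.feed((root / page).read_text(encoding="utf-8"))
    children = {}
    for href in parser.hrefs:
        target = unquote(href)
        if "://" in target or target.startswith(("#", "mailto:")):
            continue  # skip external links, in-page anchors, and email links
        if target in seen:
            continue  # skip pages already visited (e.g. "back" links)
        if not (root / target).is_file():
            missing.append((page, target))
            continue
        children[target] = build_tree(target, root, missing, seen)
    return children
```

Calling `build_tree` on the navigation page with an empty `missing` list returns the nested tree and fills `missing` with the broken links, which could then be written out to a CSV for manual correction.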
Another source of complication is the encoding of various special characters, e.g. accented vowels
- It appears that during the many conversions from original documents to wikispace pages to Google Docs to AWS S3 bucket to Windows files, some of the files had their special characters modified. Links break when, for example, two different encodings of é are used.
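To illustrate how two encodings of é can break a link, here is a small sketch using Python's standard `unicodedata` module. It assumes the mismatch is between the composed and decomposed Unicode forms, which is one common cause but has not been confirmed for these files. The file name and the helper `normalize_name` are hypothetical.

```python
import unicodedata

composed = "caf\u00e9.html"     # é as a single code point (NFC form)
decomposed = "cafe\u0301.html"  # e followed by a combining acute accent (NFD form)


def normalize_name(name):
    """Return the NFC-normalized form of a file name or link target."""
    return unicodedata.normalize("NFC", name)


print(composed == decomposed)                                  # False
print(normalize_name(composed) == normalize_name(decomposed))  # True
```

Normalizing both the file names and the link targets to the same form before comparing them would make such pairs match.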
Remaining hurdles include:
- fixing HTML hyperlinks so that they point to the correct file in the new structure

fieldnotes-restructuring's Issues

fix links to stylesheet

With new file structure, relative links to the stylesheet for fieldnotes no longer work.
Similarly to #13, we need to fix these.

identify error in URL of GNS field notes

As part of my work on #17, I came across the following field notes by GNS (Njeri Ndungu, SP16) but am experiencing an error with the non-S3 URL.

The S3 Object URL takes me to the content (as it should): https://s3.amazonaws.com/fieldnotes.makingandknowing.org/content/GNS+-+Field+Notes+SP16+-+Mayerne+'or+coleur'++recipe+with+varnish.html

However, the equivalent one using the M&K domain returns an XML error. I copied and pasted this into my browser directly from the S3 URI: fieldnotes.makingandknowing.org/content/GNS - Field Notes SP16 - Mayerne 'or coleur' recipe with varnish.html which converts into https://fieldnotes.makingandknowing.org/content/GNS%20-%20Field%20Notes%20SP16%20-%20Mayerne%20'or%20coleur'%20recipe%20with%20varnish.html

Could this have something to do with the single quote/tick characters? @gschare, any insights?

Use generated mapping to arrange files into a hierarchical directory structure

Now that we have generated a mostly-complete mapping.csv and found and corrected most of the broken links in missing.csv, we are ready to use the mapping: iterate over each row in the spreadsheet, locate the file named in the first column, and rename it to the path in the second column (creating intermediate folders along the way).
This can be done just using bash scripting, and although the Python script should prevent this, we should throw errors in case we try to rename a file to something that already exists.
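The issue suggests bash, but the same steps can be sketched in Python for consistency with the rest of the repo. This is a minimal sketch that assumes mapping.csv has no header row and holds the old path in the first column and the new path in the second, both relative to a common root. The function name `apply_mapping` is hypothetical.

```python
import csv
from pathlib import Path


def apply_mapping(mapping_csv, root):
    """Move each file named in column 1 to the path given in column 2.

    Raises FileExistsError rather than overwriting an existing target.
    """
    root = Path(root)
    with open(mapping_csv, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue  # ignore blank lines
            src, dst = root / row[0], root / row[1]
            if dst.exists():
                raise FileExistsError(f"refusing to overwrite {dst}")
            # Create intermediate folders along the way.
            dst.parent.mkdir(parents=True, exist_ok=True)
            src.rename(dst)
```

Because the function stops at the first collision, files from earlier rows will already have been moved; running it on a copy of the content first is the safer choice.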

assess loose html files in content/ in S3

These are any files that were not outright linked to from the intermediate index pages, and so they have not been reorganized/restructured into the new system.

Go through these to determine what they are and whether they have content that should be preserved and where they belong. (easiest way to do this now is to use https://fieldnotes.makingandknowing.org/content/index.html)
--> @njr2128

After this, if the files SHOULD be preserved, determine method to add them to their proper location and index
--> @gschare (at a later point and probably in a new issue)

Update all URLs inside HTML files to reflect new folder structure

With the new folder structure established in the mapping, it is necessary to update the <a> tag href links in every file to reflect the new locations of the files they point to.
I imagine the simplest algorithm to do this would be:

1. Iterate over the files in the second column of mapping.csv.
2. For each file, parse the HTML content for <a> tags with href attributes pointing to any string in the first column of mapping.csv.
3. For each of those tags, replace the href attribute value with the corresponding new value implied by mapping.csv.

The biggest question is how to parse and update the HTML safely, quickly, and elegantly.
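One possible shape for the three steps above, offered as a rough sketch rather than a settled answer to the parsing question: a regular expression that only touches double-quoted href values inside <a> tags and leaves the rest of the markup byte-for-byte unchanged. It assumes old links are bare (possibly percent-encoded) file names, and that new paths in the mapping are relative to the content root and use forward slashes. `rewrite_links` and its parameters are hypothetical names. A full HTML parser would be more robust against unusual markup.

```python
import posixpath
import re
from urllib.parse import quote, unquote

# Matches the href value of an <a> tag; only double-quoted values are handled.
HREF = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html, mapping, page_dir):
    """Rewrite hrefs found in `mapping` (old name -> new path).

    `page_dir` is the new directory of the page being rewritten, so that
    each new link can be made relative to it.
    """

    def swap(match):
        old = unquote(match.group(2))
        if old not in mapping:
            return match.group(0)  # leave unknown links untouched
        relative = posixpath.relpath(mapping[old], start=page_dir)
        return match.group(1) + quote(relative) + match.group(3)

    return HREF.sub(swap, html)
```

For a page that now lives in `2015-Spring/Student B`, a link to a file mapped to `2015-Spring/Student A/Notes.html` would be rewritten to `../Student%20A/Notes.html`.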

Remaining loose html

The entire list of loose html (as per #17) with NJR's assessment is in the attached spreadsheet (also in the GD with the same name)
2021-07-14_S3-content_index.xlsx

There are 3 columns:

File name and link
Notes - identification (if possible), description of contents
Action - with one of three values
a. Remove - don't need this page/file --> should be put into one folder together
b. Add to [Semester+year] - it has been identified and should be added to the listings in that semester and year --> this has been completed by Greg as part of #17
c. [blank] - need to determine what to do with this - see below

The majority of the [blank] action items are either student/staff profiles or teaching resources:

The teaching resources should probably be separated from the content/ directory as they are not fieldnotes proper, though they should be preserved (and reviewed separately by NJR and PHS)
The profiles: either need to be added to their semesters or removed. The staff profiles don't have a related semester, so they perhaps need to be put together or just removed

create field notes tree structure

For the "course archives" links found here: http://fieldnotes.makingandknowing.org/mainSpace/space.menu.html, map out the link tree.
Collapsing the semester+year and then "field notes," organize by each student's name

create tagged version of archived field notes content site

Take archive from S3 as is now and create a tagged version of this repo to correspond to that version of the original archive from wiki in the S3 field notes bucket.
i.e., load all html from that bucket into this github repo as a tagged version.

To be repeated with any new "versions" of S3 bucket, i.e., after #12 and #13.

simplify html of select fieldnotes (GD)

Because these pages were pulled down and converted from GD, they retain the convoluted HTML that the conversion generates. It would be better to simplify them for long-term storage, but they are OK for now.

To do:

Bostock_Safety_Protocol_FA18/SaharBostockSafetyProtocolFA18.html
Min-Lim-Safety-Protocol-FA18

future version of field notes site

Run through CloudFront so that we can point it constantly to the most current version of the fieldnotes repo (like dev site for edition-webpages) so that it is always up to date

add identified loose html files into existing "content/" structure

As per issue #17, loose html files were assessed and a number were identified and should now be added/moved to their correct location in the new content/ S3 structure.

The loose html files have been catalogued in the GD sheet, 2021-07-14_S3-content/index. In the column "Action," files that have been identified have values that start with "Add to" + "[semester+year]". Greg should now go through these and move them to their correct directory based on the information in the "Action" column and any additional info needed from the "Notes" column.

Fix our spreadsheet mess

In our previous attempts at reorganizing the field notes we had several active spreadsheets that were not version controlled and this became very confusing.
This time we have a new plan:

Generate a new spreadsheet documenting missing URLs. Version control this file and use only this one.
Create a method for updating the file without overwriting old corrections.
Go through old spreadsheets documenting missing URLs and corrections and transfer the corrections to the new spreadsheet.
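One way the "update without overwriting old corrections" step could work is to key existing corrections by the missing file name and carry them forward each time the list of missing URLs is regenerated. The sketch below assumes a two-column layout (missing name, corrected name); that layout and the function name `merge_corrections` are illustrative, not taken from the repo.

```python
def merge_corrections(existing_rows, new_missing):
    """Carry old corrections forward when the missing list is regenerated.

    existing_rows: iterable of (missing_name, corrected_name) pairs
    new_missing:   iterable of missing file names from the latest run
    Returns one (missing_name, corrected_name) pair per name in
    new_missing, with an empty correction where none is known yet.
    """
    corrections = {missing: fixed for missing, fixed in existing_rows if fixed}
    return [(name, corrections.get(name, "")) for name in new_missing]
```

Rows for files that are no longer missing drop out of the result, so the previous version of the file should stay under version control in case a correction is needed again.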

S3 urls vs. actual working ones (character encoding problems)

Certain URLs with special characters (particularly spaces) are handled one way by S3 but another way when actually accessed as a webpage. For example:

In S3:
fieldnotes.makingandknowing.org/mainSpace/David%C2%A0McClure+-+Field+Notes+FA15.html
(notice + for spaces) -- this does not work if you follow the link (404 error)

To actually access the page:
http://fieldnotes.makingandknowing.org/mainSpace/David%C2%A0McClure%20-%20Field%20Notes%20FA15.html
(spaces are represented as %20)
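The two forms above correspond to two standard percent-encoding conventions, which Python's `urllib.parse` exposes as `quote_plus` (spaces become `+`) and `quote` (spaces become `%20`). The sketch below reproduces both from the same file name and converts the S3-style form into the working one. The helper name `s3_style_to_url_path` is hypothetical, and this only covers the path portion after `/mainSpace/`.

```python
from urllib.parse import quote, quote_plus, unquote_plus

# \u00a0 is a non-breaking space, which percent-encodes as %C2%A0.
name = "David\u00a0McClure - Field Notes FA15.html"

print(quote_plus(name))  # David%C2%A0McClure+-+Field+Notes+FA15.html
print(quote(name))       # David%C2%A0McClure%20-%20Field%20Notes%20FA15.html


def s3_style_to_url_path(s3_style):
    """Convert a '+'-for-space encoded name into a '%20'-for-space one."""
    return quote(unquote_plus(s3_style))


print(s3_style_to_url_path("David%C2%A0McClure+-+Field+Notes+FA15.html"))
# David%C2%A0McClure%20-%20Field%20Notes%20FA15.html
```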

organize field notes from wiki-backup

In order to best complete #3 (which was originally https://github.com/cu-mkp/m-k-manuscript-data/issues/1587), we need to organize and describe the files currently found in fieldnotes.makingandknowing.org

Some are named with descriptive info, others have enough info in the file contents themselves, while others are more difficult to decipher.

As possible, organize the notes by semester, year, person (author), and topic

create more complete and user-friendly index for field notes

To include all content from http://fieldnotes.makingandknowing.org/
as well as GD spreadsheet, FA18+other-fieldnotes-list+links
All content found in dedicated s3 bucket in AWS

Need to create a complete index of s3 contents and present in a more user-friendly interface

replace all references back to GD in FA18 field notes

Taken from last part of cu-mkp/m-k-manuscript-data#1199

Outstanding tasks to make FA18 field notes complete for linking in DCE

Steps forward:

for every reference to GD, find equivalent version and replace the link.
Upload corrected (non-GD link) versions to AWS in the proper location in the directory: Amazon S3/fieldnotes.makingandknowing.org/2018-Fall in the correct subfolder.

Confirmed: overwriting the file by reuploading it with the same name is successful (same URL pattern is used)

make index.html for folders in spring 2015

Spring 2015 does not have separate index pages for field notes, assignments, and class notes.
To make these folders work properly, we need index.html pages for each of them.
These must be created manually.

NJR to find correct links for missing

related to #6

Greg's instructions:

Here is the spreadsheet detailing which pages in the field notes wiki appear to be missing (i.e. broken links).
https://github.com/cu-mkp/fieldnotes/blob/main/missing.csv
The information here consists of the filename of the page where the missing file was linked and the name of the missing file itself.
Note that these file names are also the URLs, so you should be able to copy-paste them with the prefix "http://fieldnotes.makingandknowing.org/mainSpace/" and it will work.

If you manage to locate any of these missing files, you can add the name of the missing file as it appears here and the corrected name to the corrections spreadsheet:
https://github.com/cu-mkp/fieldnotes/blob/main/corrections.csv

The "missing.csv" spreadsheet is automatically generated, so any edits on it will be overwritten.
The "corrections.csv" spreadsheet is created manually, but it is very important that you edit it as a CSV and remain conscious of special unicode characters. To this end, it is probably best if you simply copy-paste the HTML-quoted URL or unquoted filename into the spreadsheet rather than typing it out manually, because sometimes seemingly-innocuous spaces are not as they appear...

Once the missing files have been accounted for to our satisfaction or simply ignored, I can get started on moving the files into their new directory structure.

NJR is tracking the full URLs (due to S3 problems noted in #9) in the GD sheet, fieldnotes_20210507_missing