This program transcribes images of a handwritten fieldbook into text. It uses three different services:
- Microsoft Azure Cognitive Services
- Amazon Web Services
- Google Cloud Platform.
The results are stored in a Google sheet. See the 'contrasted_pages' sheet.
- Transcribe the writing in the fieldbook with high-quality results.
- Test techniques to improve the results of transcription, such as increasing contrast of images.
- Compare the quality of transcription by three companies at this time.
There is a blog post about this process.
git.ipynb - git commands to commit and push work done on Google Colab
fieldbook_antioch_1_get_images.ipynb - Download, contrast and inspect images before transcription
fieldbook_antioch_extract_text.ipynb - Transcribe images
Images are stored in these folders:
The parent folder.
Raw images downloaded from Ochre.
Some raw images are scans of an open book with two pages. These images are split into pages 1 and 2 and saved here.
To try to improve transcription results, images are converted into black and white. Sometimes this improves results, other times the results are worse. It depends on the darkness of the pencil lines.
The notebook uses Google Colab, Google Drive and Google Sheets.
There are four sections:
- Google sheet - Get handle on it
- Download raw images
- Split images
- Contrast images
The sections are described below.
Installs gspread, authenticates and opens the fieldbook_pages worksheet.
This routine loops through each row of the spreadsheet.
The image is downloaded from its URL in column D.
Example: https://pi.lib.uchicago.edu/1001/org/ochre/f5337e52-97e4-4251-8d85-22aca943d220&load
The image is saved under a file name built from the value in column C.
Example: 1932-002-0000. "ANT_FB_" is prepended to match the convention used previously, and the extension .jpg is appended:
ANT_FB_1932-002-0000.jpg
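The naming rule and the row loop can be sketched as follows. This is a minimal sketch, not the notebook's actual code: it assumes the rows arrive as lists of strings with a header row first (the shape gspread's `get_all_values()` returns), that column C is index 2 and column D is index 3, and the names `build_file_name` and `plan_downloads` are hypothetical. The download itself is left out.

```python
PREFIX = "ANT_FB_"   # prefix named in the text above
EXTENSION = ".jpg"


def build_file_name(identifier: str) -> str:
    """Turn a column C value such as '1932-002-0000' into a file name."""
    return f"{PREFIX}{identifier.strip()}{EXTENSION}"


def plan_downloads(rows, has_header=True):
    """Return (url, file_name) pairs for every row that has both values.

    Assumption: column C (index 2) holds the identifier and
    column D (index 3) holds the image URL.
    """
    data_rows = rows[1:] if has_header else rows
    plan = []
    for row in data_rows:
        if len(row) < 4 or not row[2].strip() or not row[3].strip():
            continue  # skip rows that lack an identifier or a URL
        plan.append((row[3].strip(), build_file_name(row[2])))
    return plan
```

Each pair could then be handed to a downloader such as `urllib.request.urlretrieve(url, destination)`, with the destination inside the raw images folder.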
This routine loops through each row of the spreadsheet. Some raw images are scans of an open book with two pages. This routine splits a wide image into pages 1 and 2 and saves the files. The spreadsheet is updated with the file names of the pages in columns I and K. Narrow images are left as 1 page.
There is a form to check each image. The pages are displayed to check the results of the split.
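The split decision can be illustrated with a small helper that works on image dimensions only. This is a sketch under stated assumptions: the notebook's real rule for what counts as "wide" is not given here, so the width-to-height test, the `wide_ratio` parameter, the cut at the exact middle, and the function name `split_boxes` are all hypothetical. It also assumes page 1 is the left half.

```python
def split_boxes(width: int, height: int, wide_ratio: float = 1.0):
    """Return crop boxes as (left, upper, right, lower) for one scan.

    Assumption: a scan is a two-page spread when width / height
    exceeds wide_ratio; otherwise it is kept as a single page.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width / height <= wide_ratio:
        return [(0, 0, width, height)]       # narrow image: one page
    middle = width // 2
    return [
        (0, 0, middle, height),              # page 1 (left half, assumed)
        (middle, 0, width, height),          # page 2 (right half, assumed)
    ]
```

The boxes use the same (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so each box could be passed to it directly.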
This routine loops through each row of the spreadsheet. The process reads each .jpg file in the raw_split_pages folder. It saves a contrasted version in the contrasted_split_pages folder.
There is a form to check each image. The original page and the contrasted image are displayed to check the results of the contrast. The contrast can be re-run for a row of pages with an adjusted threshold value.
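To show why the threshold matters, here is a minimal sketch of black-and-white conversion on plain grayscale values (0 is black, 255 is white). It is an illustration only: the notebook's actual contrast routine and default threshold are not shown in this document, and the function name `binarize` and the default of 128 are assumptions.

```python
def binarize(pixels, threshold=128):
    """Map each grayscale value to black (0) or white (255).

    Values at or above the threshold become white, the rest black.
    """
    if not 0 <= threshold <= 255:
        raise ValueError("threshold must be between 0 and 255")
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in pixels
    ]


# One row: paper (250), a faint pencil stroke (150), a dark stroke (40).
page = [[250, 150, 40]]
low = binarize(page, threshold=128)    # faint stroke turns white and is lost
high = binarize(page, threshold=180)   # faint stroke turns black and is kept
```

Faint pencil lines sit close to the paper's brightness, so a threshold that works for dark writing can erase them. That is one reason a per-row re-run with a different threshold is useful.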
graph TD
D1["Get handle on spreadsheet"]-->D2["Download each image URL in column D."]
D2-->D3["Split wide images into 2 pages."]
D3-->D4["Inspect page split images"]
D4-->D5["Contrast images."]
D5-->D6["Inspect each contrasted image."]
D6-->D7{"Contrast good?"}
D7-->|Yes|D6
D7-->|No|D8["Change threshold and re-run contrast for image."]
D8-->D7
The notebook uses Google Colab, Google Drive and Google Sheets.
There are five sections:
- Google sheet - Get handle on it
- Azure Cognitive Services
- AWS
- GCP
- Write HTML files
The sections are described below.
Installs gspread, authenticates and opens the fieldbook_pages worksheet.
Each of the three services runs in the same manner.
graph TD
T1["Install and import packages"]-->T2["Load parameters from a .json file"]
T2-->T3["Store parameters such as the secret access key and endpoint URL in variables."]
T3-->T4["Define a function that calls the service to transcribe an image using its path.<br>example: def get_text_from_image_gcp(path)"]
T4-->T5["Iterate over the list of contrasted image files in a directory."]
T5-->T6["Call function to transcribe using the path of an image"]
T6-->T7["Update contrasted_pages worksheet with text returned from function."]
T7-->T5
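The shared flow in the graph above can be sketched in a service-neutral way. This is a minimal sketch, assuming the parameters live in a flat JSON file and that each service is wrapped in a function taking an image path and returning text, as `get_text_from_image_gcp(path)` does in the example. The names `load_parameters` and `transcribe_directory` are hypothetical, and the worksheet update is left to the caller.

```python
import json
from pathlib import Path


def load_parameters(json_path):
    """Read service parameters (for example a key and an endpoint) from JSON."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def transcribe_directory(image_dir, transcribe):
    """Call transcribe(path) for each .jpg in image_dir, in name order.

    Returns (file_name, text) pairs that can then be written to the
    worksheet. A failure on one image is recorded as text and does
    not stop the loop.
    """
    results = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        try:
            text = transcribe(str(path))
        except Exception as error:  # keep going after a single bad image
            text = f"ERROR: {error}"
        results.append((path.name, text))
    return results
```

Passing the service function as an argument means the same loop can be reused for Azure, AWS and GCP, with only the wrapped call changing.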
This cell loops through the rows of the contrasted_pages worksheet and saves each row as an HTML page.
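Building one such page from a row might look like the sketch below. The page layout is an assumption, as is the function name `row_to_html`; the notebook's real template is not shown here. The transcribed text is escaped so that characters such as `&` and `<` in the handwriting do not break the markup.

```python
import html


def row_to_html(title, text):
    """Build a small HTML page from a page title and its transcribed text."""
    escaped_title = html.escape(title)
    # One paragraph per non-empty line of the transcription.
    paragraphs = "\n".join(
        f"<p>{html.escape(line)}</p>"
        for line in text.splitlines()
        if line.strip()
    )
    return (
        "<!DOCTYPE html>\n<html>\n"
        f'<head><meta charset="utf-8"><title>{escaped_title}</title></head>\n'
        f"<body>\n<h1>{escaped_title}</h1>\n{paragraphs}\n</body>\n</html>\n"
    )
```

The returned string could be written with `Path(file_name).write_text(page, encoding="utf-8")`, one file per worksheet row.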