xpmethod / opensyllabus Goto Github PK


License: Other

Python 98.76% Java 1.24%

opensyllabus's People

martindale alexduryee ehenneken tissue jon-freed cosmicbboy jonathanreeve sandeep-jay mjlavin80 zer0kerbal

opensyllabus's Issues

"Sweeper"

This is the program that will sweep through the files on the server, determine what is not already in the database (based on timestamp?), and add it to MongoDB.
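The timestamp comparison could be sketched roughly as follows. This assumes that file modification time is the timestamp in question; the helper name `find_new_files` and the `last_sweep` cutoff are hypothetical, and the MongoDB insert itself is left out.

```python
import os


def find_new_files(root, last_sweep):
    """Return paths under root modified after last_sweep (epoch seconds)."""
    new_files = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Assumption: modification time stands in for "timestamp".
            if os.path.getmtime(path) > last_sweep:
                new_files.append(path)
    return sorted(new_files)
```

The caller would store the time of each sweep and pass it in as `last_sweep` on the next run.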

Text Extraction: .doc / .docx --> .txt

Texas Syllabus Repositories
Listing of public Texas universities and notes on programmatically locating their syllabi for initial crawling and long-term monitoring. Since each university has its own repository and structure, this is done on the university level.
Long-term plans

Do an initial crawl and add each discovered syllabus URL to a SQLite DB

Future crawls will follow the initial crawl, and compare each found URL to the DB

Already crawled = do nothing

New syllabus = download and add to DB

These only need to occur every 4-6 months so it's hardly onerous on servers
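A minimal sketch of the "already crawled vs. new" check, using Python's built-in `sqlite3` module. The table name `syllabi`, its columns, and the function names are assumptions for illustration, not an agreed schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the URL database; the schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS syllabi ("
        "url TEXT PRIMARY KEY, first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_url(conn, url):
    """Return True if url is new (caller should download it), else False."""
    cur = conn.execute("INSERT OR IGNORE INTO syllabi (url) VALUES (?)", (url,))
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when it already existed.
    return cur.rowcount == 1
```

A crawler would call `record_url` for every discovered link and download only when it returns `True`.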

(source: http://www.txhighereddata.org/Interactive/Institutionsshow.cfm?Type=1&Level=1)
Angelo State University
http://www.angelo.edu
Doesn't seem centralized...
Lamar University
http://www.lamar.edu
Central repo: http://sacs.lamar.edu/opa/syllabi/public/lamarsyllabi.php
Midwestern State University
http://www.mwsu.edu
Central repo: http://www.mwsu.edu/profiles/viewcourses.asp
Prairie View A&M
http://www.pvamu.edu
Doesn't seem centralized...
Sam Houston State University
http://www.shsu.edu
Central repo: https://ww2.shsu.edu/faci10wp/
(may take work to crawl)
Stephen F. Austin State University
http://www.sfasu.edu
Doesn't seem centralized...
Sul Ross State University
http://www.sulross.edu
Central repo: http://srinfo.sulross.edu/syllabi/index.php
Tarleton State University
http://www.tarleton.edu
Central repo: http://catalog.tarleton.edu/syllabus/
Texas A&M International
http://www.tamiu.edu
Central repo: http://info.tamiu.edu/courseslist.aspx
Texas A&M
http://www.tamu.edu
http://www.tamug.edu
http://www.tamus.edu
http://www.tarleton.edu/CENTRALTEXAS/
http://www.tamuc.edu
http://www.tamucc.edu
http://www.tamuk.edu
http://www.tamuk.edu/sanantonio/
http://www.tamut.edu
http://www.tsu.edu
http://www.txstate.edu
http://www.tsus.edu
http://www.ttu.edu
http://www.texastech.edu
http://www.twu.edu
http://www.uta.edu
http://www.utexas.edu
http://www.utb.edu
http://www.utdallas.edu
http://www.utep.edu
http://www.utsa.edu
http://www.uttyler.edu
http://www.utpb.edu
http://www.utsystem.edu
http://www.panam.edu
http://www.uh.edu
http://www.uhsa.uh.edu
http://www.uhcl.edu
http://www.uhd.edu
http://www.uhv.edu
http://www.unt.edu
http://www.unt.edu/unt-dallas/
http://www.lawschool.untsystem.edu/
http://www.wtamu.edu
http://www.texaseducationinfo.org/

write a white paper on how to author a syllabus in a most useful, open way

plain text
cc license
yaml tags?

need someone to start documentation in the wiki

Question: GridFS vs. BinData -- Depends on File Size

I was reading over the documentation for storing entire files in MongoDB and it suggests that GridFS be used for files above 16MB but that for files below 16MB, you use BinData.

Do we expect to have files above 16MB? It seems like most syllabi are going to be well below this threshold. As such, GridFS may be using a machine gun to kill a fly.

What are the downsides of using GridFS for files below 16MB? It sounds as though the files get broken up into separate pieces and therefore take longer to load. There may be other drawbacks as well.

Thoughts?
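One option is to decide per file. The sketch below assumes we leave some headroom under MongoDB's 16 MB BSON document limit for the metadata stored alongside the bytes; the headroom value and the function name `choose_storage` are hypothetical.

```python
BSON_DOCUMENT_LIMIT = 16 * 1024 * 1024  # MongoDB's maximum BSON document size

# Assumption: room reserved for the other fields in the same document.
METADATA_HEADROOM = 64 * 1024


def choose_storage(file_size_bytes):
    """Return 'bindata' for files that fit in one document, else 'gridfs'."""
    if file_size_bytes + METADATA_HEADROOM < BSON_DOCUMENT_LIMIT:
        return "bindata"
    return "gridfs"
```

With this approach nearly all syllabi would be stored inline, and GridFS would only be used for the rare oversized file.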

Code to push new file to MongoDB

Text Extraction: other --> .txt

99% of the files will likely be in pdf, html, or doc/docx, but there may be other file formats...

rtf
xml
Apple Pages, etc.

logging queries

twitter submissions should collect into the hopper

Jonah I'll explain in person

Collaborators to push on

Heyman Center
Harvard (Amy Brand)
Follow up with Matthew Hart and COI @denten

bing scrape

should yield 500k / month

create an opt out page

amazon firewall except for api access

switch to django api

django api should talk to mongo--mongo api is not robust enough

Automate daily / weekly backups

Compare Python PDF extraction libraries with sample files

Write up mini-paper comparing performance of various text-extractors on a document with available plaintext (possibly a particular edition of the bible).

Find popular samples with clean and accurate plain text.
- For each sample, find versions of varying quality.
  - One version should be a PDF I generate from the plain text.
  - One version should be a rich PDF where text is queryable.
  - One version should require OCR.
- Trim extra info like page number and Project Gutenberg header, if possible.
  - An approach could be to locate the first and last sentence of the text, consider only between these.
  - Or, just leave it be? All extractors will pick up this noise and it's expected in typical use cases.
Determine which PDF extractor libraries to test.
- Definitely PDFMiner since I've already reverse-engineered it programmatically.
Determine the measures of extraction accuracy.
- I've used various measures of string difference in the FuzzyWuzzy library with some success.
- Measure speed.
- Many of these libraries do layout analysis--is this helpful or not? Surely has an effect on speed.
Run the conversions and calculate accuracy.
- Create a testing suite that determines setting regimes with uniformly better accuracy and use those settings as the benchmark for a particular library.
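The accuracy and speed measurements could be sketched as below. This uses the standard library's `difflib` as a stand-in for the FuzzyWuzzy measures mentioned above, and `extract` is a hypothetical callable that wraps whichever library is under test.

```python
import difflib
import time


def score_extraction(extract, source, reference_text):
    """Run extract(source) and return (similarity, seconds)."""
    start = time.perf_counter()
    extracted = extract(source)
    elapsed = time.perf_counter() - start
    # Normalise whitespace so layout differences are not counted as errors.
    a = " ".join(reference_text.split())
    b = " ".join(extracted.split())
    similarity = difflib.SequenceMatcher(None, a, b).ratio()
    return similarity, elapsed
```

Running this for each library and each sample version (generated PDF, rich PDF, OCR) would give a comparable table of similarity and time.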

start a DB of users

who has dev access?
who has keys?

Move to Debian from Amazon Linux

Amazon Linux is CentOS-based. We should move to a Debian-based system while it is still easy; most of the local expertise here is Debian.

crawling ingestor metadata log file

source url, date of creation, date of capture, flag?, hash

demo of full text search on a subset of the corpus

We would like to demo full text search from the web side. Could we take a relatively clean subset of 100k or so documents and get that to work?

Extract or Identify the Department of Course Syllabus

smarter fs

files should be binary hashes with a log manifest containing contextual information like

url, date of creation, date of capture, ingested flag, hash.

The directory should be: 14/45/.......pdf
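A rough sketch of the hashed layout and a manifest record. SHA-1 is an assumption (the issue does not name a hash function), as are the function names and the JSON field names.

```python
import hashlib
import json


def archive_path(data, extension):
    """Derive a sharded path such as '14/45/1445....pdf' from file bytes."""
    digest = hashlib.sha1(data).hexdigest()  # assumption: SHA-1 as the hash
    return "{}/{}/{}.{}".format(digest[:2], digest[2:4], digest, extension)


def manifest_line(data, url, created, captured, ingested=False):
    """Build one JSON manifest record with the contextual fields listed above."""
    record = {
        "url": url,
        "date_of_creation": created,
        "date_of_capture": captured,
        "ingested": ingested,
        "hash": hashlib.sha1(data).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)
```

Using the first two byte pairs of the hash as directories keeps any single directory from growing too large.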

use a sniffer to check file type?
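A sniffer could check the leading bytes of each file instead of trusting its extension. This is a minimal sketch covering only a few formats; the labels and the function name `sniff_type` are hypothetical.

```python
# Well-known leading byte signatures for a few common formats.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip"),  # zip container, used by docx among others
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy MS Office, e.g. doc
    (b"{\\rtf", "rtf"),
]


def sniff_type(head):
    """Guess a file type from the first bytes of its content."""
    for magic, label in MAGIC_NUMBERS:
        if head.startswith(magic):
            return label
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"
```

Reading the first few dozen bytes of each file is enough for these checks.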

Add users from Ted, fix permissions

Extract Citations From Course Syllabus

Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.

Task: Given a syllabus in .txt format, identify and extract the books, articles, or other citations that are mentioned.

Example: Given the following:

"Required Texts
Goethe, Elective Affinities, trans. Constantine (Oxford)
Stendhal, The Red and the Black, trans. Gard (Penguin)
Dostoievsky, Crime and Punishment, trans. Pevear and Volkhonsky (Vintage)
Flaubert, Sentimental Education, trans. Baldick (Penguin)
Tolstoy, Anna Karenina, trans. Pevear and Volkhonsky (Penguin)
Mann, Buddenbrooks, trans. Woods (Vintage)"

Extract the following:

citation1 = {
"authorFirstName": None,
"authorLastName": "Goethe",
"type": "Book",
"title": "Elective Affinities",
"publisher": "Oxford",
"translator": "Constantine",
"publicationDate": None
}

etc.
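For lines shaped exactly like the example ("Author, Title, trans. Translator (Publisher)"), a regular expression is enough for a first pass. This is only a sketch for that one layout: the pattern, the function name `parse_citation`, and the hard-coded `"Book"` type are assumptions, and real syllabi will need something far more robust.

```python
import re

# Matches: Author, Title[, trans. Translator] (Publisher)
CITATION_RE = re.compile(
    r"^(?P<author>[^,]+),\s*"
    r"(?P<title>.+?)"
    r"(?:,\s*trans\.\s*(?P<translator>[^()]+?))?"
    r"\s*\((?P<publisher>[^()]+)\)\s*$"
)


def parse_citation(line):
    """Return a citation dict for a matching line, or None otherwise."""
    match = CITATION_RE.match(line.strip())
    if match is None:
        return None
    return {
        "authorFirstName": None,
        "authorLastName": match.group("author").strip(),
        "type": "Book",  # assumption: this layout is only used for books
        "title": match.group("title").strip(),
        "publisher": match.group("publisher").strip(),
        "translator": match.group("translator"),
        "publicationDate": None,
    }
```

Lines that do not fit the layout, such as the "Required Texts" heading, return `None` and can be skipped.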

tweet back once ingested

Remove keys on login

citation detection against citeseer / DPCP

Search for each citation in the citeseer DB (20mil) through SOLR to find a list of potential documents that contain the citation.

End of Semester Goals

UNC End of Semester Goals? TNS?
Teams submit goals lists

Extract University From Course Syllabus

Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.

Task: Given a syllabus in .txt format, identify and extract the name of the college or university where the course is being taught.
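A naive starting point might be a pattern match near the top of the document. The sketch below is a heuristic only: the regular expression, the 30-line window, and the function name `guess_institution` are all assumptions, and it will miss abbreviations and institutions not named "University" or "College".

```python
import re

# "University of X ..." or "X ... University" / "X ... College"
INSTITUTION_RE = re.compile(
    r"\b(?:University of(?: [A-Z][\w&.'-]*)+"
    r"|(?:[A-Z][\w&.'-]* )+(?:University|College))\b"
)


def guess_institution(text, max_lines=30):
    """Return the first institution-like phrase in the opening lines, or None."""
    for line in text.splitlines()[:max_lines]:
        match = INSTITUTION_RE.search(line)
        if match:
            return match.group(0)
    return None
```

A list of known institution names would likely do better than pattern matching alone.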

add staging server

Prod: Code

Text Extraction: pdf --> txt

There are a few pre-existing python packages for this...

pypdf
slate
pdfminer

move Mike's corpus from the old server to the new one

look into version controlling the database as a whole

Text Extraction: html --> txt

add solr

implement a hopper for ingestion

Crawl goes into hopper first. Ingest script to hash, test, and write to main archive.
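The ingest step could be sketched roughly as below. It assumes a flat archive directory, SHA-1 for the hash, and a "test" that only rejects empty files; the function name `ingest_hopper` is hypothetical and the real checks may differ.

```python
import hashlib
import os
import shutil


def ingest_hopper(hopper_dir, archive_dir):
    """Move each non-empty hopper file into archive_dir under its hash name."""
    ingested = []
    for name in sorted(os.listdir(hopper_dir)):
        src = os.path.join(hopper_dir, name)
        if not os.path.isfile(src):
            continue
        with open(src, "rb") as handle:
            data = handle.read()
        # "Test" step (assumption): reject empty files and leave them in place.
        if not data:
            continue
        digest = hashlib.sha1(data).hexdigest()
        ext = os.path.splitext(name)[1]
        dest = os.path.join(archive_dir, digest + ext)
        if os.path.exists(dest):
            os.remove(src)  # duplicate content, already archived
        else:
            shutil.move(src, dest)
            ingested.append(dest)
    return ingested
```

Because files are named by content hash, the same syllabus crawled twice is stored only once.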

Twitter scraper

Can someone make a scraper for #syllabus on Twitter? Lots of links to syllabi PDFs.

move MongoDB to the 500gb block

We are backing up the mounted volume daily. We should make sure MongoDB is being backed up as well.

split github into repos

sanitize names

Gerbal OSP?

Folks, who knows what this is? https://github.com/gerbal/OSP
Do we need to move any of it into our GitHub?

Extract Title From Course Syllabus

Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.

Task: Given a syllabus in .txt format, identify and extract the title of the course.

Wish List

People:
Brewster Kahle / Board
Craig Calhoun
Dave Weinberger / board
Gary Hall (Coventry), via TB

@matthewwillse has mentioned some friends at CUNY
I'll distribute the list on the Yale ISP list
Ask Biella Coleman to retweet?
Ask Mako Hill to retweet?
Ask Chris Csikszentmihalyi to retweet?
Ask Dan Cohen to retweet?
Anyone at NYU?
Anyone at Fordham?
Anyone at Mozilla?
Suggestions from Josh Greenberg?

Legal-oriented white paper

GitHub Things:
Toolkit of OSP - most useful Python resources
Some Sakai Thing on GitHub

research: HEOA (HEAA?) and risk management + compliance

Website

Website wishlist gDoc URL

Interfaces:
We need a few 'interfaces' like an individual submission process. These are tricky, because our goal isn't to get single syllabi, it's to build positive relationships with people who are down with the OSP.

Add site-monitoring service
