opensyllabus's People

Contributors

astreylabs, denten, grahamsack, jdmar3, jon-freed, jonahsmith, mgorenstein

opensyllabus's Issues

"Sweeper"

This is the program that will sweep through the files on the server, determine what is not already in the database (based on timestamp?), and add it to MongoDB.
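A minimal sketch of the sweep step, assuming the set of already-ingested paths has been loaded from the database beforehand. The function name `find_new_files` and its parameters are hypothetical, and the MongoDB insert is left to the caller.

```python
import os
from typing import Iterator, Set


def find_new_files(root: str, known_paths: Set[str], since: float = 0.0) -> Iterator[str]:
    """Yield files under `root` that are not yet known and were modified after `since`.

    `known_paths` stands in for a lookup against the database (an assumption);
    `since` is a Unix timestamp, e.g. the time of the previous sweep.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if path in known_paths:
                continue  # already in the database
            if os.path.getmtime(path) > since:
                yield path
```

The caller would insert each yielded path into the database and record the sweep time for the next run.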

Syllabi Repositories

Texas Syllabus Repositories
Listing of public Texas universities and notes on programmatically locating their syllabi for initial crawling and long-term monitoring. Since each university has its own repository and structure, this is done on the university level.
Long-term plans

Do an initial crawl and add each discovered syllabus URL to a SQLite DB

Future crawls will follow the initial crawl, and compare each found URL to the DB

Already crawled = do nothing

New syllabus = download and add to DB

These only need to occur every 4-6 months so it's hardly onerous on servers
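The crawl-and-compare plan above can be sketched with Python's built-in `sqlite3` module. The table name `syllabi`, its columns, and the function names are assumptions for illustration only.

```python
import sqlite3


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the URL database; the schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS syllabi ("
        "url TEXT PRIMARY KEY, "
        "first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_if_new(conn: sqlite3.Connection, url: str) -> bool:
    """Return True if `url` was not yet in the DB (and is now recorded)."""
    # INSERT OR IGNORE skips the row when the primary key already exists.
    cur = conn.execute("INSERT OR IGNORE INTO syllabi (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1
```

A crawler would call `record_if_new` for each discovered URL and download the syllabus only when it returns `True`.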

(source: http://www.txhighereddata.org/Interactive/Institutionsshow.cfm?Type=1&Level=1)
Angelo State University
http://www.angelo.edu
Doesn't seem centralized...
Lamar University
http://www.lamar.edu
Central repo: http://sacs.lamar.edu/opa/syllabi/public/lamarsyllabi.php
Midwestern State University
http://www.mwsu.edu
Central repo: http://www.mwsu.edu/profiles/viewcourses.asp
Prairie View A&M
http://www.pvamu.edu
Doesn't seem centralized...
Sam Houston State University
http://www.shsu.edu
Central repo: https://ww2.shsu.edu/faci10wp/
(may take work to crawl)
Stephen F. Austin State University
http://www.sfasu.edu
Doesn't seem centralized...
Sul Ross State University
http://www.sulross.edu
Central repo: http://srinfo.sulross.edu/syllabi/index.php
Tarleton State University
http://www.tarleton.edu
Central repo: http://catalog.tarleton.edu/syllabus/
Texas A&M International
http://www.tamiu.edu
Central repo: http://info.tamiu.edu/courseslist.aspx
Texas A&M
http://www.tamu.edu
http://www.tamug.edu
http://www.tamus.edu
http://www.tarleton.edu/CENTRALTEXAS/
http://www.tamuc.edu
http://www.tamucc.edu
http://www.tamuk.edu
http://www.tamuk.edu/sanantonio/
http://www.tamut.edu
http://www.tsu.edu
http://www.txstate.edu
http://www.tsus.edu
http://www.ttu.edu
http://www.texastech.edu
http://www.twu.edu
http://www.uta.edu
http://www.utexas.edu
http://www.utb.edu
http://www.utdallas.edu
http://www.utep.edu
http://www.utsa.edu
http://www.uttyler.edu
http://www.utpb.edu
http://www.utsystem.edu
http://www.panam.edu
http://www.uh.edu
http://www.uhsa.uh.edu
http://www.uhcl.edu
http://www.uhd.edu
http://www.uhv.edu
http://www.unt.edu
http://www.unt.edu/unt-dallas/
http://www.lawschool.untsystem.edu/
http://www.wtamu.edu
http://www.texaseducationinfo.org/

Question: GridFS vs. BinData -- Depends on File Size

I was reading the MongoDB documentation on storing entire files, and it suggests using GridFS for files above 16MB and BinData for files below 16MB.

Do we expect to have files above 16MB? It seems like most syllabi are going to be well below this threshold. As such, GridFS may be using a machine gun to kill a fly.

What are the downsides of using GridFS for files below 16MB? It sounds as though the files get broken up into separate pieces and therefore take longer to load. There may be other drawbacks as well.

Thoughts?
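One way to frame the choice is to decide per file by size. This is only a sketch: the function name and the headroom value are hypothetical, and the 16MB figure is the limit mentioned above. It does not talk to MongoDB at all.

```python
BSON_MAX_BYTES = 16 * 1024 * 1024  # the 16MB document limit discussed above
HEADROOM_BYTES = 64 * 1024         # assumed allowance for the other fields in the document


def choose_storage(file_size: int) -> str:
    """Return "bindata" for files that fit in one document, else "gridfs"."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    if file_size <= BSON_MAX_BYTES - HEADROOM_BYTES:
        return "bindata"
    return "gridfs"
```

With this split, typical syllabi would be stored inline and only unusually large files would go through GridFS.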

Text Extraction: other --> .txt

99% of the files will likely be in pdf, html, or doc/docx, but there may be other file formats...

  • rtf
  • xml
  • Apple Pages, etc.

Compare Python PDF extraction libraries with sample files

Write up a mini-paper comparing the performance of various text extractors on a document with available plaintext (possibly a particular edition of the Bible).

  • Find popular samples with clean and accurate plain text.
    • For each sample, find versions of varying quality.
      • One version should be a PDF I generate from the plain text.
      • One version should be a rich PDF where text is queryable.
      • One version should require OCR.
    • Trim extra info like page number and Project Gutenberg header, if possible.
      • An approach could be to locate the first and last sentence of the text, consider only between these.
      • Or, just leave it be? All extractors will pick up this noise and it's expected in typical use cases.
  • Determine which PDF extractor libraries to test.
    • Definitely PDFMiner since I've already reverse-engineered it programmatically.
  • Determine the measures of extraction accuracy.
    • I've used various measures of string difference in the FuzzyWuzzy library with some success.
    • Measure speed.
    • Many of these libraries do layout analysis--is this helpful or not? Surely has an effect on speed.
  • Run the conversions and calculate accuracy.
    • Create a testing suite that determines setting regimes with uniformly better accuracy and use those settings as the benchmark for a particular library.
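The accuracy and speed measures above could be prototyped as follows. This sketch uses the standard library's `difflib` as a stand-in for the FuzzyWuzzy measures, and `extractor` is a hypothetical callable that wraps whichever library is under test and returns the extracted text for a given path.

```python
import difflib
import time
from typing import Callable, Tuple


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are not penalized."""
    return " ".join(text.split())


def score_extraction(reference: str, extracted: str) -> float:
    """Similarity in [0, 1] between the reference plaintext and the extracted text."""
    matcher = difflib.SequenceMatcher(
        None, normalize(reference), normalize(extracted), autojunk=False
    )
    return matcher.ratio()


def benchmark(extractor: Callable[[str], str], pdf_path: str, reference: str) -> Tuple[float, float]:
    """Run one extractor on one file; return (accuracy, seconds elapsed)."""
    start = time.perf_counter()
    extracted = extractor(pdf_path)
    elapsed = time.perf_counter() - start
    return score_extraction(reference, extracted), elapsed
```

`SequenceMatcher` can be slow on very long texts, so book-length samples may need to be compared in chunks.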

smarter fs

hashlog

files should be binary hashes with a log manifest containing contextual information like

url, date of creation, date of capture, ingested flag, hash.

The directory should be: 14/45/.......pdf

use a sniffer to check file type?
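A rough sketch of the layout described above, under several assumptions: SHA-256 as the hash, the first two pairs of hex characters as the directory levels, and a few well-known magic numbers for sniffing. All function names are hypothetical.

```python
import hashlib


def sharded_path(data: bytes, ext: str) -> str:
    """Build a relative path like 14/45/<full hash><ext> from the content hash."""
    digest = hashlib.sha256(data).hexdigest()
    return "/".join([digest[:2], digest[2:4], digest + ext])


def sniff_type(data: bytes) -> str:
    """Guess the file type from its leading bytes instead of trusting the URL."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "zip"  # docx files are zip containers
    if data.startswith(b"{\\rtf"):
        return "rtf"
    return "unknown"


def manifest_entry(data: bytes, url: str, created: str, captured: str) -> dict:
    """One manifest record with the contextual fields listed above."""
    return {
        "url": url,
        "created": created,
        "captured": captured,
        "ingested": False,
        "hash": hashlib.sha256(data).hexdigest(),
    }
```

Each entry could be appended to the log manifest as one JSON line, so the manifest stays append-only.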

Extract Citations From Course Syllabus

Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.

Task: Given a syllabus in .txt format, identify and extract the books, articles, or other citations that are mentioned.

Example: Given the following:

"Required Texts
Goethe, Elective Affinities, trans. Constantine (Oxford)
Stendhal, The Red and the Black, trans. Gard (Penguin)
Dostoievsky, Crime and Punishment, trans. Pevear and Volkhonsky (Vintage)
Flaubert, Sentimental Education, trans. Baldick (Penguin)
Tolstoy, Anna Karenina, trans. Pevear and Volkhonsky (Penguin)
Mann, Buddenbrooks, trans. Woods (Vintage)"

Extract the following:

citation1 = {
"authorFirstName": None,
"authorLastName": "Goethe",
"type": "Book",
"title": "Elective Affinities",
"publisher": "Oxford",
"translator": "Constantine",
"publicationDate": None
}

etc.
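For lines that follow the exact pattern in the example ("Author, Title, trans. Translator (Publisher)"), a regular expression is enough to get started. This is a sketch for that one pattern only. Real syllabi vary widely, and `parse_citation` is a hypothetical name.

```python
import re
from typing import Optional

# Matches "Author, Title, trans. Translator (Publisher)" and nothing else.
_CITATION = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<title>.+?),\s*trans\.\s*"
    r"(?P<translator>.+?)\s*\((?P<publisher>[^)]+)\)\s*$"
)


def parse_citation(line: str) -> Optional[dict]:
    """Return a citation dict for a matching line, or None otherwise."""
    match = _CITATION.match(line.strip().strip('"'))
    if match is None:
        return None
    return {
        "authorFirstName": None,  # not present in this pattern
        "authorLastName": match.group("author").strip(),
        "type": "Book",  # assumed; the pattern cannot tell books from articles
        "title": match.group("title").strip(),
        "publisher": match.group("publisher").strip(),
        "translator": match.group("translator").strip(),
        "publicationDate": None,
    }
```

Lines such as the "Required Texts" header return `None`, so the function can be mapped over every line of the .txt file and the `None` results dropped.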

Extract University From Course Syllabus

Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.

Task: Given a syllabus in .txt format, identify and extract the name of the college or university where the course is being taught.
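A first-pass heuristic might simply look for capitalized phrases of the form "University of X" or "X University" / "X College". This sketch is an assumption about how to begin, not a tested approach: it misses names without those keywords, and it stops at lowercase words, so "University of Texas at Austin" comes back as "University of Texas".

```python
import re
from typing import Optional

# A capitalized word, allowing initials ("F.") and names like "A&M".
_NAME = r"[A-Z][A-Za-z&.'-]*"
_UNIVERSITY = re.compile(
    rf"\b(?:University of {_NAME}(?: {_NAME})*"
    rf"|{_NAME}(?: {_NAME})* (?:University|College)\b(?! of\b))"
)


def extract_university(text: str) -> Optional[str]:
    """Return the first phrase that looks like an institution name, or None."""
    match = _UNIVERSITY.search(text)
    return match.group(0) if match else None
```

Returning the first match assumes the institution appears near the top of the syllabus, before any publishers such as university presses.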

Twitter scraper

Can someone make a scraper for #syllabus on Twitter? Lots of links to syllabi PDFs.

Extract Title From Course Syllabus

Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.

Task: Given a syllabus in .txt format, identify and extract the title of the course.

Wish List

People:
Brewster Kahle / Board
Craig Calhoun
Dave Weinberger / board
Gary Hall (Coventry), via TB

@matthewwillse has mentioned some friends at CUNY
I'll distribute the list on the Yale ISP list
Ask Biella Coleman to retweet?
Ask Mako Hill to retweet?
Ask Chris Csikszentmihalyi to retweet?
Ask Dan Cohen to retweet?
Anyone at NYU?
Anyone at Fordham?
Anyone at Mozilla?
Suggestions from Josh Greenberg?

Legal-oriented white paper

GitHub Things:
Toolkit of OSP - most useful Python resources
Some Sakai Thing on GitHub

research: HEOA (HEAA?) and risk management + compliance

Website

Website wishlist gDoc URL

Interfaces:
We need a few 'interfaces' like an individual submission process. These are tricky, because our goal isn't to get single syllabi, it's to build positive relationships with people who are down with the OSP.

Add site-monitoring service

Mount drives on startup

Right now, volumes have to be mounted by hand after each server restart. We need to automate this.

homepage scraper

Work with Kyle from CiteSeer to crawl and ingest potential syllabi.
