xpmethod / opensyllabus
License: Other
This is the program that will sweep through the files on the server, determine what is not already in the database (based on timestamp?), and add it to MongoDB.
Texas Syllabus Repositories
Listing of public Texas universities and notes on programmatically locating their syllabi for initial crawling and long-term monitoring. Since each university has its own repository and structure, this is done at the university level.
Long-term plans
Do an initial crawl and add each discovered syllabus URL to a SQLite DB
Future crawls will follow the initial crawl, and compare each found URL to the DB
Already crawled = do nothing
New syllabus = download and add to DB
These only need to occur every 4-6 months so it's hardly onerous on servers
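The crawl-and-compare plan above can be sketched roughly as follows. This is a minimal illustration, not the project's actual crawler: the table name `syllabi`, its columns, and the function names are assumptions made for the example, and the download step is left out.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the URL database. The schema here is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS syllabi ("
        "url TEXT PRIMARY KEY, "
        "first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_new_urls(conn, found_urls):
    """Return the URLs not yet in the DB, recording them as a side effect.

    Already crawled = do nothing; new syllabus = add to DB (and, in a real
    crawler, download it).
    """
    new_urls = []
    for url in found_urls:
        cur = conn.execute(
            "INSERT OR IGNORE INTO syllabi (url) VALUES (?)", (url,)
        )
        if cur.rowcount == 1:  # the row was inserted, so the URL is new
            new_urls.append(url)
    conn.commit()
    return new_urls
```

On a later crawl, calling `record_new_urls` with everything the crawler found returns only the URLs that still need downloading.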
(source: http://www.txhighereddata.org/Interactive/Institutionsshow.cfm?Type=1&Level=1)
Angelo State University
http://www.angelo.edu
Doesn't seem centralized...
Lamar University
http://www.lamar.edu
Central repo: http://sacs.lamar.edu/opa/syllabi/public/lamarsyllabi.php
Midwestern State University
http://www.mwsu.edu
Central repo: http://www.mwsu.edu/profiles/viewcourses.asp
Prairie View A&M
http://www.pvamu.edu
Doesn't seem centralized...
Sam Houston State University
http://www.shsu.edu
Central repo: https://ww2.shsu.edu/faci10wp/
(may take work to crawl)
Stephen F. Austin State University
http://www.sfasu.edu
Doesn't seem centralized...
Sul Ross State University
http://www.sulross.edu
Central repo: http://srinfo.sulross.edu/syllabi/index.php
Tarleton State University
http://www.tarleton.edu
Central repo: http://catalog.tarleton.edu/syllabus/
Texas A&M International
http://www.tamiu.edu
Central repo: http://info.tamiu.edu/courseslist.aspx
Texas A&M
http://www.tamu.edu
http://www.tamug.edu
http://www.tamus.edu
http://www.tarleton.edu/CENTRALTEXAS/
http://www.tamuc.edu
http://www.tamucc.edu
http://www.tamuk.edu
http://www.tamuk.edu/sanantonio/
http://www.tamut.edu
http://www.tsu.edu
http://www.txstate.edu
http://www.tsus.edu
http://www.ttu.edu
http://www.texastech.edu
http://www.twu.edu
http://www.uta.edu
http://www.utexas.edu
http://www.utb.edu
http://www.utdallas.edu
http://www.utep.edu
http://www.utsa.edu
http://www.uttyler.edu
http://www.utpb.edu
http://www.utsystem.edu
http://www.panam.edu
http://www.uh.edu
http://www.uhsa.uh.edu
http://www.uhcl.edu
http://www.uhd.edu
http://www.uhv.edu
http://www.unt.edu
http://www.unt.edu/unt-dallas/
http://www.lawschool.untsystem.edu/
http://www.wtamu.edu
http://www.texaseducationinfo.org/
I was reading over the documentation for storing entire files in MongoDB and it suggests that GridFS be used for files above 16MB but that for files below 16MB, you use BinData.
Do we expect to have files above 16MB? It seems like most syllabi are going to be well below this threshold. As such, GridFS may be using a machine gun to kill a fly.
What are the downsides of using GridFS for files below 16MB? It sounds as though the files get broken up into separate pieces and therefore take longer to load. There may be other drawbacks as well.
Thoughts?
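To make the trade-off concrete, here is a small sketch that picks a storage method by file size and estimates how many pieces GridFS would split a file into. It does not talk to MongoDB. The 255 KiB chunk size is assumed to be the GridFS default, the metadata allowance is an arbitrary illustrative figure, and the function names are made up for this example.

```python
import math

MAX_BSON_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16MB limit discussed above
GRIDFS_CHUNK_BYTES = 255 * 1024  # assumed default GridFS chunk size


def gridfs_chunks(file_size_bytes):
    """Estimate how many chunk documents GridFS would use for a file."""
    return max(1, math.ceil(file_size_bytes / GRIDFS_CHUNK_BYTES))


def storage_plan(file_size_bytes, metadata_allowance_bytes=4096):
    """Pick BinData for files that fit in one document, GridFS otherwise.

    metadata_allowance_bytes is a hypothetical margin for the other fields
    (URL, timestamps, etc.) stored alongside the file in the same document.
    """
    if file_size_bytes + metadata_allowance_bytes < MAX_BSON_DOCUMENT_BYTES:
        return {"method": "bindata", "pieces": 1}
    return {"method": "gridfs", "pieces": gridfs_chunks(file_size_bytes)}
```

Under these assumptions a 2MB PDF stored in GridFS would be read back from 9 chunk documents instead of one, which is the extra work the question above is concerned with.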
99% of the files will likely be in pdf, html, or doc/docx, but there may be other file formats...
Jonah I'll explain in person
Heyman Center
Harvard (Amy Brand)
Follow up with Matthew Hart and COI @denten
should yield 500k / month
django api should talk to mongo--mongo api is not robust enough
Write up mini-paper comparing performance of various text-extractors on a document with available plaintext (possibly a particular edition of the bible).
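One possible way to score extractors for such a comparison is sketched below, using only the standard library. It treats similarity to the known plaintext as the measure of quality, which is an assumption about what the mini-paper would measure. The function names are hypothetical, and `difflib` is slow on long inputs, so this suits short sample passages rather than a whole book.

```python
import difflib


def normalize(text):
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(reference, extracted):
    """Return a 0..1 similarity ratio between reference and extracted text."""
    matcher = difflib.SequenceMatcher(
        None, normalize(reference), normalize(extracted), autojunk=False
    )
    return matcher.ratio()


def rank_extractors(reference, outputs):
    """Rank extractor outputs (name -> extracted text) from best to worst."""
    scores = [(name, similarity(reference, text)) for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```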
who has dev access?
who has keys?
Amazon Linux is CentOS-based. Moving to a Debian-based system while it is still easy. Most of the local expertise here is Debian.
We would like to demo full text search from the web side. Could we take a relatively clean subset of 100k or so documents and get that to work?
Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.
Task: Given a syllabus in .txt format, identify and extract the books, articles, or other citations that are mentioned.
Example: Given the following:
"Required Texts
Goethe, Elective Affinities, trans. Constantine (Oxford)
Stendhal, The Red and the Black, trans. Gard (Penguin)
Dostoievsky, Crime and Punishment, trans. Pevear and Volkhonsky (Vintage)
Flaubert, Sentimental Education, trans. Baldick (Penguin)
Tolstoy, Anna Karenina, trans. Pevear and Volkhonsky (Penguin)
Mann, Buddenbrooks, trans. Woods (Vintage)"
Extract the following:
citation1 = {
    "authorFirstName": None,
    "authorLastName": "Goethe",
    "type": "Book",
    "title": "Elective Affinities",
    "publisher": "Oxford",
    "translator": "Constantine",
    "publicationDate": None
}
etc.
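For lines shaped exactly like the example ("Author, Title, trans. Translator (Publisher)"), a regular expression is enough to produce these records. The sketch below is only a starting point under that assumption: real syllabi vary far more than this, the pattern and function names are made up for illustration, and every match is labelled "Book" without checking.

```python
import re

# Matches: Author, Title, trans. Translator (Publisher)
CITATION_RE = re.compile(
    r"^(?P<author>[^,]+),\s*(?P<title>.+?),\s*trans\.\s*"
    r"(?P<translator>.+?)\s*\((?P<publisher>[^)]+)\)\s*$"
)


def parse_citation(line):
    """Return a citation dict for one line, or None if it does not match."""
    match = CITATION_RE.match(line.strip().strip('"'))
    if match is None:
        return None
    return {
        "authorFirstName": None,  # the example lines give last names only
        "authorLastName": match.group("author").strip(),
        "type": "Book",  # assumed; this pattern cannot tell books from articles
        "title": match.group("title").strip(),
        "publisher": match.group("publisher").strip(),
        "translator": match.group("translator").strip(),
        "publicationDate": None,
    }


def extract_citations(text):
    """Parse every line of a .txt syllabus and keep the lines that match."""
    parsed = (parse_citation(line) for line in text.splitlines())
    return [citation for citation in parsed if citation is not None]
```

Headings such as "Required Texts" do not match the pattern and are skipped.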
UNC End of Semester Goals? TNS?
Teams submit goals lists
Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.
Task: Given a syllabus in .txt format, identify and extract the name of the college or university where the course is being taught.
There are a few pre-existing python packages for this...
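Before reaching for a package, a crude baseline might look like the sketch below, which simply searches for the first phrase shaped like "University of X" or "X University" / "X College". The pattern and function name are assumptions for illustration. It will miss many institutions (institutes, community colleges named differently) and can over-match when a capitalized word precedes the name.

```python
import re

# Hypothetical pattern: "University of Xxx Yyy", or capitalized words
# followed by "University" or "College".
INSTITUTION_RE = re.compile(
    r"\b(?:University of(?: [A-Z][\w&.'-]*)+"
    r"|(?:[A-Z][\w&.'-]* )+(?:University|College))\b"
)


def find_institution(text):
    """Return the first institution-like phrase in the text, or None."""
    match = INSTITUTION_RE.search(text)
    return match.group(0) if match else None
```

A lookup against a known list of institution names, such as the Texas universities listed in these notes, would likely be more reliable than pattern matching alone.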
Crawl goes into hopper first. Ingest script to hash, test, and write to main archive.
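A minimal sketch of that hopper-to-archive step follows. The directory layout, the choice of SHA-256, and the "test" (here only a check that the file is not empty) are all assumptions made for the example rather than the project's actual ingest script.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Hash a file in blocks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def ingest(hopper_dir, archive_dir):
    """Move files from the hopper into the archive, named by content hash.

    Returns the archive file names that were newly written.
    """
    hopper, archive = Path(hopper_dir), Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    added = []
    for src in sorted(p for p in hopper.iterdir() if p.is_file()):
        if src.stat().st_size == 0:
            continue  # placeholder test: empty files stay in the hopper
        dest = archive / (sha256_of(src) + src.suffix.lower())
        if dest.exists():
            src.unlink()  # same bytes and extension are already archived
            continue
        shutil.move(str(src), str(dest))
        added.append(dest.name)
    return added
```

Naming archive files by their hash means a document crawled twice from different URLs is stored once.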
Can someone make a scraper for #syllabus on Twitter? Lots of links to syllabi PDFs.
We are backing up the mounted volume daily. We should make sure MongoDB is being backed up as well.
Folks, who knows what this is? https://github.com/gerbal/OSP
Do we need to move any of it into our GitHub?
Note: This is one of several issues related to basic information retrieval from the syllabi. We are assuming in all cases that the extraction is from a .txt document.
Task: Given a syllabus in .txt format, identify and extract the title of the course.
People:
Brewster Kahle / Board
Craig Calhoun
Dave Weinberger / board
Gary Hall (Coventry), via TB
@matthewwillse has mentioned some friends at CUNY
I'll distribute the list on the Yale ISP list
Ask Biella Coleman to retweet?
Ask Mako Hill to retweet?
Ask Chris Csikszentmihalyi to retweet?
Ask Dan Cohen to retweet?
Anyone at NYU?
Anyone at Fordham?
Anyone at Mozilla?
Suggestions from Josh Greenberg?
Legal-oriented white paper
GitHub Things:
Toolkit of OSP - most useful Python resources
Some Sakai Thing on GitHub
research: HEOA (HEAA?) and risk management + compliance
Website wishlist gDoc URL
Interfaces:
We need a few 'interfaces' like an individual submission process. These are tricky, because our goal isn't to get single syllabi, it's to build positive relationships with people who are down with the OSP.
Add site-monitoring service
screen by file size and extension
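A screening step along these lines could be as small as the sketch below. The extension list follows the formats mentioned elsewhere in these notes (pdf, html, doc/docx, txt). The size thresholds and names are hypothetical and would need tuning against real crawl data.

```python
from pathlib import PurePosixPath

# Extensions taken from the formats named in these notes.
ALLOWED_EXTENSIONS = {".pdf", ".html", ".htm", ".doc", ".docx", ".txt"}
MIN_BYTES = 1024  # hypothetical floor: smaller files are likely stubs
MAX_BYTES = 16 * 1024 * 1024  # hypothetical ceiling


def passes_screen(filename, size_bytes):
    """Return True if a crawled file looks like a plausible syllabus."""
    extension = PurePosixPath(filename).suffix.lower()
    return extension in ALLOWED_EXTENSIONS and MIN_BYTES <= size_bytes <= MAX_BYTES
```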
Right now, volumes have to be mounted by hand each time after a server restart. Need to automate.
Work with Kyle from CiteSeer to crawl and ingest potential syllabi.