Debate Cards

System to search and download debate evidence ("cards") parsed from Word documents. Live at debate.cards.

Project Info

This project is currently being rewritten as part of an ongoing refactor and migration to TypeScript. See the development here.

The version of this project that is currently deployed at debate.cards is tagged as v1.0.0

Description

Word documents are converted to HTML using pandoc. The resulting markup is then parsed so that cards can be split up based on heading level.
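
As a rough illustration of the splitting step, the sketch below cuts pandoc-style HTML into cards at a chosen heading level. It is not the project's parser: the `Card` shape, the `splitCardsByHeading` name, the regex-based approach, and the choice of `h4` as the card-level heading are all assumptions made for the example.

```typescript
// Hypothetical shape for a parsed card; the real schema may differ.
interface Card {
  tag: string;  // plain text of the heading that opens the card
  body: string; // HTML between this heading and the next heading of the same level
}

// Split HTML into cards at every <hN> heading, where N is `level`.
// Regex-based for brevity; a real parser would more likely walk a DOM tree.
function splitCardsByHeading(html: string, level: number): Card[] {
  const heading = new RegExp(`<h${level}[^>]*>([\\s\\S]*?)</h${level}>`, "gi");
  const cards: Card[] = [];
  let current: Card | null = null;
  let cursor = 0;
  let match: RegExpExecArray | null;

  while ((match = heading.exec(html)) !== null) {
    // Everything since the previous heading belongs to the previous card.
    if (current !== null) {
      current.body = html.slice(cursor, match.index).trim();
      cards.push(current);
    }
    // Strip inline markup from the heading to get a plain-text tag.
    const rawTag = match[1] ?? "";
    current = { tag: rawTag.replace(/<[^>]+>/g, "").trim(), body: "" };
    cursor = heading.lastIndex;
  }
  // Close the final card with whatever markup remains.
  if (current !== null) {
    current.body = html.slice(cursor).trim();
    cards.push(current);
  }
  return cards;
}

const sampleHtml =
  "<h4>First tag</h4><p>Evidence one.</p>" +
  "<h4>Second <u>tag</u></h4><p>Evidence two.</p>";
const cards = splitCardsByHeading(sampleHtml, 4);
console.log(cards.map((c) => c.tag));
```

Content that appears before the first matching heading is ignored in this sketch, which may or may not match how the real parser treats document preambles.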

Parsed data is stored in a MongoDB database, which is indexed by Apache Solr.

Cards can then be converted back into Word documents via pandoc.
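
The two pandoc conversions can be sketched as argument lists. The `pandocArgs` helper and the file names below are hypothetical; only the `-f`, `-t`, and `-o` flags are standard pandoc options, and the project may invoke pandoc differently.

```typescript
// Build the argument list for a pandoc conversion.
// -f: input format, -t: output format, -o: output file.
function pandocArgs(input: string, from: string, to: string, output: string): string[] {
  return [input, "-f", from, "-t", to, "-o", output];
}

// Word -> HTML (parse step) and HTML -> Word (download step).
// File names are placeholders for illustration only.
const toHtml = pandocArgs("speech.docx", "docx", "html", "speech.html");
const toDocx = pandocArgs("card.html", "html", "docx", "card.docx");

console.log(["pandoc", ...toHtml].join(" "));
console.log(["pandoc", ...toDocx].join(" "));
```

In Node, passing such an array to `child_process.execFile("pandoc", args, callback)` avoids shell quoting problems with file names that contain spaces.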

Notes

This project is only the backend for debate.cards. The frontend website can be found here.

Requirements

Installing

Once external dependencies are installed, the project can be installed using your favorite package manager:

npm install 

or

yarn install

Deployment

Note: This isn't a full guide on how to deploy the project, just the gist of it. I might expand this section in the future. Feel free to open an issue in the meantime for questions.

A Solr core first needs to be created; the relevant config files are located in the solr-config directory. The scraping code is designed to save data to the MongoDB database. To keep data synchronized between Solr and MongoDB, use mongo-connector:

mongo-connector -m localhost:27017 -t http://localhost:8983/solr/debatecards -d solr_doc_manager&

Pandoc needs to be installed as a system dependency.

Place a .env file in the project root. A sample is provided at .env.sample.

The application should be ready to run at this point.

Populate the database through the REST API (POST /file), which will automatically add documents to the parse queue.
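
A minimal sketch of addressing that endpoint follows. Only the `POST /file` route comes from this README; the base URL, the `endpoint` helper, the multipart upload, and the form field name `file` are all assumptions, so check the route handler before relying on them.

```typescript
// Join a base URL and a path without doubling or dropping slashes.
function endpoint(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + "/" + path.replace(/^\/+/, "");
}

// Hypothetical base URL; substitute wherever your instance listens.
const uploadUrl = endpoint("http://localhost:3000/", "/file");
console.log(uploadUrl);

// Upload sketch (Node 18+ or a browser), left as a comment because the
// request body format is an assumption rather than documented behavior:
//
//   const form = new FormData();
//   form.append("file", new Blob([docxBytes]), "speech.docx");
//   await fetch(uploadUrl, { method: "POST", body: form });
```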

Original copies of all parsed documents are stored in the directory set in the environment.

Each worker in the /worker directory is designed to run as its own process, which means they can each be loaded separately if needed. This might be useful if you want to run the application across multiple machines, or if you just want to disable certain features.

A process manager like pm2 is recommended for production.
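
One way to run only selected workers under pm2 is to generate the `apps` array of a pm2 ecosystem file from a list of enabled workers. This is a sketch under assumptions: the worker names `parser` and `scraper`, the `.js` file layout inside the worker directory, and the `workerApps` helper are hypothetical. Only the `apps`, `name`, and `script` fields are standard pm2 ecosystem options.

```typescript
// Subset of the fields pm2 accepts for each entry in an ecosystem file's `apps` array.
interface Pm2App {
  name: string;
  script: string;
}

// Turn a list of enabled worker names into pm2 app entries.
function workerApps(enabled: string[], workerDir = "./worker"): Pm2App[] {
  return enabled.map((name) => ({
    name: `worker-${name}`,
    script: `${workerDir}/${name}.js`,
  }));
}

// Hypothetical worker names; list only the ones this machine should run.
const ecosystem = { apps: workerApps(["parser", "scraper"]) };
console.log(JSON.stringify(ecosystem, null, 2));
```

An object of this shape can be exported from an `ecosystem.config.js` file and started with `pm2 start ecosystem.config.js`. Leaving a worker out of the list disables that feature on the machine in question.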

Public API

API documentation coming soon!

debate-cards's People

Contributors

arvind-balaji


debate-cards's Issues

Search on website not working

FYI: Not sure this belongs here, but I have tried on several machines and SEARCH on the website does not return any results.

Questions about a potential integration

Hello, the work that has been done on this project and the DebateSum paper are really amazing, and we would like to build on it over at Arguflow!

Integration with Arguflow Vault

I explain what Vault is and our intentions for it in much more detail in this blog post, but a TL;DR is below.

Arguflow Vault is a website that allows users to create, query, and rate embeddings that consist of a link and arbitrary text content.

Some of the key features are:

  • semantic search
  • duplication prevention
  • ability to upvote/downvote
  • ability to view users' and cards' cumulative ratings

Integration Plan

We would like to create embeddings for the data that you currently have and add it to Vault's db such that it can be searched and rated. As a part of that, we will have to work out a plan for appropriate accreditation of the included evidence.

If it's not possible to get the data from the API directly, we would then like to use the scraper you have built to pull it out of the open-evidence project briefs.

Questions

  • How can we get access to the data? Currently the API seems to be down as far as we could tell from looking at the network requests made on debate.cards.
  • Does the data include the source URLs for the cards? Looking at the Prisma schema, it didn't seem like it.

Contact info

I should respond to any replies here on GitHub fairly quickly. Additionally, I would love to meet to talk about things more generally and have a cal.com meeting link here.

Unified module config system

Move configurable properties for modules into a config file.
For example, it would be nice to specify a white list of sets for the wiki module. That config could go here.
Could be used to enable/disable modules, but might be better suited as env var?

Search on website broken

FYI: Not sure this belongs here, but I have tried on several machines and SEARCH on the website does not return any results.

Containerize application

Write Dockerfile for application code.

Create docker-compose for app, Redis, Postgres, and indexer stack.

Experiment with running modules as standalone docker containers.

Search Feature broken on website

New issue? I've been using the site for a while and haven't come across this before. Check it out and let me know if it's a client-side issue, though I doubt it, since I tried on 2 different laptops and network providers.

Indexer setup

Need to decide between Elastic Search and Solr. Planning to test both.

Running from a flash drive?

I'm relatively green with programming. If I downloaded a bunch of speech docs to a flash drive, and installed all of the programs you have listed as required on the flash drive or on the computer I'd be using, would I be able to parse cards from those docs and search them?

Licensing?

Just wondering what license debate-cards falls under.
Thanks.

Updated source code?

So I'm trying to self-host this on my machine, and I've noticed some issues that appear to be caused by the code I have somehow differing from what you're running on your production servers.
Some example differences:

  • api.debate.cards returns the card text in the field fullCard, whereas mine returns it in the field card
  • My instance clears the list of searched cards upon viewing a card (happens on both my and your API backend)
  • In general I have had to make slight changes to your code to get things to work (like changing the code that generates the Solr query)

Would it be possible for you to update the code that exists (in this repo and the cardDB repo), as well as any config files that might have changed?

How to use your tool to download *all* the cards?

First of all, thank you so much for making this project.

I've previously experimented with trying to create sequence-to-sequence models which would automatically underline policy debate evidence. My issue is that I always had trouble parsing each card and getting it into a format other than .docx.

It looks like your project could seriously help me in this regard. I'm looking to download a curated list of as many debate cards as I can, specifically in a format that makes it easy for me to use Python to programmatically figure out which part of the card is underlined and which isn't. This was tough when I tried to use Python's tools for reading docx files.

Let me know if you have any advice for me.
