
lambretta's Introduction

lambretta

Source code and labeled dataset for Lambretta

Steps of running the code and generating intermediate files

  • Run candidate_query_generator.py. (Running claimextractor.py first would only be needed if you started from raw tweets, without the claims already extracted; that has already been done, and training_claims.csv is its output.) This step should output candidate_queries.txt and dict_claim_query.json.
  • Run fetch_results.py. Make sure to update awk_source_path (the path to the data source) so that the awk_query function works properly; if you are using Elasticsearch or another interface, update the function accordingly. Also specify awk_output_export_path, which will be written as an intermediate output file and is needed in a later step.
  • Run generate_semantic_features.py. This takes the file written at awk_output_export_path by fetch_results.py as input and outputs results_scoring.json.
  • Run export_all_features.py. This takes results_scoring.json from the previous step and writes export_ltr_train.txt and export_ltr_test.txt, the files that go into the Learning to Rank Java programs.
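Taken together, the four steps above amount to this run order (a sketch, assuming the scripts live in the repo root and training_claims.csv is already present):

```shell
# Sketch of the full pipeline; paths inside each script (awk_source_path,
# awk_output_export_path) must be configured before running.
python3 candidate_query_generator.py   # -> candidate_queries.txt, dict_claim_query.json
python3 fetch_results.py               # -> file at awk_output_export_path
python3 generate_semantic_features.py  # -> results_scoring.json
python3 export_all_features.py         # -> export_ltr_train.txt, export_ltr_test.txt
```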

Dataset

The file lambretta_dataset.json contains a dictionary of claims and, for each claim, the list of tweets discussing it, alongside their moderation status (0 or 1).

lambretta's People

Contributors

codepujan

Watchers

Gianluca Stringhini, Jay Patel, emidec

lambretta's Issues

How to generate candidates_test.txt

In fetch_results.py, line 20 opens candidates_test.txt. How do I generate this file before it is opened?

xx=open("candidates_test.txt")
search=[]
for x in xx:
    x=x.rstrip()
    search.append(x)

fetch_results.py -> cc, results, querystring are undefined

On line 43, cc is undefined. What does it stand for?

    for item in splitter:
        addedawk+='/'+item+'/'
        if cc<len(splitter):#for the last item we don't need && 
            addedawk+=' && '
    addedawk+="' "+awk_source_path+" > cmdtmp";
    cmd=awkroot+addedawk
    print("Running for query :  ",querystring," command ",cmd)
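A hedged fix for the undefined cc: track the loop position explicitly, e.g. with enumerate(), so the trailing " && " is skipped after the last keyword. Here splitter is assumed to be the query string split into keywords (the example input below is hypothetical):

```python
# Hypothetical fix sketch: enumerate() supplies the counter that `cc` was
# presumably meant to be, so the last pattern gets no trailing " && ".
splitter = ["pennsylvania", "democrat"]  # example input, not from the repo
addedawk = ""
for cc, item in enumerate(splitter, start=1):
    addedawk += '/' + item + '/'
    if cc < len(splitter):  # for the last item we don't need &&
        addedawk += ' && '
# addedawk is now "/pennsylvania/ && /democrat/"
```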

On lines 76 and 80, results and querystring are undefined.

        result=awk_query(query)
        with open(awk_output_export_path,"a+") as of:
            json.dump({"keyword":query,"data":results},of)
            of.write("\n")
    except Exception as err:
        ff=open("awkerrors_test.txt","a+")
        ff.write("quer"+querystring+" Error : "+str(err))
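Since the snippet defines result and query, the undefined names look like typos for those two. A minimal runnable sketch of the fix, with awk_query stubbed out for illustration:

```python
import json

awk_output_export_path = "awk_output_test.json"  # assumed path

def awk_query(q):
    # Stand-in for the real awk_query; returns an empty result set here.
    return []

query = "pennsylvania democrat"  # example query
try:
    result = awk_query(query)
    with open(awk_output_export_path, "a+") as of:
        json.dump({"keyword": query, "data": result}, of)  # was: results
        of.write("\n")
except Exception as err:
    with open("awkerrors_test.txt", "a+") as ff:
        ff.write("query " + query + " Error : " + str(err))  # was: querystring
```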

Reproduction:

On line 20, replace (because of #4)

xx=open("candidates_test.txt")

with

xx=open("candidate_queries.txt")

Run

python3 ./fetch_results.py

Python gives this error:

(screenshot of the error)

Python reports that results is undefined because it is never assigned anywhere on the execution path:

 json.dump({"keyword":query,"data":results},of)

Order of the files to be executed

What is the order in which the files should be executed in order to get the output?

The files in the repo are:

  1. README.md
  2. claimextractor.py
  3. candidate_query_generator.py
  4. fetch_results.py
  5. training_claims.csv
  6. export_all_features_ltr.py
  7. generate_semantic_features.py

export_all_features_ltr.py -> line 172: jsx is undefined

jsx is undefined. What is it supposed to refer to?

'''
Create TF-IDF vectorizer 
'''
#First, generate the document by accumulating all the claims 
docs=[]
for item in jsx:
    tr4w = TextRank4Keyword()
    claim=item["claim"]
    docs.append(jsx["claim"])
cv=CountVectorizer()
word_count_vector=cv.fit_transform(docs)
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
feature_names=cv.get_feature_names()

Reproduction:
Run

python3 ./export_all_features_ltr.py 

Python will report
(screenshot of the error)

Generated awk_output_test.json is not valid JSON

The generated output is of the form:

File: awk_output_test.json

{"keyword": "value", "data": null}
{"keyword": "value", "data": null}
{"keyword": "value", "data": null}
{"keyword": "value", "data": null}

which is not valid JSON as a single document (it is JSON Lines: one object per line).
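Assuming the file stays in this line-delimited form, it can be read by parsing each line separately:

```python
import json

# Two example lines in the same shape as awk_output_test.json
jsonl = '{"keyword": "value", "data": null}\n{"keyword": "value", "data": null}\n'

# Parse each non-empty line as its own JSON object
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
# each record is a dict; a null "data" field becomes Python's None
```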

[DOUBT] awk_output_test.json data is null when the top 50 rows of the dataset are used

In order to save time during data processing, I decided to use a partial dataset, specifically the top 50 rows. However, I encountered an unexpected issue: the data field generated in awk_output_test.json is null.

Is this expected behaviour?

This is the code I used to generate the partial dataset:

import pandas as pd

df = pd.read_csv('data_source/data_source.csv', nrows=50)

df.to_csv('data_source/MOD_data_source.csv', index=False)

File: awk_output_test.json

{"keyword": "pennsylvania democrat", "data": null}
{"keyword": "democrat pre-canvass", "data": null}
{"keyword": "pre-canvass vote", "data": null}
{"keyword": "vote liberal", "data": null}
{"keyword": "liberal areas", "data": null}
{"keyword": "areas let", "data": null}
.
.
.

All data fields are null, and the file has 7087 lines.

Undefined variable `total` in `generate_semantic_features.py`

The variable total is undefined

        # Creating spanning subset
        left_join = sorted_data[0:int(0.2*len(sorted_data))]
        right_join = sorted_data[-int(0.2*len(sorted_data)):]
        mid = int(len(sorted_data)*0.5)
        # total 👇 👇👇
        mid_join = sorted_data[mid-(int(total*0.1)):mid+(int(total*0.1))]
        # Sliced data is spanning subset discussed in the paper
        sliced_data = left_join+mid+right_join
        texts = []

(screenshot of the error)
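A hedged fix, assuming total was meant to be the length of sorted_data and that the final concatenation should use mid_join rather than the integer mid:

```python
# Sketch of the presumed intent; sorted_data is a stand-in here.
sorted_data = list(range(100))

left_join = sorted_data[0:int(0.2 * len(sorted_data))]
right_join = sorted_data[-int(0.2 * len(sorted_data)):]
mid = int(len(sorted_data) * 0.5)
total = len(sorted_data)  # was undefined
mid_join = sorted_data[mid - int(total * 0.1):mid + int(total * 0.1)]
# Sliced data is the spanning subset discussed in the paper
sliced_data = left_join + mid_join + right_join  # was: left_join+mid+right_join
```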

candidate_query_generator.py -> raw_cleaned_tweet is undefined

In candidate_query_generator.py, on lines 39-40:

 raw_cleaned_claim.append(item.strip())
 cleaned_tweet=' '.join(raw_cleaned_tweet).strip()

raw_cleaned_tweet is undefined

Perhaps raw_cleaned_claim is the intended variable to use instead.

Replacing raw_cleaned_tweet with raw_cleaned_claim fixes it (correct me if I am wrong).

Reproduction:

Run

python3 ./candidate_query_generator.py

Python reports
(screenshot of the error)

In generate_semantic_features.py, if data is null it throws an exception

for x in xx:
    try:
        x = x.rstrip()
        jsx = json.loads(x)
        query = jsx["keyword"]
        query_split = query.split(" ")
        query = ' '.join(query_split)
        print("Working on ... ", query)
        data = jsx["data"]
        if len(data) == 0: # <- error on this line

Please note that the line number printed in the traceback differs from the actual line number in the repo because I added an import of traceback.

Error

object of type 'NoneType' has no len()
Working on ...  change address flags fulton
An exception occurred on line number:
Traceback (most recent call last):
  File "generate_semantic_features.py", line 88, in <module>
    if len(data) == 0:
TypeError: object of type 'NoneType' has no len()
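A small guard avoids the crash, assuming a null data field should be treated the same as an empty result list:

```python
import json

# Example record in the shape generate_semantic_features.py reads
line = '{"keyword": "change address flags fulton", "data": null}'
jsx = json.loads(line)

data = jsx["data"] or []  # treat null/None as an empty result list
if len(data) == 0:
    pass  # skip this query instead of raising TypeError
```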
