
lambretta's Introduction

lambretta

Source code and labeled dataset for Lambretta

Steps of running the code and generating intermediate files

  • Run candidate_query_generator.py. (Running claimextractor.py first would only be needed if you started from raw tweets, without the claims already extracted; that has already been done, and training_claims.csv is its output.) This step should output candidate_queries.txt and dict_claim_query.json.
  • Run fetch_results.py. Make sure to update awk_source_path (the path to the data source) so that the awk_query function works properly; if you are using Elasticsearch or another interface, update the function accordingly. Also specify awk_output_export_path, which will be written as an intermediate output file and is needed in a later step.
  • Run generate_semantic_features.py. This takes the file written at awk_output_export_path by fetch_results.py as input and outputs results_scoring.json.
  • Run export_all_features.py. This takes results_scoring.json from the previous step and writes export_ltr_train.txt and export_ltr_test.txt, the files that go into the Learning to Rank Java programs.
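Taken together, the four steps above amount to this run order (a sketch, assuming the scripts live in the repo root and training_claims.csv is already present):

```shell
# Sketch of the full pipeline; paths inside each script (awk_source_path,
# awk_output_export_path) must be configured before running.
python3 candidate_query_generator.py   # -> candidate_queries.txt, dict_claim_query.json
python3 fetch_results.py               # -> file at awk_output_export_path
python3 generate_semantic_features.py  # -> results_scoring.json
python3 export_all_features.py         # -> export_ltr_train.txt, export_ltr_test.txt
```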

Dataset

The file lambretta_dataset.json contains a dictionary of claims and, for each claim, the list of tweets discussing it, alongside their moderation status (0 or 1).

lambretta's People

Contributors

codepujan

Watchers

Gianluca Stringhini, Jay Patel, emidec

lambretta's Issues

How to generate candidates_test.txt

In fetch_results.py, line 20 opens candidates_test.txt. How do I generate this file before it is opened?

xx=open("candidates_test.txt")
search=[]
for x in xx:
    x=x.rstrip()
    search.append(x)

fetch_results.py -> cc, results, querystring are undefined

On line 43, cc is undefined. What does it stand for?

    for item in splitter:
        addedawk+='/'+item+'/'
        if cc<len(splitter):#for the last item we don't need && 
            addedawk+=' && '
    addedawk+="' "+awk_source_path+" > cmdtmp";
    cmd=awkroot+addedawk
    print("Running for query :  ",querystring," command ",cmd)
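A hedged fix for the undefined cc: track the loop position explicitly, e.g. with enumerate(), so the trailing " && " is skipped after the last keyword. Here splitter is assumed to be the query string split into keywords (the example input below is hypothetical):

```python
# Hypothetical fix sketch: enumerate() supplies the counter that `cc` was
# presumably meant to be, so the last pattern gets no trailing " && ".
splitter = ["pennsylvania", "democrat"]  # example input, not from the repo
addedawk = ""
for cc, item in enumerate(splitter, start=1):
    addedawk += '/' + item + '/'
    if cc < len(splitter):  # for the last item we don't need &&
        addedawk += ' && '
# addedawk is now "/pennsylvania/ && /democrat/"
```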

On lines 76 and 80, results and querystring are undefined.

        result=awk_query(query)
        with open(awk_output_export_path,"a+") as of:
            json.dump({"keyword":query,"data":results},of)
            of.write("\n")
    except Exception as err:
        ff=open("awkerrors_test.txt","a+")
        ff.write("quer"+querystring+" Error : "+str(err))
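Since the snippet defines result and query, the undefined names look like typos for those two. A minimal runnable sketch of the fix, with awk_query stubbed out for illustration:

```python
import json

awk_output_export_path = "awk_output_test.json"  # assumed path

def awk_query(q):
    # Stand-in for the real awk_query; returns an empty result set here.
    return []

query = "pennsylvania democrat"  # example query
try:
    result = awk_query(query)
    with open(awk_output_export_path, "a+") as of:
        json.dump({"keyword": query, "data": result}, of)  # was: results
        of.write("\n")
except Exception as err:
    with open("awkerrors_test.txt", "a+") as ff:
        ff.write("query " + query + " Error : " + str(err))  # was: querystring
```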

Reproduction:

On line 20, replace (because of #4)

xx=open("candidates_test.txt")

with

xx=open("candidate_queries.txt")

Run

python3 ./fetch_results.py

Python gives this error:

(screenshot of the error)

Python reports that results is undefined because it is never assigned anywhere on the execution path:

 json.dump({"keyword":query,"data":results},of)

Order of the files to be executed

What is the order in which the files should be executed in order to get the output?

The files in the repo are:

  1. README.md
  2. claimextractor.py
  3. candidate_query_generator.py
  4. fetch_results.py
  5. training_claims.csv
  6. export_all_features_ltr.py
  7. generate_semantic_features.py

export_all_features_ltr.py -> line 172: jsx is undefined

jsx is undefined. What is it supposed to refer to?

'''
Create TF-IDF vectorizer 
'''
#First, generate the document by accumulating all the claims 
docs=[]
for item in jsx:
    tr4w = TextRank4Keyword()
    claim=item["claim"]
    docs.append(jsx["claim"])
cv=CountVectorizer()
word_count_vector=cv.fit_transform(docs)
tfidf_transformer=TfidfTransformer(smooth_idf=True,use_idf=True)
tfidf_transformer.fit(word_count_vector)
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)
feature_names=cv.get_feature_names()

Reproduction:
Run

python3 ./export_all_features_ltr.py 

Python will report
(screenshot of the error)

Generated awk_output_test.json is not valid JSON

The generated output is of the form:

File: awk_output_test.json

{"keyword": "value", "data": null}
{"keyword": "value", "data": null}
{"keyword": "value", "data": null}
{"keyword": "value", "data": null}

which is not valid JSON as a single document (it is JSON Lines: one object per line).
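Assuming the file stays in this line-delimited form, it can be read by parsing each line separately:

```python
import json

# Two example lines in the same shape as awk_output_test.json
jsonl = '{"keyword": "value", "data": null}\n{"keyword": "value", "data": null}\n'

# Parse each non-empty line as its own JSON object
records = [json.loads(line) for line in jsonl.splitlines() if line.strip()]
# each record is a dict; a null "data" field becomes Python's None
```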

[DOUBT] awk_output_test.json data is null when the top 50 rows of the dataset are used

In order to save time during data processing, I decided to use a partial dataset, specifically the top 50 rows. However, I encountered an unexpected issue: the data field generated in awk_output_test.json is null.

Is this expected behaviour?

This is the code I used to generate the partial dataset:

import pandas as pd

df = pd.read_csv('data_source/data_source.csv', nrows=50)

df.to_csv('data_source/MOD_data_source.csv', index=False)

File: awk_output_test.json

{"keyword": "pennsylvania democrat", "data": null}
{"keyword": "democrat pre-canvass", "data": null}
{"keyword": "pre-canvass vote", "data": null}
{"keyword": "vote liberal", "data": null}
{"keyword": "liberal areas", "data": null}
{"keyword": "areas let", "data": null}
.
.
.

All data fields are null, and the file has 7087 lines.

Undefined variable `total` in `generate_semantic_features.py`

The variable total is undefined

        # Creating spanning subset
        left_join = sorted_data[0:int(0.2*len(sorted_data))]
        right_join = sorted_data[-int(0.2*len(sorted_data)):]
        mid = int(len(sorted_data)*0.5)
        # total 👇 👇👇
        mid_join = sorted_data[mid-(int(total*0.1)):mid+(int(total*0.1))]
        # Sliced data is spanning subset discussed in the paper
        sliced_data = left_join+mid+right_join
        texts = []

(screenshot of the error)
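A hedged fix, assuming total was meant to be the length of sorted_data and that the final concatenation should use mid_join rather than the integer mid:

```python
# Sketch of the presumed intent; sorted_data is a stand-in here.
sorted_data = list(range(100))

left_join = sorted_data[0:int(0.2 * len(sorted_data))]
right_join = sorted_data[-int(0.2 * len(sorted_data)):]
mid = int(len(sorted_data) * 0.5)
total = len(sorted_data)  # was undefined
mid_join = sorted_data[mid - int(total * 0.1):mid + int(total * 0.1)]
# Sliced data is the spanning subset discussed in the paper
sliced_data = left_join + mid_join + right_join  # was: left_join+mid+right_join
```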

candidate_query_generator.py -> raw_cleaned_tweet is undefined

In candidate_query_generator.py, on lines 39-40:

 raw_cleaned_claim.append(item.strip())
 cleaned_tweet=' '.join(raw_cleaned_tweet).strip()

raw_cleaned_tweet is undefined

Perhaps raw_cleaned_claim is the intended variable to use instead.

Replacing raw_cleaned_tweet with raw_cleaned_claim fixes it (correct me if I am wrong).

Reproduction:

Run

python3 ./candidate_query_generator.py

Python reports
(screenshot of the error)

In generate_semantic_features.py, if data is null it throws an exception

for x in xx:
    try:
        x = x.rstrip()
        jsx = json.loads(x)
        query = jsx["keyword"]
        query_split = query.split(" ")
        query = ' '.join(query_split)
        print("Working on ... ", query)
        data = jsx["data"]
        if len(data) == 0: # <- error on this line

Please note that the line number printed in the traceback differs from the actual line number in the repo because I added an import of traceback.

Error

object of type 'NoneType' has no len()
Working on ...  change address flags fulton
An exception occurred on line number:
Traceback (most recent call last):
  File "generate_semantic_features.py", line 88, in <module>
    if len(data) == 0:
TypeError: object of type 'NoneType' has no len()
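A small guard avoids the crash, assuming a null data field should be treated the same as an empty result list:

```python
import json

# Example record in the shape generate_semantic_features.py reads
line = '{"keyword": "change address flags fulton", "data": null}'
jsx = json.loads(line)

data = jsx["data"] or []  # treat null/None as an empty result list
if len(data) == 0:
    pass  # skip this query instead of raising TypeError
```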
