mumin-dataset / mumin-build
Seamlessly build the MuMiN dataset.
License: MIT License
Hi,
first of all, let me say that I like your work very much. The MuMiN project is very interesting and well organized, and I have some idea of how much effort must have been required to build it.
I started using your scripts to download and compile the dataset and, as you mentioned, setting all the flags slows down the overall process. A pool of parallel processes would help speed things up, and Python libraries such as multiprocessing and joblib make that not too complicated.
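A rough sketch of the kind of parallelism I have in mind, using only the standard library. Every name here is illustrative: `rehydrate_batch` is a stand-in for whatever wraps the actual Twitter lookup, not mumin-build's real API.

```python
from concurrent.futures import ThreadPoolExecutor

def rehydrate_batch(ids):
    # Placeholder for the real Twitter API lookup; here we just echo the IDs.
    return [{"tweet_id": i} for i in ids]

def parallel_rehydrate(tweet_ids, batch_size=100, max_workers=8):
    # Split the IDs into API-sized batches and rehydrate them concurrently.
    # Threads (rather than processes) are enough here because the work is
    # I/O-bound, and they avoid pickling issues with the results.
    batches = [tweet_ids[i:i + batch_size]
               for i in range(0, len(tweet_ids), batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves batch order, so results stay aligned with input.
        for batch_result in pool.map(rehydrate_batch, batches):
            results.extend(batch_result)
    return results
```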
But the most urgent change I would like to suggest is some kind of intermediate checkpoint mechanism during the compilation phase. I ran into a technical issue and lost many days of tweet rehydration. This feature would be very much appreciated.
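For illustration, a minimal checkpoint pattern of the kind being suggested. The file name, record shape, and `rehydrate_one` callback are all hypothetical, not something mumin-build implements:

```python
import json
from pathlib import Path

CHECKPOINT = Path("rehydration_checkpoint.jsonl")

def load_done_ids():
    # IDs already rehydrated by a previous (possibly interrupted) run.
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open() as f:
        return {json.loads(line)["tweet_id"] for line in f}

def checkpointed_rehydrate(tweet_ids, rehydrate_one):
    # Skip IDs we already have, appending each new result immediately,
    # so an interrupted run loses at most one tweet's worth of work.
    done = load_done_ids()
    with CHECKPOINT.open("a") as f:
        for tid in tweet_ids:
            if tid in done:
                continue
            record = rehydrate_one(tid)
            f.write(json.dumps(record) + "\n")
```

Append-only JSON Lines keeps each checkpoint write atomic enough in practice, and resuming is just re-reading the file.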
Thanks again for your work and research.
F
Hi, I tried to compile the dataset according to the getting-started guidelines,
but the download step does not seem to work: I get a 404 error when downloading the dataset.
dataset = MuminDataset(twitter_bearer_token=twitter_bearer_token, size='small')
dataset
MuminDataset(size=small, rehydrated=False, compiled=False, bearer_token_available=True)
dataset.compile()
2024-02-23 08:56:22,985 [INFO] Downloading dataset
Traceback (most recent call last):
File "", line 1, in
File "/home/danni/anaconda3/envs/mumin/lib/python3.8/site-packages/mumin/dataset.py", line 259, in compile
self._download(overwrite=overwrite)
File "/home/danni/anaconda3/envs/mumin/lib/python3.8/site-packages/mumin/dataset.py", line 346, in _download
raise RuntimeError(f"[{response.status_code}] {response.content!r}")
RuntimeError: [404] b'not found'
Then I tried to open the link "https://data.bris.ac.uk/datasets/23yv276we2mll25f" from dataset.py, line 115 (download_url: str = ('https://data.bris.ac.uk/datasets/23yv276we2mll25f' 'jakkfim2ml/23yv276we2mll25fjakkfim2ml.zip')). I also got "not found" on the browser page.
Could you please check whether the link still works, or give some advice on how I can solve this problem?
Thanks!
Hi,
thanks for your work. I'm interested in the text of the claims, not only embeddings or keywords. Is it possible to obtain it in any way?
Hi,
thanks for open-sourcing this; your work is very interesting and deserves to be explored further.
Could you consider adding a usage example of PyG (PyTorch Geometric) to your code? That would make a lot of sense to me.
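To second this: the nodes/rels dictionaries seem to map quite directly onto PyG's HeteroData. A sketch of the conversion, using numpy only so it runs without PyG installed; the PyG-specific part is left as comments, and the helper itself is my own illustration, not part of mumin-build:

```python
import numpy as np
import pandas as pd

def rels_to_edge_index(rel_df: pd.DataFrame) -> np.ndarray:
    # PyG expects edge_index as a (2, num_edges) integer array of
    # [source_row_ids; target_row_ids], which is exactly what the
    # src/tgt columns of a relation DataFrame contain.
    return np.stack([rel_df["src"].to_numpy(), rel_df["tgt"].to_numpy()])

# With torch and torch_geometric installed, the rest would look roughly like:
#
#   from torch_geometric.data import HeteroData
#   data = HeteroData()
#   data["claim"].x = torch.tensor(claim_embeddings)
#   for (src_type, rel, tgt_type), rel_df in dataset.rels.items():
#       edge_index = torch.from_numpy(rels_to_edge_index(rel_df)).long()
#       data[(src_type, rel, tgt_type)].edge_index = edge_index
```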
The error occurred when the parsing of images was around 60-70 percent complete.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Hello - I was running the Google Colab notebook that contains the tutorial on how to get started loading and compiling the MuMiN dataset. When running dataset.compile() under "2.3 Compile Dataset", I received a KeyError regarding 'tweet_id' (below). No modifications were made to the existing code, and I was using a valid Twitter API Bearer Token. The dataset loaded is the small version.
INFO:mumin.dataset:Downloading dataset
Downloading MuMiN: 100%
200M/200M [10:42<00:00, 310kiB/s]
INFO:mumin.dataset:Converting dataset to less compressed format
INFO:mumin.dataset:Loading dataset
INFO:mumin.dataset:Shrinking dataset
INFO:mumin.dataset:Rehydrating tweet nodes
Rehydrating: 0%
0/5261 [00:09<?, ?it/s]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-da99dd72c67a> in <module>
----> 1 dataset.compile()
6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
1777 values = self.axes[axis].get_level_values(key)._values
1778 else:
-> 1779 raise KeyError(key)
1780
1781 # Check for duplicates
KeyError: 'tweet_id'
When parsing articles, it keeps getting BrokenPipeError: [Errno 32] Broken pipe, and the process is terminated.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Rehydrating is really time-consuming; the above process ran for more than 15 hours. So I would really appreciate a checkpoint being added after rehydration. Thanks.
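For what it's worth, a broken pipe at the pool's `put` call often means a worker died mid-task or an exception couldn't be pickled back to the parent. A defensive wrapper like the following (purely illustrative, with a placeholder parser, not mumin-build's actual code) keeps failures inside the worker as plain, picklable tuples:

```python
import multiprocessing as mp

def parse_article(url):
    # Placeholder for the real article parser; raises on a "bad" input.
    if url is None:
        raise ValueError("no url")
    return {"url": url, "ok": True}

def safe_parse(url):
    # Convert any exception into an (ok, payload) tuple of plain types,
    # so nothing unpicklable ever has to travel back through the pipe.
    try:
        return True, parse_article(url)
    except Exception as exc:
        return False, repr(exc)

def parse_all(urls, processes=2):
    # Collect successes and failures separately instead of crashing the pool.
    with mp.Pool(processes=processes) as pool:
        results = pool.map(safe_parse, urls)
    parsed = [payload for ok, payload in results if ok]
    failures = [payload for ok, payload in results if not ok]
    return parsed, failures
```

This doesn't help if a worker is killed outright (e.g. by the OOM killer), but it does stop ordinary parsing errors from taking the whole run down.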
Hello,
Thank you for the great effort in this dataset.
I am running the tutorial you provided on Colab and trying to compile the dataset, but it keeps giving me an SSL error like the following:
SSLError: HTTPSConnectionPool(host='data.bris.ac.uk', port=443): Max retries exceeded with url: /datasets/23yv276we2mll25fjakkfim2ml/23yv276we2mll25fjakkfim2ml.zip (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1091)')))
I believe it's because of the URL of the dataset. How can I fix this error?
Hello! Firstly, I want to say thank you for constructing this awesome data set.
I'm trying to follow all the steps of your tutorial notebook. When compiling the dataset, I always get this error. I guess this may be because some replies can no longer be pulled. Anyway, it doesn't affect the compilation; the whole dataset can still be compiled.
2022-03-20 06:54:13,247 [INFO] Downloading dataset
Downloading MuMiN: 100% 200M/200M [00:05<00:00, 38.6MiB/s]
2022-03-20 06:54:22,308 [INFO] Converting dataset to less compressed format
2022-03-20 06:54:51,718 [INFO] Loading dataset
2022-03-20 06:55:00,892 [INFO] Shrinking dataset
2022-03-20 06:55:01,616 [INFO] Rehydrating tweet nodes Rehydrating: 100%
5261/5261 [00:39<00:00, 123.16it/s]
2022-03-20 06:55:41,218 [INFO] Rehydrating reply nodes Rehydrating: 80%
156600/196080 [1:19:10<16:21, 40.23it/s]
2022-03-20 07:45:05,274 [ERROR] [('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))] Error in rehydrating tweets.
The parameters used were {'expansions': 'attachments.poll_ids,attachments.media_keys,author_id,entities.mentions.username,geo.place_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id', 'media.fields': 'duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'tweet.fields': 'attachments,author_id,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,possibly_sensitive,referenced_tweets,reply_settings,source,text,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'ids': '1419766026049306630,1419179535053082626,1418866617430196228,1419818929938214918,1419819108582019073,1419759924943458306,1419751987718115333,1419815044922187779,1419813735712141319,1419746088198852616,1419670675569192969,1419813104846843904,1418874699375910914,1420165357206179841,1420054853905260545,1419750925590319108,1418149067092283397,1418153826977189891,1418062869745356801,1418054670333923328,1418048322242334720,1418735995319619589,1418904955721822218,1418738661638672385,1418738467983527939,1418739162241507331,1418926908247617543,1418736432655568898,1418753185037041667,1418737465351970819,1418751910283202560,1418738900235915268,1418739856612732935,1418739081136250888,1418736592156508165,1418737256362385415,1418737448331448320,1419028600397963276,1418739597765451779,1418739464340377602,1418738654785294337,1418737963131867139,1418736113531957251,1418738840446119947,1418737527637348353,1418737100422320130,1418748815566376963,1418740966186438658,1418738485125685249,1418736817042608133,1418749708831535106,1418738185715208198,1418763262129328137,1418738418041950208,1418739111905701898,1418740809273421826,1418739763541073920,1418737715940577284,1418738129
348014080,1418738334495645705,1418740520717852678,1418740314974703622,1418736092099104769,1418735990517256193,1418735969050767361,1418709410172579843,1418705709479436293,1418710499286560768,1418713910493032449,1418699589260038146,1418763906940608518,1418699431847813125,1418701413916491779,1418700067633041409,1418699500890243072,1418868262646001664,1418698912211288064,1418703268939763713,1418699607161327616,1418699736450744327,1418702188793278469,1418761089316184067,1418709105720582148,1418698873292345345,1418730257524346882,1418712177511436290,1418700011337093128,1347720016800731136,1347248562283950081,1347238633628164098,1347382097279918080,1347262227712192514,1347233634449821699,1347235750518132738,1348752879142789120,1347360138785644544,1347267576779460608,1347554116277559298,1347475613146370048,1347231636950310912'}.
However, when following the tutorial and calling the dataset.to_dgl()
function on the dataset with added embeddings, I got the following error. I guess this error is due to the missing data above, since the total number of claims is 2168 and all the replies of one particular claim may be missing. I'm not sure about this.
DGLError Traceback (most recent call last)
Input In [25], in <module>
1 if 'dgl_graph' not in globals():
----> 2 dgl_graph = dataset.to_dgl()
3 dgl_graph
File d:\develop\environment\python\python3-8-10\lib\site-packages\mumin\dataset.py:943, in MuminDataset.to_dgl(self)
936 '''Convert the dataset to a DGL dataset.
937
938 Returns:
939 DGLHeteroGraph:
940 The graph in DGL format.
941 '''
942 logger.info('Outputting to DGL')
--> 943 return build_dgl_dataset(nodes=self.nodes, relations=self.rels)
File d:\develop\environment\python\python3-8-10\lib\site-packages\mumin\dgl.py:193, in build_dgl_dataset(nodes, relations)
191 rev_embs = emb_to_tensor(nodes['claim'], 'reviewer_emb')
192 tensors = (claim_embs, rev_embs)
--> 193 dgl_graph.nodes['claim'].data['feat'] = torch.cat(tensors, dim=1)
194 else:
195 dgl_graph.nodes['claim'].data['feat'] = claim_embs
File d:\develop\environment\python\python3-8-10\lib\site-packages\dgl\view.py:84, in HeteroNodeDataView.__setitem__(self, key, val)
80 else:
81 assert isinstance(val, dict) is False, \
82 'The HeteroNodeDataView has only one node type. ' \
83 'please pass a tensor directly'
---> 84 self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
File d:\develop\environment\python\python3-8-10\lib\site-packages\dgl\heterograph.py:4118, in DGLHeteroGraph._set_n_repr(self, ntid, u, data)
4116 nfeats = F.shape(val)[0]
4117 if nfeats != num_nodes:
-> 4118 raise DGLError('Expect number of features to match number of nodes (len(u)).'
4119 ' Got %d and %d instead.' % (nfeats, num_nodes))
4120 if F.context(val) != self.device:
4121 raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
4122 ' device {}. Call DGLGraph.to() to copy the graph to the'
4123 ' same device.'.format(key, F.context(val), self.device))
DGLError: Expect number of features to match number of nodes (len(u)). Got 2168 and 2167 instead.
Is there anything I can do to solve the second error? If it is really caused by this, is there a proper way to remove that particular claim and all data related to it from the compiled dataset?
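In case it helps while waiting for an answer: assuming the claim embeddings live in a `claim_emb` column of `dataset.nodes['claim']` (the column name is my assumption), one way to drop claims without embeddings and renumber the relation tables to match might be:

```python
import pandas as pd

def drop_claims_without_embeddings(claim_df, emb_col="claim_emb"):
    # Keep only claims that actually carry an embedding, remembering the
    # old row positions so relation tables can be remapped afterwards.
    kept = claim_df[claim_df[emb_col].notna()].copy()
    old_to_new = {old: new for new, old in enumerate(kept.index)}
    kept = kept.reset_index(drop=True)
    return kept, old_to_new

def remap_relation(rel_df, old_to_new, col):
    # Drop edges pointing at removed claims and renumber the remaining ones,
    # so src/tgt indices stay aligned with the shrunken node table.
    rel_df = rel_df[rel_df[col].isin(old_to_new)].copy()
    rel_df[col] = rel_df[col].map(old_to_new)
    return rel_df.reset_index(drop=True)
```

Every relation table that references the claim nodes would need the same `remap_relation` treatment before calling to_dgl() again.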
I have been working with a downloaded version of the Mumin Medium dataset to try and retrieve user information based on replies they have made.
I keep trying to retrieve a User's Row ID via the 'user_reply_to_tweet' relationship dataset and there is no corresponding index for some users.
For example:
In the (reply, reply_to, tweet) relationship dataset the fifth src value is 406564.
I then try to retrieve the User's Row ID via the (user, posted, reply) relationship dataset using 406564 as the tgt value.
The row cannot be found; upon closer inspection, the dataset ends with the highest row value being 406562.
I then looked into this further and found 7201 rows that fail to connect a reply to a user.
Any help you can give in resolving this would be great. Is this a result of users removing their accounts?
I have attached the code I have used below:
from mumin import MuminDataset
import pandas as pd

bearer_token = ""
dataset = MuminDataset(
    twitter_bearer_token=bearer_token,
    size="medium",
    include_replies=True,
    include_articles=False,
    include_hashtags=False,
    dataset_path="mumin-medium.zip",
    include_tweet_images=False,
    include_extra_images=False,
)

# Compile
dataset.compile()

# Get all the users
users_list = dataset.nodes["user"]

# Get the (reply, reply_to, tweet) relation
reply_reply_to_tweet = dataset.rels[("reply", "reply_to", "tweet")]
reply_reply_to_tweet.columns = ["rrt_" + c for c in reply_reply_to_tweet.columns]

# Get the (user, posted, reply) relation
user_posted_reply = dataset.rels[("user", "posted", "reply")]
user_posted_reply.columns = ["upr_" + c for c in user_posted_reply.columns]

# Join both relation tables to retrieve the user row for each reply
merged_dataset = pd.merge(
    reply_reply_to_tweet,
    user_posted_reply,
    how="outer",
    left_on="rrt_src",
    right_on="upr_tgt",
)

# Locate rows with no connection
merged_dataset_only_NaN = merged_dataset.loc[merged_dataset["upr_tgt"].isnull()]
print(merged_dataset)
print(merged_dataset_only_NaN)
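A slightly more direct way to count those orphans, using isin instead of an outer merge (this operates on the relation DataFrames before the column renaming above, with the dataset's original src/tgt column names):

```python
import pandas as pd

def find_orphan_replies(reply_to_tweet: pd.DataFrame,
                        user_posted_reply: pd.DataFrame) -> pd.DataFrame:
    # A reply is "orphaned" when no (user, posted, reply) edge points at it,
    # i.e. its row ID never appears as a tgt in user_posted_reply.
    has_author = reply_to_tweet["src"].isin(user_posted_reply["tgt"])
    return reply_to_tweet.loc[~has_author]
```

Deleted or protected accounts would plausibly explain such gaps, since the (user, posted, reply) edge can only be rebuilt if the author is still retrievable.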
Hi everyone!
I can't compile the MuMiN dataset. I installed the library and provided my Twitter bearer token to create the dataset object, but when I try to compile it I receive two error messages. The first one says something related to the Twitter token, but I have already tested the same token in other situations and it works. The second one says that 'tweet_id' couldn't be found. Can you help me?
FIRST ERROR MESSAGE (I hide my client_id using 'xxxxx'):
2023-06-29 20:36:38,903 [INFO] Loading dataset 2023-06-29 20:36:45,029 [INFO] Shrinking dataset 2023-06-29 20:36:46,178 [INFO] Rehydrating tweet nodes Rehydrating: 0%| | 0/5261 [00:00<?, ?it/s]2023-06-29 20:36:46,475 [ERROR] [403] {"client_id":"xxxxx","detail":"When authenticating requests to the Twitter API v2 endpoints, you must use keys and tokens from a Twitter developer App that is attached to a Project. You can create a project via the developer portal.","registration_url":"https://developer.twitter.com/en/docs/projects/overview","title":"Client Forbidden","required_enrollment":"Appropriate Level of API Access","reason":"client-not-enrolled","type":"https://api.twitter.com/2/problems/client-forbidden"}
SECOND ERROR MESSAGE
KeyError Traceback (most recent call last)
Cell In[25], line 1
----> 1 mumin_small.compile()
File ~\AppData\Roaming\Python\Python311\site-packages\mumin\dataset.py:251, in MuminDataset.compile(self, overwrite)
248 self._shrink_dataset()
250 # Rehydrate the tweets
--> 251 self._rehydrate(node_type='tweet')
252 self._rehydrate(node_type='reply')
254 # Update the IDs of the data that was there pre-hydration
File ~\AppData\Roaming\Python\Python311\site-packages\mumin\dataset.py:553, in MuminDataset._rehydrate(self, node_type)
549 self.nodes['user'] = user_df
551 # Add prehydration tweet features back to the tweets
552 self.nodes[node_type] = (self.nodes[node_type]
--> 553 .merge(prehydration_df,
554 on='tweet_id',
555 how='outer')
556 .reset_index(drop=True))
558 # Extract and store images
559 # Note: This will store self.nodes['image'], but this is only
560 # to enable extraction of URLs later on. The
561 # self.nodes['image'] will be overwritten later on.
562 if (node_type == 'tweet' and self.include_tweet_images and
563 len(source_tweet_dfs['media'])):
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py:10090, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
10071 @substitution("")
10072 @appender(_merge_doc, indents=2)
10073 def merge(
(...)
10086 validate: str | None = None,
10087 ) -> DataFrame:
10088 from pandas.core.reshape.merge import merge
10090 return merge(
10091 self,
10092 right,
10093 how=how,
10094 on=on,
10095 left_on=left_on,
10096 right_on=right_on,
10097 left_index=left_index,
10098 right_index=right_index,
10099 sort=sort,
10100 suffixes=suffixes,
10101 copy=copy,
10102 indicator=indicator,
10103 validate=validate,
10104 )
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
93 @substitution("\nleft : DataFrame or named Series")
94 @appender(_merge_doc, indents=0)
95 def merge(
(...)
108 validate: str | None = None,
109 ) -> DataFrame:
--> 110 op = _MergeOperation(
111 left,
112 right,
113 how=how,
114 on=on,
115 left_on=left_on,
116 right_on=right_on,
117 left_index=left_index,
118 right_index=right_index,
119 sort=sort,
120 suffixes=suffixes,
121 indicator=indicator,
122 validate=validate,
123 )
124 return op.get_result(copy=copy)
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\merge.py:703, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
696 self._cross = cross_col
698 # note this function has side effects
699 (
700 self.left_join_keys,
701 self.right_join_keys,
702 self.join_names,
--> 703 ) = self._get_merge_keys()
705 # validate the merge keys dtypes. We may need to coerce
706 # to avoid incompatible dtypes
707 self._maybe_coerce_merge_keys()
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\merge.py:1179, in _MergeOperation._get_merge_keys(self)
1175 if lk is not None:
1176 # Then we're either Hashable or a wrong-length arraylike,
1177 # the latter of which will raise
1178 lk = cast(Hashable, lk)
-> 1179 left_keys.append(left._get_label_or_level_values(lk))
1180 join_names.append(lk)
1181 else:
1182 # work-around for merge_asof(left_index=True)
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\generic.py:1850, in NDFrame._get_label_or_level_values(self, key, axis)
1844 values = (
1845 self.axes[axis]
1846 .get_level_values(key) # type: ignore[assignment]
1847 ._values
1848 )
1849 else:
-> 1850 raise KeyError(key)
1852 # Check for duplicates
1853 if values.ndim > 1:
KeyError: 'tweet_id'