mumin-dataset / mumin-build
Seamlessly build the MuMiN dataset.
License: MIT License
Hi,
first of all, let me say that I like your work very much. The MuMiN project is very interesting and well organized, and I have some idea of how much effort must have been required to build it.
I started using your scripts to download and compile the dataset and, as you mentioned, setting all the flags slows down the overall process. A pool of parallel processes would help speed things up, and Python libraries such as multiprocessing and joblib make that not too complicated.
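A rough sketch of the kind of parallelism I have in mind, using only the standard library. Every name here is illustrative: `rehydrate_batch` is a stand-in for whatever wraps the actual Twitter lookup, not mumin-build's real API.

```python
from concurrent.futures import ThreadPoolExecutor

def rehydrate_batch(ids):
    # Placeholder for the real Twitter API lookup; here we just echo the IDs.
    return [{"tweet_id": i} for i in ids]

def parallel_rehydrate(tweet_ids, batch_size=100, max_workers=8):
    # Split the IDs into API-sized batches and rehydrate them concurrently.
    # Threads (rather than processes) are enough here because the work is
    # I/O-bound, and they avoid pickling issues with the results.
    batches = [tweet_ids[i:i + batch_size]
               for i in range(0, len(tweet_ids), batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves batch order, so results stay aligned with input.
        for batch_result in pool.map(rehydrate_batch, batches):
            results.extend(batch_result)
    return results
```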
But the most urgent change I would like to suggest is some kind of intermediate checkpoint mechanism during the compilation phase. I ran into a technical issue and lost many days of tweet rehydration. This feature would be very much appreciated.
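For illustration, a minimal checkpoint pattern of the kind being suggested. The file name, record shape, and `rehydrate_one` callback are all hypothetical, not something mumin-build implements:

```python
import json
from pathlib import Path

CHECKPOINT = Path("rehydration_checkpoint.jsonl")

def load_done_ids():
    # IDs already rehydrated by a previous (possibly interrupted) run.
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open() as f:
        return {json.loads(line)["tweet_id"] for line in f}

def checkpointed_rehydrate(tweet_ids, rehydrate_one):
    # Skip IDs we already have, appending each new result immediately,
    # so an interrupted run loses at most one tweet's worth of work.
    done = load_done_ids()
    with CHECKPOINT.open("a") as f:
        for tid in tweet_ids:
            if tid in done:
                continue
            record = rehydrate_one(tid)
            f.write(json.dumps(record) + "\n")
```

Append-only JSON Lines keeps each checkpoint write atomic enough in practice, and resuming is just re-reading the file.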
Thanks again for your work and research.
F
Hi, I tried to compile the dataset according to the getting-started guidelines,
but the download step does not seem to work: I get a 404 error when downloading the dataset.
dataset = MuminDataset(twitter_bearer_token=twitter_bearer_token, size='small')
dataset
MuminDataset(size=small, rehydrated=False, compiled=False, bearer_token_available=True)
dataset.compile()
2024-02-23 08:56:22,985 [INFO] Downloading dataset
Traceback (most recent call last):
File "", line 1, in
File "/home/danni/anaconda3/envs/mumin/lib/python3.8/site-packages/mumin/dataset.py", line 259, in compile
self._download(overwrite=overwrite)
File "/home/danni/anaconda3/envs/mumin/lib/python3.8/site-packages/mumin/dataset.py", line 346, in _download
raise RuntimeError(f"[{response.status_code}] {response.content!r}")
RuntimeError: [404] b'not found'
Then I tried to open the link "https://data.bris.ac.uk/datasets/23yv276we2mll25f" from dataset.py, line 115 (download_url: str = ('https://data.bris.ac.uk/datasets/23yv276we2mll25f' 'jakkfim2ml/23yv276we2mll25fjakkfim2ml.zip')). I also got "not found" on the browser page.
Could you please check whether the link still works, or give some advice on how I can solve this problem?
Thanks!
Hi,
thanks for your work. I'm interested in the text of the claims, not only embeddings or keywords. Is it possible to obtain it in any way?
Hi,
thanks for open-sourcing this; your work is very interesting and deserves to be explored further.
Could you consider adding a usage example of PyG (PyTorch Geometric) to your code? That would make a lot of sense to me.
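To second this: the nodes/rels dictionaries seem to map quite directly onto PyG's HeteroData. A sketch of the conversion, using numpy only so it runs without PyG installed; the PyG-specific part is left as comments, and the helper itself is my own illustration, not part of mumin-build:

```python
import numpy as np
import pandas as pd

def rels_to_edge_index(rel_df: pd.DataFrame) -> np.ndarray:
    # PyG expects edge_index as a (2, num_edges) integer array of
    # [source_row_ids; target_row_ids], which is exactly what the
    # src/tgt columns of a relation DataFrame contain.
    return np.stack([rel_df["src"].to_numpy(), rel_df["tgt"].to_numpy()])

# With torch and torch_geometric installed, the rest would look roughly like:
#
#   from torch_geometric.data import HeteroData
#   data = HeteroData()
#   data["claim"].x = torch.tensor(claim_embeddings)
#   for (src_type, rel, tgt_type), rel_df in dataset.rels.items():
#       edge_index = torch.from_numpy(rels_to_edge_index(rel_df)).long()
#       data[(src_type, rel, tgt_type)].edge_index = edge_index
```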
The error occurred when the parsing of images was around 60-70 percent complete.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Hello - I was running the Google Colab notebook that contains the tutorial on how to get started loading and compiling the MuMiN dataset. When running dataset.compile() under "2.3 Compile Dataset", I received a KeyError regarding 'tweet_id' (below). No modifications were made to the existing code, and I was using a valid Twitter API Bearer Token. The dataset loaded is the small version.
INFO:mumin.dataset:Downloading dataset
Downloading MuMiN: 100%
200M/200M [10:42<00:00, 310kiB/s]
INFO:mumin.dataset:Converting dataset to less compressed format
INFO:mumin.dataset:Loading dataset
INFO:mumin.dataset:Shrinking dataset
INFO:mumin.dataset:Rehydrating tweet nodes
Rehydrating: 0%
0/5261 [00:09<?, ?it/s]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-4-da99dd72c67a> in <module>
----> 1 dataset.compile()
6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in _get_label_or_level_values(self, key, axis)
1777 values = self.axes[axis].get_level_values(key)._values
1778 else:
-> 1779 raise KeyError(key)
1780
1781 # Check for duplicates
KeyError: 'tweet_id'
When parsing articles, it keeps getting BrokenPipeError: [Errno 32] Broken pipe, and the process is terminated.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/multiprocessing/pool.py", line 136, in worker
put((job, i, (False, wrapped)))
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 404, in _send_bytes
self._send(header)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
Rehydrating is really time-consuming; the above process ran for more than 15 hours. So I would really appreciate a checkpoint being added after rehydration. Thanks.
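For what it's worth, a broken pipe at the pool's `put` call often means a worker died mid-task or an exception couldn't be pickled back to the parent. A defensive wrapper like the following (purely illustrative, with a placeholder parser, not mumin-build's actual code) keeps failures inside the worker as plain, picklable tuples:

```python
import multiprocessing as mp

def parse_article(url):
    # Placeholder for the real article parser; raises on a "bad" input.
    if url is None:
        raise ValueError("no url")
    return {"url": url, "ok": True}

def safe_parse(url):
    # Convert any exception into an (ok, payload) tuple of plain types,
    # so nothing unpicklable ever has to travel back through the pipe.
    try:
        return True, parse_article(url)
    except Exception as exc:
        return False, repr(exc)

def parse_all(urls, processes=2):
    # Collect successes and failures separately instead of crashing the pool.
    with mp.Pool(processes=processes) as pool:
        results = pool.map(safe_parse, urls)
    parsed = [payload for ok, payload in results if ok]
    failures = [payload for ok, payload in results if not ok]
    return parsed, failures
```

This doesn't help if a worker is killed outright (e.g. by the OOM killer), but it does stop ordinary parsing errors from taking the whole run down.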
Hello,
Thank you for the great effort in this dataset.
I am running the tutorial you provided on Colab and trying to compile the dataset, but it keeps giving me an SSL error like the following:
SSLError: HTTPSConnectionPool(host='data.bris.ac.uk', port=443): Max retries exceeded with url: /datasets/23yv276we2mll25fjakkfim2ml/23yv276we2mll25fjakkfim2ml.zip (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1091)')))
I believe it's because of the URL of the dataset. How can I fix this error?
Hello! Firstly, I want to say thank you for constructing this awesome data set.
I'm trying to follow all the steps of your tutorial notebook. When compiling the dataset, I always get this error. I guess this may be because some replies can no longer be pulled. Anyway, it doesn't affect the compilation; the whole dataset can still be compiled.
2022-03-20 06:54:13,247 [INFO] Downloading dataset
Downloading MuMiN: 100% 200M/200M [00:05<00:00, 38.6MiB/s]
2022-03-20 06:54:22,308 [INFO] Converting dataset to less compressed format
2022-03-20 06:54:51,718 [INFO] Loading dataset
2022-03-20 06:55:00,892 [INFO] Shrinking dataset
2022-03-20 06:55:01,616 [INFO] Rehydrating tweet nodes Rehydrating: 100%
5261/5261 [00:39<00:00, 123.16it/s]
2022-03-20 06:55:41,218 [INFO] Rehydrating reply nodes Rehydrating: 80%
156600/196080 [1:19:10<16:21, 40.23it/s]
2022-03-20 07:45:05,274 [ERROR] [('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))] Error in rehydrating tweets.
The parameters used were {'expansions': 'attachments.poll_ids,attachments.media_keys,author_id,entities.mentions.username,geo.place_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id', 'media.fields': 'duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'tweet.fields': 'attachments,author_id,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,possibly_sensitive,referenced_tweets,reply_settings,source,text,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'ids': '1419766026049306630,1419179535053082626,1418866617430196228,1419818929938214918,1419819108582019073,1419759924943458306,1419751987718115333,1419815044922187779,1419813735712141319,1419746088198852616,1419670675569192969,1419813104846843904,1418874699375910914,1420165357206179841,1420054853905260545,1419750925590319108,1418149067092283397,1418153826977189891,1418062869745356801,1418054670333923328,1418048322242334720,1418735995319619589,1418904955721822218,1418738661638672385,1418738467983527939,1418739162241507331,1418926908247617543,1418736432655568898,1418753185037041667,1418737465351970819,1418751910283202560,1418738900235915268,1418739856612732935,1418739081136250888,1418736592156508165,1418737256362385415,1418737448331448320,1419028600397963276,1418739597765451779,1418739464340377602,1418738654785294337,1418737963131867139,1418736113531957251,1418738840446119947,1418737527637348353,1418737100422320130,1418748815566376963,1418740966186438658,1418738485125685249,1418736817042608133,1418749708831535106,1418738185715208198,1418763262129328137,1418738418041950208,1418739111905701898,1418740809273421826,1418739763541073920,1418737715940577284,1418738129
348014080,1418738334495645705,1418740520717852678,1418740314974703622,1418736092099104769,1418735990517256193,1418735969050767361,1418709410172579843,1418705709479436293,1418710499286560768,1418713910493032449,1418699589260038146,1418763906940608518,1418699431847813125,1418701413916491779,1418700067633041409,1418699500890243072,1418868262646001664,1418698912211288064,1418703268939763713,1418699607161327616,1418699736450744327,1418702188793278469,1418761089316184067,1418709105720582148,1418698873292345345,1418730257524346882,1418712177511436290,1418700011337093128,1347720016800731136,1347248562283950081,1347238633628164098,1347382097279918080,1347262227712192514,1347233634449821699,1347235750518132738,1348752879142789120,1347360138785644544,1347267576779460608,1347554116277559298,1347475613146370048,1347231636950310912'}.
However, when following the tutorial and calling the dataset.to_dgl()
function on the dataset with added embeddings, I got the following error. I guess this error is due to the missing data above, since the total number of claims is 2168 and all the replies of one particular claim may be missing. I'm not sure about this.
DGLError Traceback (most recent call last)
Input In [25], in <module>
1 if 'dgl_graph' not in globals():
----> 2 dgl_graph = dataset.to_dgl()
3 dgl_graph
File d:\develop\environment\python\python3-8-10\lib\site-packages\mumin\dataset.py:943, in MuminDataset.to_dgl(self)
936 '''Convert the dataset to a DGL dataset.
937
938 Returns:
939 DGLHeteroGraph:
940 The graph in DGL format.
941 '''
942 logger.info('Outputting to DGL')
--> 943 return build_dgl_dataset(nodes=self.nodes, relations=self.rels)
File d:\develop\environment\python\python3-8-10\lib\site-packages\mumin\dgl.py:193, in build_dgl_dataset(nodes, relations)
191 rev_embs = emb_to_tensor(nodes['claim'], 'reviewer_emb')
192 tensors = (claim_embs, rev_embs)
--> 193 dgl_graph.nodes['claim'].data['feat'] = torch.cat(tensors, dim=1)
194 else:
195 dgl_graph.nodes['claim'].data['feat'] = claim_embs
File d:\develop\environment\python\python3-8-10\lib\site-packages\dgl\view.py:84, in HeteroNodeDataView.__setitem__(self, key, val)
80 else:
81 assert isinstance(val, dict) is False, \
82 'The HeteroNodeDataView has only one node type. ' \
83 'please pass a tensor directly'
---> 84 self._graph._set_n_repr(self._ntid, self._nodes, {key : val})
File d:\develop\environment\python\python3-8-10\lib\site-packages\dgl\heterograph.py:4118, in DGLHeteroGraph._set_n_repr(self, ntid, u, data)
4116 nfeats = F.shape(val)[0]
4117 if nfeats != num_nodes:
-> 4118 raise DGLError('Expect number of features to match number of nodes (len(u)).'
4119 ' Got %d and %d instead.' % (nfeats, num_nodes))
4120 if F.context(val) != self.device:
4121 raise DGLError('Cannot assign node feature "{}" on device {} to a graph on'
4122 ' device {}. Call DGLGraph.to() to copy the graph to the'
4123 ' same device.'.format(key, F.context(val), self.device))
DGLError: Expect number of features to match number of nodes (len(u)). Got 2168 and 2167 instead.
Is there anything I can do to solve the second error? If it is really caused by this, is there a proper way to remove that particular claim and all data related to it from the compiled dataset?
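In case it helps while waiting for an answer: assuming the claim embeddings live in a `claim_emb` column of `dataset.nodes['claim']` (the column name is my assumption), one way to drop claims without embeddings and renumber the relation tables to match might be:

```python
import pandas as pd

def drop_claims_without_embeddings(claim_df, emb_col="claim_emb"):
    # Keep only claims that actually carry an embedding, remembering the
    # old row positions so relation tables can be remapped afterwards.
    kept = claim_df[claim_df[emb_col].notna()].copy()
    old_to_new = {old: new for new, old in enumerate(kept.index)}
    kept = kept.reset_index(drop=True)
    return kept, old_to_new

def remap_relation(rel_df, old_to_new, col):
    # Drop edges pointing at removed claims and renumber the remaining ones,
    # so src/tgt indices stay aligned with the shrunken node table.
    rel_df = rel_df[rel_df[col].isin(old_to_new)].copy()
    rel_df[col] = rel_df[col].map(old_to_new)
    return rel_df.reset_index(drop=True)
```

Every relation table that references the claim nodes would need the same `remap_relation` treatment before calling to_dgl() again.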
I have been working with a downloaded version of the Mumin Medium dataset to try and retrieve user information based on replies they have made.
I keep trying to retrieve a User's Row ID via the 'user_reply_to_tweet' relationship dataset and there is no corresponding index for some users.
For example:
In the (reply, reply_to, tweet) relationship dataset the fifth src value is 406564.
I then try to retrieve the User's Row ID via the (user, posted, reply) relationship dataset using 406564 as the tgt value.
The row cannot be found; upon closer inspection, the dataset ends with the highest row value being 406562.
I then looked into this further and found 7201 rows that fail to connect a reply to a user.
Any help you can give in resolving this would be great. Is this a result of users removing their accounts?
I have attached the code I have used below:
from mumin import MuminDataset
import pandas as pd

bearer_token = ""
dataset = MuminDataset(
    twitter_bearer_token=bearer_token,
    size="medium",
    include_replies=True,
    include_articles=False,
    include_hashtags=False,
    dataset_path="mumin-medium.zip",
    include_tweet_images=False,
    include_extra_images=False,
)

# Compile
dataset.compile()

# Get all the users
users_list = dataset.nodes["user"]

# Get the (reply, reply_to, tweet) relation
reply_reply_to_tweet = dataset.rels[("reply", "reply_to", "tweet")]
reply_reply_to_tweet.columns = ["rrt_" + c for c in reply_reply_to_tweet.columns]

# Get the (user, posted, reply) relation
user_posted_reply = dataset.rels[("user", "posted", "reply")]
user_posted_reply.columns = ["upr_" + c for c in user_posted_reply.columns]

# Join both relation tables to retrieve the user row for each reply
merged_dataset = pd.merge(
    reply_reply_to_tweet,
    user_posted_reply,
    how="outer",
    left_on="rrt_src",
    right_on="upr_tgt",
)

# Locate rows with no connection
merged_dataset_only_NaN = merged_dataset.loc[merged_dataset["upr_tgt"].isnull()]
print(merged_dataset)
print(merged_dataset_only_NaN)
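A slightly more direct way to count those orphans, using isin instead of an outer merge (this operates on the relation DataFrames before the column renaming above, with the dataset's original src/tgt column names):

```python
import pandas as pd

def find_orphan_replies(reply_to_tweet: pd.DataFrame,
                        user_posted_reply: pd.DataFrame) -> pd.DataFrame:
    # A reply is "orphaned" when no (user, posted, reply) edge points at it,
    # i.e. its row ID never appears as a tgt in user_posted_reply.
    has_author = reply_to_tweet["src"].isin(user_posted_reply["tgt"])
    return reply_to_tweet.loc[~has_author]
```

Deleted or protected accounts would plausibly explain such gaps, since the (user, posted, reply) edge can only be rebuilt if the author is still retrievable.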
Hi everyone!
I can't compile the MuMiN dataset. I installed the library and provided my Twitter bearer token to create the dataset object, but when I try to compile it I receive two error messages. The first one says something related to the Twitter token, but I have already tested the same token in other situations and it works. The second one says that 'tweet_id' couldn't be found. Can you help me?
FIRST ERROR MESSAGE (I hide my client_id using 'xxxxx'):
2023-06-29 20:36:38,903 [INFO] Loading dataset 2023-06-29 20:36:45,029 [INFO] Shrinking dataset 2023-06-29 20:36:46,178 [INFO] Rehydrating tweet nodes Rehydrating: 0%| | 0/5261 [00:00<?, ?it/s]2023-06-29 20:36:46,475 [ERROR] [403] {"client_id":"xxxxx","detail":"When authenticating requests to the Twitter API v2 endpoints, you must use keys and tokens from a Twitter developer App that is attached to a Project. You can create a project via the developer portal.","registration_url":"https://developer.twitter.com/en/docs/projects/overview","title":"Client Forbidden","required_enrollment":"Appropriate Level of API Access","reason":"client-not-enrolled","type":"https://api.twitter.com/2/problems/client-forbidden"}
SECOND ERROR MESSAGE
KeyError Traceback (most recent call last)
Cell In[25], line 1
----> 1 mumin_small.compile()
File ~\AppData\Roaming\Python\Python311\site-packages\mumin\dataset.py:251, in MuminDataset.compile(self, overwrite)
248 self._shrink_dataset()
250 # Rehydrate the tweets
--> 251 self._rehydrate(node_type='tweet')
252 self._rehydrate(node_type='reply')
254 # Update the IDs of the data that was there pre-hydration
File ~\AppData\Roaming\Python\Python311\site-packages\mumin\dataset.py:553, in MuminDataset._rehydrate(self, node_type)
549 self.nodes['user'] = user_df
551 # Add prehydration tweet features back to the tweets
552 self.nodes[node_type] = (self.nodes[node_type]
--> 553 .merge(prehydration_df,
554 on='tweet_id',
555 how='outer')
556 .reset_index(drop=True))
558 # Extract and store images
559 # Note: This will store self.nodes['image'], but this is only
560 # to enable extraction of URLs later on. The
561 # self.nodes['image'] will be overwritten later on.
562 if (node_type == 'tweet' and self.include_tweet_images and
563 len(source_tweet_dfs['media'])):
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\frame.py:10090, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
10071 @substitution("")
10072 @appender(_merge_doc, indents=2)
10073 def merge(
(...)
10086 validate: str | None = None,
10087 ) -> DataFrame:
10088 from pandas.core.reshape.merge import merge
10090 return merge(
10091 self,
10092 right,
10093 how=how,
10094 on=on,
10095 left_on=left_on,
10096 right_on=right_on,
10097 left_index=left_index,
10098 right_index=right_index,
10099 sort=sort,
10100 suffixes=suffixes,
10101 copy=copy,
10102 indicator=indicator,
10103 validate=validate,
10104 )
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
93 @substitution("\nleft : DataFrame or named Series")
94 @appender(_merge_doc, indents=0)
95 def merge(
(...)
108 validate: str | None = None,
109 ) -> DataFrame:
--> 110 op = _MergeOperation(
111 left,
112 right,
113 how=how,
114 on=on,
115 left_on=left_on,
116 right_on=right_on,
117 left_index=left_index,
118 right_index=right_index,
119 sort=sort,
120 suffixes=suffixes,
121 indicator=indicator,
122 validate=validate,
123 )
124 return op.get_result(copy=copy)
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\merge.py:703, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
696 self._cross = cross_col
698 # note this function has side effects
699 (
700 self.left_join_keys,
701 self.right_join_keys,
702 self.join_names,
--> 703 ) = self._get_merge_keys()
705 # validate the merge keys dtypes. We may need to coerce
706 # to avoid incompatible dtypes
707 self._maybe_coerce_merge_keys()
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\reshape\merge.py:1179, in _MergeOperation._get_merge_keys(self)
1175 if lk is not None:
1176 # Then we're either Hashable or a wrong-length arraylike,
1177 # the latter of which will raise
1178 lk = cast(Hashable, lk)
-> 1179 left_keys.append(left._get_label_or_level_values(lk))
1180 join_names.append(lk)
1181 else:
1182 # work-around for merge_asof(left_index=True)
File ~\AppData\Roaming\Python\Python311\site-packages\pandas\core\generic.py:1850, in NDFrame._get_label_or_level_values(self, key, axis)
1844 values = (
1845 self.axes[axis]
1846 .get_level_values(key) # type: ignore[assignment]
1847 ._values
1848 )
1849 else:
-> 1850 raise KeyError(key)
1852 # Check for duplicates
1853 if values.ndim > 1:
KeyError: 'tweet_id'