Comments (61)
I think r/NoStupidQuestions , r/AskReddit , r/answers , r/ExplainLikeImFive and r/AskScience are really good for collecting this kind of data
from open-assistant.
From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT
from open-assistant.
Guys, do you need help speeding up parsing? I can step in and try to help you.
Parsing is not needed as the data is in JSON (python dictionary) but accessing what we need is needed. Have you worked with hyperjson or orjson?
@yk Yes, we can make a beautiful CLI wrapper. What I have now are just prototypes
from open-assistant.
To @yk: @danielpwarren has downloaded the pushshift files from 2005-12 to 2021-06 onto a local server, and he has code adapted from DialoGPT's reddit extractor. We could adopt it.
From my end, I have an end-to-end flow now, but unlike DialoGPT it does not do data preprocessing. So we are good to go if we can use Daniel's DialoGPT adaptation. The only remaining task will be qualifying good questions and answers.
From that we could get JSON:
[{"question": ...,
  "answer1": ...,
  "answer2": ...,
  "answer3": ...},
 {"question": ...,
  ....
 }]
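A hypothetical sketch of how matched submissions and comments could be collapsed into that shape. The field names ("name", "title", "parent_id", "body", "score") follow the pushshift dumps, but the helper itself is my invention, not agreed code:

```python
import json

def to_qa_records(submissions, comments, max_answers=3):
    """Group comments under their parent submission as numbered answers."""
    by_parent = {}
    for c in comments:
        by_parent.setdefault(c["parent_id"], []).append(c)
    records = []
    for s in submissions:
        # best-scored comments first, capped at max_answers
        answers = sorted(by_parent.get(s["name"], []),
                         key=lambda c: c.get("score", 0), reverse=True)
        record = {"question": s["title"]}
        for i, c in enumerate(answers[:max_answers], start=1):
            record[f"answer{i}"] = c["body"]
        records.append(record)
    return records

subs = [{"name": "t3_abc", "title": "What is this thing?"}]
coms = [{"parent_id": "t3_abc", "body": "A heat sink.", "score": 42},
        {"parent_id": "t3_abc", "body": "Looks like a heat sink.", "score": 7}]
print(json.dumps(to_qa_records(subs, coms), indent=2))
```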
from open-assistant.
I can write a multiprocessing version of this, which can speed up the matching; just attach the full file with the code
from open-assistant.
import asyncio
from asyncio.events import AbstractEventLoop
from collections.abc import Generator
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from io import TextIOWrapper
from pathlib import Path

import orjson as json
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


def filter_submissions(submission_blobs, subreddit, num_comments):
    # get 101 submissions with num_comments >= 10
    break_point = 100
    datas_list = []
    for blob in submission_blobs:
        if break_point < 0:
            break
        if (blob["subreddit"] == subreddit and
                blob["num_comments"] >= num_comments):
            print(".", end="")
            break_point -= 1
            datas_list.append(blob)
    # get the ids
    ids = set(b.get("name") for b in datas_list)
    print(f"we have {len(ids)} unique ids")
    return ids


# this takes long just to get 10
def matching(comments_chunk, ids, subreddit):
    break_point = 10
    datac_list = []
    for blob in comments_chunk:
        if blob["subreddit"] != subreddit:
            continue
        if break_point < 0:
            break
        if blob["parent_id"] in ids:
            print(".", end="")
            break_point -= 1
            datac_list.append(blob)
    return datac_list


def generate_chunk(iterable, chunk_len=100):
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:  # do not drop the last partial chunk
        yield chunk


async def main(comment_blobs, ids, subreddit):
    with ProcessPoolExecutor() as process_pool:
        loop: AbstractEventLoop = asyncio.get_running_loop()
        calls = [partial(matching, comment_chunk, ids, subreddit)
                 for comment_chunk in generate_chunk(comment_blobs)]
        call_coros = [loop.run_in_executor(process_pool, call) for call in calls]
        results = await asyncio.gather(*call_coros)
        merged_result = []
        for chunk_result in results:
            merged_result += chunk_result
        return merged_result


if __name__ == "__main__":
    DATA_DIR = Path("./data")  # Path("../data")
    # submissions live in the RS_* dumps, comments in the RC_* dumps
    submission_blobs = map(json.loads, smart_open(DATA_DIR / "RS_2009-04.zst"))
    comment_blobs = map(json.loads, smart_open(DATA_DIR / "RC_2009-04.zst"))
    # params
    subreddit = "whatisthisthing"
    num_comments = 10
    ids = filter_submissions(submission_blobs, subreddit, num_comments)
    matched_comments = asyncio.run(main(comment_blobs, ids, subreddit))
    print(matched_comments)
from open-assistant.
Yeah, these ones
These ones:
r/NoStupidQuestions r/AskReddit r/answers r/ExplainLikeImFive r/AskScience?
You could add
/r/changemyview
/r/tipofmytongue
/r/askculinary
/r/AskAcademia
/r/AskAnthropology
/r/AskAstronomy
/r/AskElectronics
/r/AskEngineers
/r/AskHistorians
/r/AskPhilosophy
/r/AskPhysics
/r/AskScienceFiction
/r/AskSocialScience
/r/AskStatistics
/r/HomeworkHelp
/r/ChemHelp
/r/Estimation
/r/MathHelp
/r/AskRedditAfterDark
/r/TooAfraidToAsk
Should I research some more?
from open-assistant.
Thanks for bringing this to my attention @bitplane. I ended up parsing all of the pushshift files and they are now in BigQuery. If anyone wants access to the raw data, send me a message on discord with your email - Player 1#4315
I still use scraping to collect the top 5 comments and scores for each post. The reddit API provides all comments for a post which results in more data to process. All of r/confessions took about 25GB of bandwidth.
Here is the data for r/confessions
https://www.kaggle.com/datasets/noahpersaud/reddit-confessions-oa
from open-assistant.
If this issue is not assigned to anyone, I would like to work on it
from open-assistant.
I am also available to pick this one up, @SriPrarabdha. We could also work together?
from open-assistant.
Hey, thanks a lot :) I've assigned both of you, feel free to work separately or together.
Remember, we're mainly interested in the scraping and parsing code and some instructions on how to run it all. We have infrastructure to do the data collection and storage, so not really a need on your side to do that part, it's really more about how to obtain and handle the data.
from open-assistant.
@Proteusiq that sounds great! How do you want to get started with this?
from open-assistant.
@Proteusiq that sounds great! How do you want to get started with this?
I have time tomorrow. I could start with a prototype, add snippets here, and we can see how to go about it. What say you?
from open-assistant.
Yeah for sure👍
@Proteusiq that sounds great! How do you want to get started with this?
I have time tomorrow. I could start with a prototype, add snippets here, and we can see how to go about it. What say you?
from open-assistant.
Path to getting data (I have tested with Postman): we can use requests or httpx Sessions.
GET e.g.
https://api.pushshift.io/reddit/search/submission?subreddit=whatisthisthing&size=10
Data can be gathered in time buckets with the before and after params. I will upload a code snippet tomorrow
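A sketch of that time-bucket idea, under the assumption that the pushshift `/search/submission` endpoint accepts epoch `before`/`after` params as described above. The bucket generator is pure Python; the actual GET is left commented out so nothing here touches the network:

```python
from datetime import datetime, timedelta, timezone

def time_buckets(start: datetime, end: datetime, step: timedelta):
    """Yield (after, before) epoch-second pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield int(cursor.timestamp()), int(upper.timestamp())
        cursor = upper

start = datetime(2022, 10, 1, tzinfo=timezone.utc)
end = datetime(2022, 10, 3, tzinfo=timezone.utc)
for after, before in time_buckets(start, end, timedelta(days=1)):
    params = {"subreddit": "whatisthisthing", "size": 100,
              "after": after, "before": before}
    # with httpx.Client(base_url="https://api.pushshift.io/reddit") as client:
    #     r = client.get("/search/submission", params=params)
    print(params)
```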
from open-assistant.
can both of you DM me somehow? discord, twitter, all good :) makes coordination easier
from open-assistant.
can both of you DM me somehow? discord, twitter, all good :) makes coordination easier
Alrighty👍
from open-assistant.
@SriPrarabdha can you collect initial list of subreddits?
from open-assistant.
I've already shared some of the subreddits that we can use and will update if I find some new ones
from open-assistant.
These ones:
r/NoStupidQuestions
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience
?
from open-assistant.
Yeah, these ones
These ones:
r/NoStupidQuestions r/AskReddit r/answers r/ExplainLikeImFive r/AskScience?
from open-assistant.
I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?
from open-assistant.
I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?
upload here or discord.
do you have code for this somewhere in a fork?
from open-assistant.
I have put together the code and JSON file in this repo https://github.com/SriPrarabdha/Reddit-Scrapper
But the main problem is that parsing one post on a subreddit with 15K comments took around 25 minutes, so even scraping one subreddit completely will take a long time
from open-assistant.
@SriPrarabdha I think you are onto something. We can always make the scraper faster. Update on https://api.pushshift.io/reddit/comments/
import pandas as pd
from httpx import Client

HEADERS = {"User-Agent": "Prayson W. Daniel <[email protected]>"}
BASE_URI = "https://api.pushshift.io/reddit"

timeout = 60  # seconds
params = {
    "subreddit": "whatisthisthing",
    "size": 10,
    "score": 20,
    "num_comments": 10,  # has no effect
}

with Client(base_url=BASE_URI, headers=HEADERS) as request:
    print("Fetching submission")
    s = request.get(url="/search/submission",
                    params=params,
                    timeout=timeout)

    print("Fetching comments")
    _ids = ",".join(item.get("id") for item in s.json().get("data"))
    params.update({"ids": _ids})
    c = request.get(url="/search/comment",
                    params=params,
                    timeout=timeout)

# Return only needed columns with `fields`
# merge the submission to the comments
datac = pd.DataFrame(c.json().get("data"))
datas = pd.DataFrame(s.json().get("data"))
I will try downloading files instead from https://files.pushshift.io.
They are huge: RC_2022-10 => 23.8 GB and RS => 9.5 GB.
from open-assistant.
@yk and @SriPrarabdha: Updates on files: it is possible to get the data offline. I downloaded RC and RS files for tests. This is where I am:
import json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path

import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import json
    from pathlib import Path
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
submission_blobs = map(json.loads, submission_objects)

subreddit = "whatisthisthing"
num_comments = 10

# working on finding a faster or better way to do this
datas_gen = (blob for blob in submission_blobs
             if (blob["subreddit"] == subreddit and
                 blob["num_comments"] >= num_comments))
data = pd.DataFrame(datas_gen)
The idea is to get ids and questions from the submissions, and their comments from the comments file. Then merge, group by id, and order by reply time on the comments.
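A minimal sketch of that merge-and-group step with made-up rows; in the real dumps the join keys are the submission's "name" and the comment's "parent_id", as established later in this thread:

```python
import pandas as pd

# toy stand-ins for the datas/datac DataFrames built from the dumps
datas = pd.DataFrame([{"name": "t3_abc", "title": "What happened to Batman?"}])
datac = pd.DataFrame([
    {"parent_id": "t3_abc", "body": "No way", "created_utc": 1665480360},
    {"parent_id": "t3_abc", "body": "Because Catwoman happened",
     "created_utc": 1665479100},
])

# merge submissions to their comments, then order replies by time
merged = datas.merge(datac, left_on="name", right_on="parent_id")
merged = merged.sort_values(["name", "created_utc"])
threads = merged.groupby("name")["body"].apply(list)
print(threads.loc["t3_abc"])
```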
from open-assistant.
looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
from open-assistant.
Guys, do you need help to speed up parsing?
I can step in and try to help you.
from open-assistant.
Parsing is not needed as the data is in JSON (python dictionary) but accessing what we need is needed. Have you worked with hyperjson or orjson?
Actually, didn't have a chance to work with these libraries. But, It's not late to learn something new
from open-assistant.
Also, what kind of trees do you want to build from the json representations?
from open-assistant.
Also, what kind of trees do you want to build from the json representations?
Something like:
id "ABC", submission: "What happened to Batman?"
In comments, we fetch comments where id = "ABC" and sort them by time of reply:
id "ABC", submission: "What happened to Batman?" Time 10:30
id "ABC", comment: "Because Catwoman happened" Time 10:45
id "ABC", comment: "No way" Time 10:46
So we have replies as they come in. The tree is from submission -> earliest comments.
Sometimes the comments can branch out into their own comment threads ...
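The branching case can be sketched like this: comments keyed by parent_id, so replies-to-replies hang off t1_* comment ids while top-level comments hang off the submission's t3_* id. A hypothetical helper, not code from this thread:

```python
def build_tree(submission, comments):
    """Return nested dicts: each node carries its text and its replies."""
    children = {}
    for c in sorted(comments, key=lambda c: c["created_utc"]):
        children.setdefault(c["parent_id"], []).append(c)

    def attach(node_id, text):
        # recurse into whoever replied to this node, in reply-time order
        return {"text": text,
                "replies": [attach(c["name"], c["body"])
                            for c in children.get(node_id, [])]}

    return attach(submission["name"], submission["title"])

sub = {"name": "t3_abc", "title": "What happened to Batman?"}
coms = [
    {"name": "t1_x", "parent_id": "t3_abc",
     "body": "Because Catwoman happened", "created_utc": 1},
    {"name": "t1_y", "parent_id": "t1_x", "body": "No way", "created_utc": 2},
]
tree = build_tree(sub, coms)
```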
Updates: Using a generator allows me to keep calling and stopping in Jupyter. Getting submissions is fast, but matching them to comments takes forever
# instead of json
import orjson as json
...
break_point = 100
datas_list = []
for blob in blobs:
    if break_point < 0:
        break
    if (blob["subreddit"] == subreddit and
            blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

ids = set(b.get("id") for b in datas_list)
print(f"number of {ids=}")

com_objects = smart_open(DATA_DIR / "RC_2022-10.zst")
blobc = map(json.loads, com_objects)

## just to see how long it takes to get 10 matches :(
break_point = 10
datac_list = []
for blob in blobc:
    if blob["subreddit"] != subreddit:
        continue
    if break_point < 0:
        break
    print(".", end="")
    if blob["id"] in ids:
        print("X", end="")
        break_point -= 1
        datac_list.append(blob)
...
It could be I am matching on the wrong thing. Maybe in the comments I need parent_id. I will keep searching
from open-assistant.
I can write a multiprocessing version of this, which can speed up the matching; just attach the full file with the code
Super! I got it working now. In submissions, I needed "name", and in comments "parent_id"
Note: the prints are just for debugging and need to be removed
Full code
import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path

import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str, None, None]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
comment_objects = smart_open(DATA_DIR / "RC_2022-10.zst")

submission_blobs = map(json.loads, submission_objects)
comment_blobs = map(json.loads, comment_objects)

# params
subreddit = "whatisthisthing"
num_comments = 10

# get 101 submissions with num_comments >= 10
break_point = 100
datas_list = []
for blob in submission_blobs:
    if break_point < 0:
        break
    if (blob["subreddit"] == subreddit and
            blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

# get the ids
ids = set(b.get("name") for b in datas_list)
print(f"we have {len(ids)} unique ids")

# this takes long just to get 10
break_point = 10
datac_list = []
for blob in comment_blobs:
    if blob["subreddit"] != subreddit:
        continue
    if break_point < 0:
        break
    if blob["parent_id"] in ids:
        print(".", end="")
        break_point -= 1
        datac_list.append(blob)
# merging of data ...
from open-assistant.
From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT
That would be perfect 😍: looks like we are reinventing the wheel https://github.com/microsoft/DialoGPT/blob/master/reddit_extractor/src/reddit.py
from open-assistant.
Made some refactoring; please update your DATA_DIR and smart_open paths, if this is still relevant.
Also, I think it's better to use a bigger chunk_len (about 50000)
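One way the chunking could be written, using itertools.islice so that no empty first chunk is yielded and the last partial chunk is kept; chunk_len is a tuning knob (the suggestion above is roughly 50000 for the real dumps). A sketch, not the code from the thread:

```python
from itertools import islice

def generate_chunk(iterable, chunk_len=50_000):
    """Yield lists of up to chunk_len items until the iterable is exhausted."""
    it = iter(iterable)
    while chunk := list(islice(it, chunk_len)):
        yield chunk

chunks = list(generate_chunk(range(7), chunk_len=3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```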
from open-assistant.
Hi, I would like to help. I am following along; this is great progress so far. Maybe I could go after some other sources of data while you are focused on Reddit. My question, @yk, @Proteusiq: what is the format we wish to end up with? Is it a JSON schema, and have we determined that, or is it something we are working towards? I am familiar with web scraping etc. but not with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective, and get them scraped and formatted correctly. Thanks
from open-assistant.
Hi, I would like to help. I am following along; this is great progress so far. Maybe I could go after some other sources of data while you are focused on Reddit. My question, @yk, @Proteusiq: what is the format we wish to end up with? Is it a JSON schema, and have we determined that, or is it something we are working towards? I am familiar with web scraping etc. but not with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective, and get them scraped and formatted correctly. Thanks
yes, I think a common json schema (or parquet, protobuf, or something) totally makes sense. @lewtun what do you think?
from open-assistant.
@yk and @Proteusiq I have made a simple typer CLI application and made it available on PyPI: https://pypi.org/project/reddit-comment-scrapper/
Any suggestions on how to make it better?
looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
from open-assistant.
From my end, I have an end-to-end flow now, but unlike DialoGPT it does not do data preprocessing. So we are good to go if we can use Daniel's DialoGPT adaptation. The only remaining task will be qualifying good questions and answers.
sweet, thank you very much! make sure to retain DialoGPT's MIT header :)
Once you're done, could you make a PR with the code? @lewtun any comments on how & where?
from open-assistant.
I've put my modified code up on danielpwarren/reddit-extractor. It's not great and I don't have much time to work on it atm. I'll run it locally with the aforementioned subreddits and post here when it's done. The data is currently output in tsv format and there's an example in the repo.
from open-assistant.
In #282 @andrewm4894 suggests r/amitheasshole
It could be a way to convert this into more structured training data that might actually encode a lot of nuance.
There are lots of rules and heuristics in that subreddit, so we could extract or convert it into a sort of soft-label dataset that might be useful.
Apologies if this is a dupe, as I am sure reddit data is already on the roadmap; my point is more that there could be a subset of subreddits that could be enriched or transformed in some way to make them even more useful.
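A hypothetical sketch of that soft-label idea: top-level r/amitheasshole replies conventionally open with a verdict (NTA, YTA, ESH, NAH, INFO), so score-weighted verdict counts could give a soft label per post. The helper and weighting scheme are my invention, not an agreed design:

```python
from collections import Counter

VERDICTS = ("NTA", "YTA", "ESH", "NAH", "INFO")

def soft_label(comments):
    """Return verdict -> probability, weighting each vote by its score."""
    weights = Counter()
    for c in comments:
        # take the first word of the comment and normalise punctuation
        verdict = c["body"].lstrip().split()[0].upper().strip(".,:")
        if verdict in VERDICTS:
            weights[verdict] += max(c.get("score", 1), 1)
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()} if total else {}

coms = [{"body": "NTA, they were out of line", "score": 90},
        {"body": "YTA. You knew better.", "score": 10}]
print(soft_label(coms))  # {'NTA': 0.9, 'YTA': 0.1}
```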
from open-assistant.
For data sources like this: would/could/should we have some sort of example dummy data as a target for what is needed in terms of format or structure, before we do any work on it?
I can imagine a lot of issues getting created with source suggestions, and it could help cut down on noise if there were some clear "target templates" or something that people could try to stick to.
Still only getting up to speed, so apologies if this is already done or might create too much friction right now - thoughts?
from open-assistant.
For data sources like this - would/could/should we have some sort of example dummy data as a sort of target of what is needed in terms of format or structure before we do any work on it?
probably @lewtun is the person to talk to for this
from open-assistant.
@SriPrarabdha or @Proteusiq - can we get a sample set of data (< 100) to see if we can convert into instructions?
from open-assistant.
@Proteusiq and
@SriPrarabdha
checking on status. thank you!
from open-assistant.
Hej @ontocord
I saw the issue closed, so I assumed that @danielpwarren's way was the path forward. @danielpwarren, do you have the samples? Otherwise, I could extract them from my script tomorrow.
from open-assistant.
@Proteusiq issue is still open.
from open-assistant.
@Proteusiq is this issue still active?
if so I'd like to contribute
from open-assistant.
Yes, it is. We are missing the CLI part
from open-assistant.
@Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell @danielpwarren has a very sophisticated workflow
from open-assistant.
@Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell @danielpwarren has a very sophisticated workflow
Hi, @Anan-Saadi
We are missing two things:
- data in the correct format
- a typer CLI which accepts a toml file of instructions
Data format:
- JSON [{"question": ...,
         "answer1": ...,
         "answer2": ...,
         "answer3": ...},
        {"question": ...,
         ....
        }]
CLI:
- accepts a toml file that contains a list of subreddits and two downloaded files (submissions and comments)
We have starter code already; my work just keeps me too busy to complete it...
from open-assistant.
@Proteusiq Ok I'll see what I can do in the coming few days
from open-assistant.
Lots of candidate subreddits here:
https://redditlist.com/search?adultfilter=0&searchterm=ask
from open-assistant.
Hey there @Proteusiq @SriPrarabdha I was wondering if you guys had any updates on this? I would love to help with this too since I need a similar setup to scale up #1967. I have been using PRAW and api.pushshift but reached the limits of how much data we can get from them. Using the files from files.pushshift as you guys discussed seems like a very good idea and I was wondering if we can work on this together.
@Proteusiq do you have a PR or repo with the code?
from open-assistant.
No! The code is in this thread above. ☝️
from open-assistant.
So I'm not a dev, but I've been able to scrape some other things in the last few days with the help of Bing.
Why aren't you guys using PRAW, beautifulsoup or Scrapy instead?
And can someone summarize the following: what's the desired output format; what subreddits should be scraped (apart from the ones listed); and what criteria should the scraping have (e.g. top 1000 posts from each subreddit, or with specific tags or keywords)?
If someone can answer these things, I think I should be able to do it, if someone explains where I should upload the data, since I expect there's going to be a lot.
from open-assistant.
I have a working version that bypasses the rate limit, provides access to all content (NSFW and non-NSFW) and can scrape whole subreddits. I used scrapy.
Right now the program scrapes a post and then the post's top 5 replies. I am currently working on integrating a database to avoid duplicates.
from open-assistant.
Why aren't you guys using PRAW, beautifulsoup or Scrapy instead?
Usually you'd want to use an API where possible, it's the more correct and respectful thing to do (saves rendering resources). Also API designers have probably thought of all sorts of edge cases that you'd figure out bit by bit while writing a scraper.
But that said, we've got monthly Reddit dumps on files.pushshift.io that look exhaustive, so it's best to just use them. Not sure if they contain quarantined subreddits, which would be useful for filtering things. But IMO it's best to just transform everything on Reddit into our format in one dataset, make it work in an incremental way as new months are added, and filter it for different uses as needed.
from open-assistant.
Right now the program scrapes a post and then the post's top 5 replies. I am currently working on integrating a database to avoid duplicates.
The API does this, and the files.pushshift.io data has dupes removed. I strongly recommend using that instead if possible
from open-assistant.
I realize now that I did not make the kaggle dataset public. Here is the proper link: https://www.kaggle.com/datasets/noahpersaud/reddit-confessions-oa
Also, here is aita https://www.kaggle.com/datasets/noahpersaud/reddit-amitheasshole-oa.
Here is the link to the scraper repo: https://github.com/P1ayer-1/Reddit-Convo-Tree-Builder
I didn't have time to do a proper readme, so it isn't great and I might've forgotten stuff. Just let me know if anyone runs into any issues.
Here is the csv file I used for aita that can serve as a demo until full pushshift csvs are uploaded.
https://www.kaggle.com/datasets/noahpersaud/176-million-ramitheasshole-submissions
I intend to upload my raw parsed pushshift files on Kaggle soon.
from open-assistant.
https://commoncrawl.org can also be used to get a reddit dump.
from open-assistant.
Closing old data issue that has not been completed by now.
from open-assistant.