
Comments (61)

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024 5

I think r/NoStupidQuestions, r/AskReddit, r/answers, r/ExplainLikeImFive and r/AskScience are really good for collecting this kind of data.

from open-assistant.

danielpwarren avatar danielpwarren commented on May 2, 2024 4

From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024 2

Guys, do you need help speeding up parsing? I can step in and try to help you.

Parsing is not needed, as the data is already JSON (a Python dictionary), but we do need an efficient way to access what we need. Have you worked with hyperjson or orjson?

@yk Yes, we can make a beautiful CLI wrapper. What I have now are just prototypes

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024 2

To @yk: @danielpwarren has downloaded files from pushshift from 2005-12 to 2021-06 to a local server. He has a code adaptation from DialoGPT. We could adopt it.

From my end, I have an end-to-end flow now, but unlike DialoGPT it does not have data preprocessing. So we are good to go if we can use Daniel's DialoGPT adaptation. The only remaining task will be qualifying good questions and answers.

[image]

From that we could get JSON:

[{question:
  answer1:
  answer2:
  answer3:},
 {question:
  ....
 }]

from open-assistant.

doroshroman avatar doroshroman commented on May 2, 2024 1

I can write a multiprocessing version of this, which can speed up matching; just attach the full file with the code.

from open-assistant.

doroshroman avatar doroshroman commented on May 2, 2024 1
import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen
import asyncio
from asyncio.events import AbstractEventLoop
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def smart_open(file_path: Path) -> Generator[str]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path
    
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob


def filter_submissions(submission_blobs, subreddit, num_comments):
    # get 101 submissions with num_comments >= 10
    break_point = 100
    datas_list = [] 
    for blob in submission_blobs:
        if break_point < 0:
            break
        
        if (blob["subreddit"] == subreddit and 
            blob["num_comments"] >= num_comments):
            print(".", end="")
            break_point -= 1
            datas_list.append(blob)

    # get the ids
    ids = set(b.get("name") for b in datas_list)
    print(f"we have {len(ids)} unique ids")
    
    return ids


# this takes a long time just to get 10 matches
def matching(comments_chunk, ids, subreddit):
    break_point = 10
    datac_list = [] 
    for blob in comments_chunk:
        if blob["subreddit"] != subreddit:
            continue
        
        if break_point < 0:
            break
        if blob["parent_id"] in ids:
            print(".", end="")
            break_point -= 1
            datac_list.append(blob)
            
    return datac_list


def generate_chunk(iterable, chunk_len=100):
    # yield lists of up to chunk_len items, including the final partial chunk
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == chunk_len:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


async def main(comment_blobs, ids, subreddit):
    with ProcessPoolExecutor() as process_pool:
        loop: AbstractEventLoop = asyncio.get_running_loop()
        calls = [
            partial(matching, comment_chunk, ids, subreddit)
            for comment_chunk in generate_chunk(comment_blobs)
        ]
        call_coros = []
        
        
        for call in calls:
            call_coros.append(loop.run_in_executor(process_pool, call))
            
        results = await asyncio.gather(*call_coros)
        
        merged_result = []
        for chunk_result in results:
            merged_result += chunk_result
            
    return merged_result


if __name__ == '__main__':
    DATA_DIR = Path("./data") #Path("../data")
    submission_objects, comment_objects, comment_objects_copy = tee(smart_open(DATA_DIR / "RC_2009-04.zst"), 3)

    submission_blobs = map(json.loads, submission_objects)
    comment_blobs = map(json.loads, comment_objects)
    comment_blobs_copy = map(json.loads, comment_objects_copy)

    # params
    subreddit = "whatisthisthing"
    num_comments = 10
    
    ids = filter_submissions(submission_blobs, subreddit, num_comments)

    matched_comments = asyncio.run(main(comment_blobs, ids, subreddit))
    print(matched_comments)
        

from open-assistant.

michaelbogdan avatar michaelbogdan commented on May 2, 2024 1

Yeah, these ones

These ones:

r/NoStupidQuestions 
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience

?

You could add

/r/changemyview
/r/tipofmytongue
/r/askculinary
/r/AskAcademia
/r/AskAnthropology
/r/AskAstronomy
/r/AskElectronics
/r/AskEngineers
/r/AskHistorians
/r/AskPhilosophy
/r/AskPhysics
/r/AskScienceFiction
/r/AskSocialScience
/r/AskStatistics
/r/HomeworkHelp
/r/ChemHelp
/r/Estimation
/r/MathHelp
/r/AskRedditAfterDark
/r/TooAfraidToAsk

Should I research some more?

from open-assistant.

P1ayer-1 avatar P1ayer-1 commented on May 2, 2024 1

Thanks for bringing this to my attention @bitplane. I ended up parsing all of the pushshift files and they are now in BigQuery. If anyone wants access to the raw data, send me a message on discord with your email - Player 1#4315

I still use scraping to collect the top 5 comments and scores for each post. The reddit API provides all comments for a post, which results in more data to process. All of r/confessions took about 25GB of bandwidth.

Here is the data for r/confessions
https://www.kaggle.com/datasets/noahpersaud/reddit-confessions-oa

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

If this issue is not assigned to anyone, I would like to work on it.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

I am also available to pick this one up, @SriPrarabdha. Perhaps we could work together?

from open-assistant.

yk avatar yk commented on May 2, 2024

Hey, thanks a lot :) I've assigned both of you, feel free to work separately or together.

Remember, we're mainly interested in the scraping and parsing code and some instructions on how to run it all. We have infrastructure to do the data collection and storage, so there's not really a need on your side to do that part; it's really more about how to obtain and handle the data.

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

@Proteusiq that sounds great! How do you want to get started with this?

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

@Proteusiq that sounds great! How do you want to get started with this?

I have time tomorrow. I could start with a prototype and add snippets here, and we can see how to go about it. What say you?

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

Yeah for sure👍

@Proteusiq that sounds great! How do you want to get started with this?

I have time tomorrow. I could start with a prototype and add snippets here, and we can see how to go about it. What say you?

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

Path to getting the data. I have tested with Postman; we can use requests or httpx sessions.

GET e.g.

https://api.pushshift.io/reddit/search/submission?subreddit=whatisthisthing&size=10

Data can be gathered in time buckets with the before and after params. I will upload a code snippet tomorrow.

API params
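For illustration, a minimal sketch of the time-bucket idea with an httpx session; the exact parameter semantics (epoch seconds for before/after, the size limit) are assumptions about the pushshift API, not a tested recipe:

import time
from httpx import Client

BASE_URI = "https://api.pushshift.io/reddit"
HEADERS = {"User-Agent": "open-assistant data collection"}

DAY = 24 * 60 * 60
now = int(time.time())

with Client(base_url=BASE_URI, headers=HEADERS) as client:
    # walk backwards one day at a time, pulling up to 100 submissions per bucket
    for bucket in range(7):
        params = {
            "subreddit": "whatisthisthing",
            "size": 100,
            "after": now - (bucket + 1) * DAY,
            "before": now - bucket * DAY,
        }
        response = client.get("/search/submission", params=params, timeout=60)
        submissions = response.json().get("data", [])
        print(f"bucket {bucket}: {len(submissions)} submissions")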

from open-assistant.

yk avatar yk commented on May 2, 2024

can both of you DM me somehow? discord, twitter, all good :) makes coordination easier

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

can both of you DM me somehow? discord, twitter, all good :) makes coordination easier

Alrighty👍

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

@SriPrarabdha can you collect an initial list of subreddits?

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

I've already shared some of the subreddits that we can use and will update if I find new ones.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

These ones:

r/NoStupidQuestions 
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience

?

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

Yeah, these ones

These ones:

r/NoStupidQuestions 
r/AskReddit
r/answers
r/ExplainLikeImFive
r/AskScience

?

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?

from open-assistant.

yk avatar yk commented on May 2, 2024

I have collected initial data in JSON form while preserving the graph structure of the comments. How should I share it with you guys to have a look?

upload here or discord.

do you have code for this somewhere in a fork?

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

I have put together the code and JSON file in this repo: https://github.com/SriPrarabdha/Reddit-Scrapper
But the main problem is that parsing one post on a subreddit with 15K comments took around 25 minutes, so even scraping one subreddit completely will take a long time.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

@SriPrarabdha I think you are onto something. We can always make the scraper faster. An update on https://api.pushshift.io/reddit/comments/:

import pandas as pd
from httpx import Client

HEADERS = {"User-Agent": "Prayson W. Daniel <[email protected]>"}
BASE_URI = "https://api.pushshift.io/reddit"


timeout = 60  # seconds
subreddit = "whatisthisthing"
size = 10
score = 20
num_comments = 10  # has no effect

# query params (pushshift accepts comparison strings like ">20" for score)
params = {"subreddit": subreddit, "size": size, "score": f">{score}"}

with Client(base_url=BASE_URI, headers=HEADERS) as request:
    
    print("Fetching submission")
    s = request.get(url="/search/submission",
                    params=params,
                    timeout=timeout)
    
    print("Fetching comments")
    _ids = ",".join(item.get('id') for item in s.json().get("data"))
    params.update({"ids":_ids})
    c = request.get(url="/search/comment",
                    params=params,
                    timeout=timeout)
                    

# Return only needed columns with `fields`
# merge the submission to the comments

datac = pd.DataFrame(c.json().get('data'))
datas = pd.DataFrame(s.json().get('data'))

I will try downloading files instead from https://files.pushshift.io.

They are huge: RC 2022-10 is 23.8 GB and RS is 9.5 GB.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

@yk and @SriPrarabdha: Updates on the files: it is possible to get the data offline. I downloaded RC and RS files for testing. This is where I am:

import json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str]:
    """
    Use:
    ```python
    import json
    from pathlib import Path
    
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
            
DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
submission_blobs = map(json.loads, submission_objects)

subreddit = "whatisthisthing"
num_comments = 10 

# working on finding a faster or better way to do this
datas_gen = (blob for blob in submission_blobs
             if (blob["subreddit"] == subreddit and
                 blob["num_comments"] >= num_comments)
)

data = pd.DataFrame(datas_gen)

The idea is to get the ids and questions from the submissions and their comments from the comments file, then merge, group by id, and order by reply time on the comments.
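A minimal sketch of that merge step, continuing the snippet above; it assumes the usual pushshift fields ("name" and "title" on submissions, "parent_id", "body" and "created_utc" on comments) and a comment_gen generator filtered from the RC file the same way as datas_gen:

# `data` holds the filtered submissions from the snippet above;
# comment_gen is assumed: an equivalent generator over the RC_* dump
datac = pd.DataFrame(comment_gen)

# attach each comment to its submission ("name" is the t3_* id, "parent_id" points to it)
pairs = datac[["parent_id", "body", "created_utc"]].merge(
    data[["name", "title"]],
    left_on="parent_id",
    right_on="name",
)

# one group per submission, replies ordered by reply time
qa = (
    pairs.sort_values("created_utc")
         .groupby("name")["body"]
         .apply(list)
)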

from open-assistant.

yk avatar yk commented on May 2, 2024

looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?
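A minimal typer sketch of what that could look like (flag names and defaults are illustrative, not a decided interface):

from pathlib import Path
import typer

app = typer.Typer()

@app.command()
def extract(
    data_dir: Path = typer.Option(Path("./data"), help="Directory holding the RS_*/RC_* pushshift dumps"),
    subreddit: str = typer.Option("whatisthisthing", help="Subreddit to extract"),
    num_comments: int = typer.Option(10, help="Minimum number of comments per submission"),
):
    """Filter submissions and match comments from local pushshift dumps."""
    typer.echo(f"Reading {data_dir} for r/{subreddit} (num_comments >= {num_comments})")
    # ... plug in the filter/match code from the snippets above ...

if __name__ == "__main__":
    app()

It could then be run as e.g. python extract.py --data-dir ../data --subreddit AskScience.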

from open-assistant.

doroshroman avatar doroshroman commented on May 2, 2024

Guys, do you need help to speed up parsing?
I can step in and try to help you.

from open-assistant.

doroshroman avatar doroshroman commented on May 2, 2024

Parsing is not needed, as the data is already JSON (a Python dictionary), but we do need an efficient way to access what we need. Have you worked with hyperjson or orjson?

Actually, I didn't have a chance to work with these libraries. But it's never too late to learn something new.

from open-assistant.

doroshroman avatar doroshroman commented on May 2, 2024

Also, what kind of trees do you want to build from the JSON representations?

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

Also, what kind of trees do you want to build from the JSON representations?

Something like:
id "ABC", submission: "What happened to Batman?"
In comments, we fetch the comments where id = "ABC"
and sort them by time of reply:

 id "ABC", submission: "What happened to Batman?"  Time 10:30
 id "ABC", comment: "Because Catwoman happened" Time 10:45
 id "ABC", comment: "No way" Time 10:46

So we have the replies as they came in. The tree goes from submission -> earliest comments.

Sometimes the comments can branch out into their own comment threads ...
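A rough plain-Python sketch of that grouping (top-level replies only, leaving the branching sub-threads out), using the "name"/"parent_id" matching that gets worked out further down the thread; field names are the standard pushshift ones and the helper is purely illustrative:

from collections import defaultdict

def group_replies(submissions, comments):
    # map the t3_* submission id to its title (the "question")
    questions = {s["name"]: s["title"] for s in submissions}

    # collect top-level comments under their submission
    replies = defaultdict(list)
    for c in comments:
        if c["parent_id"] in questions:
            replies[c["parent_id"]].append(c)

    # order each submission's replies by time of posting
    return {
        sid: {
            "question": questions[sid],
            "answers": [c["body"] for c in sorted(cs, key=lambda c: c["created_utc"])],
        }
        for sid, cs in replies.items()
    }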

Updates: Using a generator allows me to keep calling and stopping in Jupyter. Getting submissions is fast, but matching them to comments takes forever.

# instead of json
import orjson as json
...

break_point = 100
datas_list = [] 
for blob in blobs:
    if break_point < 0:
        break
    
    if (blob["subreddit"] == subreddit and 
        blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)
 
 ids = set(b.get("id") for b in datas_list)
print(f"number of {ids=}")

com_objects = smart_open(DATA_DIR / "RC_2022-10.zst")
blobc = map(json.loads, com_objects)

# just to see how long it takes to get 10 matches :(
break_point = 10
datac_list = [] 
for blob in blobc:
    if blob["subreddit"] != subreddit:
        continue
    
    if break_point < 0:
        break
    print(".", end="")
    if blob["id"] in ids:
        print("X", end="")
        break_point -= 1
        datac_list.append(blob)
...

Could be I am matching on the wrong fields. Maybe in the comments I need parent_id. I will keep on searching.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

I can write a multiprocessing version of this, which can speed up matching; just attach the full file with the code.

Super! I got it working now. In submissions, I needed "name", and in comments, "parent_id".

Note: the prints are just for debugging and need to be removed.

Full code

import orjson as json
from collections.abc import Generator
from io import TextIOWrapper
from pathlib import Path
import pandas as pd
from zstandard import ZstdDecompressor, open as zopen


def smart_open(file_path: Path) -> Generator[str]:
    """
    Use:
    ```python
    import orjson as json
    from pathlib import Path
    
    blobs = map(json.loads, smart_open(file_path=Path("example.zst")))
    needed = {blob.get("needed") for blob in blobs}
    ```
    """
    DCTX = ZstdDecompressor(max_window_size=2**31)
    with zopen(file_path, mode="rb", dctx=DCTX) as z, TextIOWrapper(z) as f:
        for blob in f:
            yield blob
            
DATA_DIR = Path("../data")
submission_objects = smart_open(DATA_DIR / "RS_2022-10.zst")
comment_objects = smart_open(DATA_DIR / "RC_2022-10.zst")

submission_blobs = map(json.loads, submission_objects)
comment_blobs = map(json.loads, comment_objects)

# params
subreddit = "whatisthisthing"
num_comments = 10 

# get 101 submissions with num_comments >= 10
break_point = 100
datas_list = [] 
for blob in submission_blobs:
    if break_point < 0:
        break
    
    if (blob["subreddit"] == subreddit and 
        blob["num_comments"] >= num_comments):
        print(".", end="")
        break_point -= 1
        datas_list.append(blob)

# get the ids
ids = set(b.get("name") for b in datas_list)
print(f"we have {len(ids)} unique ids"}

# this takes a long time just to get 10 matches
break_point = 10
datac_list = [] 
for blob in comment_blobs:
    if blob["subreddit"] != subreddit:
        continue
    
    if break_point < 0:
        break
    if blob["parent_id"] in ids:
        print(".", end="")
        break_point -= 1
        datac_list.append(blob)

# merging of data ...

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

From a previous project of mine I have all the reddit comments and submissions on pushshift from 2005-12 to 2021-06 stored on a local server, as well as some code to scrape it. It may be easier for me to scrape the data locally and submit it as a json. The code I have is originally adapted from DialoGPT's reddit extractor, it may be helpful to give it a look. https://github.com/microsoft/DialoGPT

That would be perfect 😍: looks like we are reinventing the wheel https://github.com/microsoft/DialoGPT/blob/master/reddit_extractor/src/reddit.py

from open-assistant.

doroshroman avatar doroshroman commented on May 2, 2024

I did some refactoring; please update your DATA_DIR and smart_open paths, if it's still relevant.

Also, I think it's better to use a bigger chunk_len (around 50,000).

from open-assistant.

emersonium avatar emersonium commented on May 2, 2024

Hi, I would like to help. I am following along; this is great progress so far. Maybe go after some other sources of data while you are focused on Reddit. My question, @yk, @Proteusiq: what format do we wish to end up with? Is it a JSON schema? Have we determined that, or is it something we are still working towards? I am familiar with web scraping etc. but not familiar with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective and get them scraped and formatted correctly. Thanks.

from open-assistant.

yk avatar yk commented on May 2, 2024

Hi, I would like to help. I am following along; this is great progress so far. Maybe go after some other sources of data while you are focused on Reddit. My question, @yk, @Proteusiq: what format do we wish to end up with? Is it a JSON schema? Have we determined that, or is it something we are still working towards? I am familiar with web scraping etc. but not familiar with NLP and what an ideal format for the data is. I understand the MVP objective though, so if we can have some clarity, I could go look for other potential sources that might work for the "question > answer-thread" conversational objective and get them scraped and formatted correctly. Thanks.

yes I think a common json schema (or parquet, protobuf, or something) totally makes sense. @lewtun what do you think?

from open-assistant.

SriPrarabdha avatar SriPrarabdha commented on May 2, 2024

@yk and @Proteusiq I have made a simple typer CLI application and made it available on PyPI: https://pypi.org/project/reddit-comment-scrapper/
Any suggestions on how to make it better?

looks pretty neat so far, nice work! is there a chance we could use something like typer or so, to make this into a script that takes flags to define things like data location etc?

from open-assistant.

yk avatar yk commented on May 2, 2024

From my end, I have end-to-end flow now but unlike DialoGPT, it does not have data preprocessing. So we are good to go if we can use DialoGPT Daniel's adoption. The only left task will be qualifying good questions and answers.

sweet, thank you very much! make sure to retain DialoGPT's MIT header :)
Once you're done, could you make a PR with the code? @lewtun any comments on how & where?

from open-assistant.

danielpwarren avatar danielpwarren commented on May 2, 2024

I've modified the code and put it up on danielpwarren/reddit-extractor. It's not great and I don't have much time to work on it atm. I'll run it locally with the aforementioned subreddits and post here when it's done. The data is currently output in TSV format and there's an example in the repo.

from open-assistant.

yk avatar yk commented on May 2, 2024

In #282 @andrewm4894 suggests r/amitheasshole

It could be a way to convert this into more structured training data that might actually encode a lot of nuance.
There are lots of rules and heuristics in that subreddit, so we could extract or convert it into a sort of soft-label dataset that could maybe be useful.
Apologies if this is a dupe, as I am sure reddit data is already on the roadmap; it's more that there could be a subset of subreddits that could be enriched or transformed in some way to make them even more useful.

from open-assistant.

andrewm4894 avatar andrewm4894 commented on May 2, 2024

For data sources like this - would/could/should we have some sort of example dummy data as a sort of target of what is needed in terms of format or structure before we do any work on it?

I can imagine there will be a lot of issues getting created with source suggestions and it could maybe be useful or help cut down on noise if there was some clear "target templates" or something that people could try stick to?

Still only getting up to speed so apologies if this is already done or perhaps might create too much friction right now - thoughts?

from open-assistant.

yk avatar yk commented on May 2, 2024

For data sources like this - would/could/should we have some sort of example dummy data as a sort of target of what is needed in terms of format or structure before we do any work on it?

probably @lewtun is the person to talk to for this

from open-assistant.

huu4ontocord avatar huu4ontocord commented on May 2, 2024

@SriPrarabdha or @Proteusiq - can we get a sample set of data (< 100) to see if we can convert it into instructions?

from open-assistant.

huu4ontocord avatar huu4ontocord commented on May 2, 2024

@Proteusiq and @SriPrarabdha: checking on status. Thank you!

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

Hej @ontocord

I saw the issue closed, so I assumed that @danielpwarren's way was the path forward. @danielpwarren, do you have the samples? Otherwise, I could extract some from my script tomorrow.

from open-assistant.

huu4ontocord avatar huu4ontocord commented on May 2, 2024

@Proteusiq issue is still open.

from open-assistant.

Anan-Saadi avatar Anan-Saadi commented on May 2, 2024

@Proteusiq is this issue still active?
If so, I'd like to contribute.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

Yes, it is. We are missing the CLI part.

from open-assistant.

Anan-Saadi avatar Anan-Saadi commented on May 2, 2024

@Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell, @danielpwarren has a very sophisticated workflow.

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

@Proteusiq sorry for the late reply, but what exactly is needed for us to produce a usable dataset? From what I can tell, @danielpwarren has a very sophisticated workflow.

Hi, @Anan-Saadi

We are missing two things:

  • converting the data to the correct format
  • a typer CLI which will accept a TOML file of instructions (see the sketch below)

Data Format

  • JSON:
    [{question:
      answer1:
      answer2:
      answer3:},
     {question:
      ....
     }]

CLI

  • accepts a TOML file that contains a list of subreddits and the two downloaded files (submission and comment)

We have started code already; my work just keeps me too busy to complete it...
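A minimal sketch of that CLI; the key names and layout of the TOML file are assumptions, not a decided schema:

# example instructions file, e.g. subreddits.toml:
#
#   subreddits = ["NoStupidQuestions", "AskReddit", "answers"]
#   submission_file = "data/RS_2022-10.zst"
#   comment_file = "data/RC_2022-10.zst"

import tomllib  # Python 3.11+; the third-party `toml` package works on older versions
from pathlib import Path
import typer

app = typer.Typer()

@app.command()
def run(config: Path = typer.Argument(..., help="TOML file with subreddits and dump paths")):
    cfg = tomllib.loads(config.read_text())
    typer.echo(
        f"Extracting {cfg['subreddits']} from {cfg['submission_file']} and {cfg['comment_file']}"
    )
    # ... plug in the filter/match code from the snippets above and write out the JSON ...

if __name__ == "__main__":
    app()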

from open-assistant.

Anan-Saadi avatar Anan-Saadi commented on May 2, 2024

@Proteusiq OK, I'll see what I can do in the coming few days.

from open-assistant.

bitplane avatar bitplane commented on May 2, 2024

Lots of candidate subreddits here:

https://redditlist.com/search?adultfilter=0&searchterm=ask

from open-assistant.

jjmachan avatar jjmachan commented on May 2, 2024

Hey there @Proteusiq @SriPrarabdha, I was wondering if you guys had any updates on this? I would love to help with this too, since I need a similar setup to scale up #1967. I was using PRAW and api.pushshift but reached the limits of how much data we can get from them. Using the files from files.pushshift as you guys discussed seems like a very good idea, and I was wondering if we can work on this together.

@Proteusiq do you have a PR or repo with the code?

from open-assistant.

Proteusiq avatar Proteusiq commented on May 2, 2024

No! The code is in this thread above. ☝️

from open-assistant.

MightEnlightenYou avatar MightEnlightenYou commented on May 2, 2024

So I'm not a dev, but I've been able to scrape some other things in the last few days with the help of Bing.

Why aren't you guys using PRAW, beautifulsoup or Scrapy instead?

And can someone summarize the following: what's the desired output format, what subreddits should be scraped (apart from the ones listed), and what criteria should the scraping have (e.g. top 1000 posts from each subreddit, or posts with specific tags or keywords)?

If someone can answer these things, I think I should be able to do it, provided someone explains where to upload the data, since I expect there's going to be a lot.

from open-assistant.

P1ayer-1 avatar P1ayer-1 commented on May 2, 2024

I have a working version that bypasses the rate limit, provides access to all content (NSFW and non-NSFW) and can scrape whole subreddits. I used scrapy.

Right now the program scrapes a post and then the post's top 5 replies. I am currently working on integrating a database to avoid duplicates.

from open-assistant.

bitplane avatar bitplane commented on May 2, 2024

Why aren't you guys using PRAW, beautifulsoup or Scrapy instead?

Usually you'd want to use an API where possible; it's the more correct and respectful thing to do (it saves rendering resources). Also, API designers have probably thought of all sorts of edge cases that you'd otherwise figure out bit by bit while writing a scraper.

But that said, we've got monthly Reddit dumps on files.pushshift.io that look exhaustive, so it's best to just use them. Not sure if they contain quarantined subreddits, which would be useful for filtering things. But IMO it's best to just transform everything on Reddit into our format in one dataset, make it work in an incremental way as new months are added, and filter it for different uses as needed.
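A rough sketch of that incremental idea, assuming the monthly dumps are stored locally as RS_YYYY-MM.zst / RC_YYYY-MM.zst pairs and processed months are tracked by their output files (all paths and names illustrative):

from pathlib import Path

DATA_DIR = Path("./data")
OUT_DIR = DATA_DIR / "processed"
done = {p.stem for p in OUT_DIR.glob("*.jsonl")}  # months already converted

for rc in sorted(DATA_DIR.glob("RC_*.zst")):
    month = rc.stem.removeprefix("RC_")  # e.g. "2022-10"
    rs = DATA_DIR / f"RS_{month}.zst"
    if month in done or not rs.exists():
        continue
    # ... run the filter/match pipeline from the snippets above on (rs, rc)
    # and write the result to OUT_DIR / f"{month}.jsonl" ...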

from open-assistant.

bitplane avatar bitplane commented on May 2, 2024

Right now the program scrapes a post and then the post's top 5 replies. I am currently working on integrating a database to avoid duplicates.

The API does this, and the files.pushshift.io data has dupes removed. I strongly recommend using that instead if possible

from open-assistant.

P1ayer-1 avatar P1ayer-1 commented on May 2, 2024

I realize now that I did not make the Kaggle dataset public. Here is the proper link: https://www.kaggle.com/datasets/noahpersaud/reddit-confessions-oa
Also, here is AITA: https://www.kaggle.com/datasets/noahpersaud/reddit-amitheasshole-oa

Here is the link to the scraper repo: https://github.com/P1ayer-1/Reddit-Convo-Tree-Builder
I didn't have time to do a proper README, so it isn't great and I might have forgotten stuff. Just let me know if anyone runs into any issues.

Here is the CSV file I used for AITA; it can serve as a demo until the full pushshift CSVs are uploaded.
https://www.kaggle.com/datasets/noahpersaud/176-million-ramitheasshole-submissions

I intend to upload my raw parsed pushshift files on Kaggle soon.

from open-assistant.

namelessperson0 avatar namelessperson0 commented on May 2, 2024

https://commoncrawl.org can also be used to get a Reddit dump.

from open-assistant.

andreaskoepf avatar andreaskoepf commented on May 2, 2024

Closing old data issue that has not been completed by now.

from open-assistant.

