Comments (4)

s2t2 commented on August 22, 2024
-- TODO create table users_and_timeline_texts_sample (something like this)
SELECT
   user_id
   ,count(distinct status_id) as tweet_count
   -- count of retweets (will be at most the tweet count)
   ,count(distinct case when retweeted_status_id is not null then status_id end) as rt_count
   ,min(date(created_at)) as first_tweet_on
   ,max(date(created_at)) as latest_tweet_on

   -- here we are grabbing at most X of the user's tweets at random:
   --,string_agg(t.status_text, '{TWEET_DELIMETER}' ORDER BY rand() LIMIT {int(TWEET_MAX)}) as tweet_texts
   ,string_agg(t.status_text, ' || ' ORDER BY rand() LIMIT 50) as tweet_texts

FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim` t
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
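
The commented-out `string_agg` line suggests the delimiter and the per-user tweet cap are meant to be filled in from Python. A minimal sketch of that templating follows. The function name `build_sample_query` and its `limit` argument are hypothetical, and the defaults simply mirror the hard-coded values in the query above; this is one possible shape, not necessarily how the repo does it.

```python
# Hypothetical helper: render the sampling query from the two template values
# named in the commented-out line (TWEET_DELIMETER and TWEET_MAX).
TABLE = "tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim"


def build_sample_query(tweet_max=50, tweet_delimeter=" || ", limit=10):
    # int() rejects non-numeric values before they reach the SQL text
    tweet_max = int(tweet_max)
    limit = int(limit)
    # the delimiter is pasted inside a single-quoted SQL literal
    if "'" in tweet_delimeter or "\\" in tweet_delimeter:
        raise ValueError("delimiter must not contain quotes or backslashes")
    return f"""
SELECT
   user_id
   ,count(distinct status_id) as tweet_count
   ,count(distinct case when retweeted_status_id is not null then status_id end) as rt_count
   ,min(date(created_at)) as first_tweet_on
   ,max(date(created_at)) as latest_tweet_on
   ,string_agg(t.status_text, '{tweet_delimeter}' ORDER BY rand() LIMIT {tweet_max}) as tweet_texts
FROM `{TABLE}` t
GROUP BY 1
ORDER BY 2 DESC
LIMIT {limit}
"""


query = build_sample_query(tweet_max=50, tweet_delimeter=" || ")
print(query)
```

Because the sample is drawn with `ORDER BY rand()`, each run of the rendered query can return a different set of tweets per user.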

from openai-embeddings-2023.

s2t2 commented on August 22, 2024

We did a one-time bucket transfer from the upstream data collection environment into the shared project, so the models are now accessible to researchers with access to the shared project.

GOOGLE_PROJECT_NAME="tweet-research-shared"
BUCKET_NAME="openai-embeddings-2023-shared"

JiazanShi commented on August 22, 2024

For step 1:
Before sampling user data from 'election_2020_transition_2021_combined', we checked whether the dataset contains duplicates. There are no duplicate records, but we found that some users retweet with the same text. Since we want to use unique text data for training and testing our models, we checked the distribution of the number of tweets and the number of unique tweets per user.

-- check for duplicated text from the same user
SELECT user_id, status_text, COUNT(DISTINCT status_id) AS text_dup_cnts
    FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim`
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 10

--the number of tweets and unique tweets per user
SELECT user_id,
    COUNT(status_text) AS text_cnts,
    COUNT(DISTINCT status_text) AS dedup_text_cnts
    FROM `tweet-research-shared.election_2020_transition_2021_combined.tweets_v2_slim`
    GROUP BY 1

--also check the users in training dataset 
SELECT user_id,
    COUNT(status_text) AS text_cnts,
    COUNT(DISTINCT status_text) AS unique_text_cnts
    FROM `tweet-research-shared.impeachment_2020.tweets_v2`
    WHERE user_id IN ({str(user_list).strip('[]')}) --the user id list in training dataset
    GROUP BY 1
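
The `{str(user_list).strip('[]')}` placeholder in the last query is a Python f-string expression that turns a list into a comma-separated SQL list. A slightly more defensive sketch of the same idea is below. It assumes the ids are plain integers, which the thread does not confirm; `format_id_list` is a hypothetical helper name and the ids are made up.

```python
def format_id_list(user_list):
    """Render ids as a comma-separated SQL list, e.g. for `user_id IN (...)`."""
    if not user_list:
        raise ValueError("user_list is empty; `IN ()` is not valid SQL")
    # int() rejects anything that is not a plain integer id
    return ", ".join(str(int(user_id)) for user_id in user_list)


user_list = [111, 222, 333]  # made-up ids for illustration
where_clause = f"WHERE user_id IN ({format_id_list(user_list)})"
print(where_clause)  # WHERE user_id IN (111, 222, 333)
```

For a list of integers this produces the same text as `str(user_list).strip('[]')`, but it fails loudly on an empty list or a non-integer value instead of emitting broken SQL.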

There is no large difference between the number of tweets and the number of unique texts per user, which means repeated text makes up only a small proportion of our dataset.
[Plot: newplot (15)]

So for the following steps, we will not deduplicate when sampling the user dataset.
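
One way to put a number on "small proportion" is to compare the two per-user counts directly. The sketch below uses made-up rows shaped like the output of the per-user count query (`user_id`, `text_cnts`, `dedup_text_cnts`); the values are illustrative only and are not results from the dataset, and `repeated_share` is a hypothetical helper.

```python
# Made-up (user_id, text_cnts, dedup_text_cnts) rows; not real results.
rows = [
    (1, 100, 95),
    (2, 40, 40),
    (3, 10, 8),
]


def repeated_share(text_cnts, dedup_text_cnts):
    """Fraction of tweets that are extra copies of an already-seen text."""
    if text_cnts == 0:
        return 0.0
    return (text_cnts - dedup_text_cnts) / text_cnts


# Share of repeated text for each user
per_user = {user_id: repeated_share(total, unique) for user_id, total, unique in rows}

# Share of repeated text across all users (duplicates counted within each user)
total_tweets = sum(total for _, total, _ in rows)
total_unique = sum(unique for _, _, unique in rows)
overall = repeated_share(total_tweets, total_unique)

print(per_user)
print(round(overall, 4))
```

If the overall share on the real data is close to zero, skipping deduplication before sampling should change the sampled texts very little.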

s2t2 commented on August 22, 2024

Closed by #28
