google / project-ocean

Project OCEAN is an open science collaboration focused on understanding open source ecosystems, creating datasets that enable research, and forming a clear understanding of the state of open source communities.

Home Page: https://vermontcomplexsystems.org/partner/OCEAN/

License: Apache License 2.0

Python 61.50% Go 38.50%
angular golang go nodejs python opensource research ecosystems graphnetwork

project-ocean's Issues

Automate Pipermail and Mailman data ingestion

Set up a scheduler (or something similar) to run the pipermail and mailman pipelines and pull mailing list data once a month, after the month has ended.

The date range should run from the first day to the last day of the previous month, and the job should be triggered on the first day of the new month.
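
The date range described above can be computed with the standard library alone. This is a minimal sketch, not code from the repository; the function name `previous_month_range` is hypothetical, and how the scheduler passes the dates to the pipelines is left open.

```python
from datetime import date, timedelta
from typing import Tuple


def previous_month_range(today: date) -> Tuple[date, date]:
    """Return the first and last day of the month before `today`."""
    first_of_this_month = today.replace(day=1)
    # Stepping back one day from the 1st always lands on the last day
    # of the previous month, whatever its length.
    last_of_previous = first_of_this_month - timedelta(days=1)
    first_of_previous = last_of_previous.replace(day=1)
    return first_of_previous, last_of_previous


# A run triggered on 1 November 2022 covers all of October 2022.
start, end = previous_month_range(date(2022, 11, 1))
print(start.isoformat(), end.isoformat())  # 2022-10-01 2022-10-31
```

Deriving the range from the trigger date, rather than hard-coding month lengths, keeps January runs (which reach back into December of the prior year) and leap-year Februaries correct.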

This will then connect to the work on Issue #26 to move the data into BQ.

Archiving mailing list data pipelines

OCEAN currently doesn't have any active users of this research dataset. The code may still be useful, but it is not going to be actively maintained.

Additionally, given ongoing dependency updates (e.g. #96) and issues with some data sources over time (#94), these pipelines will be moved to an archive folder.

Corruption in Pipermail Files | Special Character Preservation

Expected Behavior

Preserve special characters.

Actual Behavior

It appears that the files didn't fully preserve special characters. The files may just be ASCII, using literal '?'s for non-ASCII characters. Further investigation is needed to determine how much of an issue this is and how to resolve it.

Example

One file I'm using as a check is pipermail-python-list-gzip/2011-February.txt.gz. If you search for the name "Westley" in that file, you'll see lines like "Westley Mart?nez wrote:", with (apparently) a literal "?". The header lines in the file, such as "From: anikom15 at gmail.com (Westley =?ISO-8859-1?Q?Mart=EDnez?=)", are treated differently: those are decodable.

Investigation confirmed that these malformations are in the original zipped files that were on the site.
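
The difference between the two kinds of line can be illustrated with Python's standard `email.header` module. This is only a sketch: the header string is the one quoted above, while the lossy ASCII conversion is an assumption about how the "?" might have been produced upstream, which the investigation does not establish.

```python
from email.header import decode_header, make_header

# Header value quoted in the issue: the RFC 2047 encoded-word names its
# charset, so the original character can be recovered.
raw_header = "Westley =?ISO-8859-1?Q?Mart=EDnez?="
decoded_name = str(make_header(decode_header(raw_header)))

# Assumed mechanism for the body lines: a lossy conversion to ASCII swaps
# each non-ASCII character for a literal "?". After that, the original
# character cannot be recovered from the text alone.
lossy_body = "Westley Mart\u00ednez wrote:".encode("ascii", errors="replace").decode("ascii")

print(decoded_name)
print(lossy_body)  # Westley Mart?nez wrote:
```

If the bodies really were flattened this way before being archived, the practical option is to take display names from the decodable headers rather than trying to repair the body text.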

Data Analysis and Research

Expected Behavior

Data analysis using the existing datasets to show the contributions to these ecosystems and their impact.

Actual Behavior

Nothing built specifically at this time. Blank canvas that welcomes someone to contribute.

This is a very general issue and needs to be broken down into smaller chunks. Starting place is data scripts for analysis and reporting.

Security Policy violation Binary Artifacts

This issue was automatically created by Allstar.

Security Policy Violation
Project is out of compliance with Binary Artifacts policy: binaries present in source code

Rule Description
Binary Artifacts are an increased security risk in your repository. Binary artifacts cannot be reviewed, allowing the introduction of possibly obsolete or maliciously subverted executables. For more information see the Security Scorecards Documentation for Binary Artifacts.

Remediation Steps
To remediate, remove the generated executable artifacts from the repository.

Artifacts Found

  • 2-transform-data/cloud_func_bq_ingest/__pycache__/msgs_storage_bq.cpython-38.pyc

Additional Information
This policy is drawn from Security Scorecards, which is a tool that scores a project's adherence to security best practices. You may wish to run a Scorecards scan directly on this repository for more details.


Allstar has been installed on all Google managed GitHub orgs. Policies are gradually being rolled out and enforced by the GOSST and OSPO teams. Learn more at http://go/allstar

This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Reload data into GCS and BQ

Clean up data load

  • Move all mailing list content into a specific GCS bucket
  • Fix golang file names to drop gg
  • Add groupname to filename for files in the mailing list folders
  • Load all mailing list content into BQ again because the structure has changed

Include Google Groups original message url

Expected Behavior

Pull out url where the data originated from for pipermail and mailman and put in BQ

Actual Behavior

Currently the urls used to get the data are not stored and this would be good for reference

Ideas for how to fix

  • Include the url in the filename (the syntax is an issue)
  • Open files and append url on the end

Add original content url to pipermail and mailman data

Expected Behavior

Pull out url where the data originated from for pipermail and mailman and put in BQ

Actual Behavior

Currently the urls used to get the data are not stored and this would be good for reference

Ideas for how to fix

  • Include the url in the filename (the syntax is an issue)
  • Open files and append url on the end
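
The second idea, appending the URL to the end of each file, could look roughly like the sketch below. Everything here is an assumption for illustration: the `X-Source-Url:` trailer, both function names, and the premise that the file is an uncompressed text archive (gzipped files would need to be rewritten rather than appended to).

```python
from pathlib import Path
from typing import Optional

# Hypothetical trailer prefix; the issue does not define a format.
SOURCE_URL_PREFIX = "X-Source-Url: "


def append_source_url(path: Path, url: str) -> None:
    """Append one trailer line recording where the archive was downloaded from."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"\n{SOURCE_URL_PREFIX}{url}\n")


def read_source_url(path: Path) -> Optional[str]:
    """Return the most recently appended source URL, or None if absent."""
    lines = path.read_text(encoding="utf-8").splitlines()
    for line in reversed(lines):
        if line.startswith(SOURCE_URL_PREFIX):
            return line[len(SOURCE_URL_PREFIX):]
    return None
```

A trailer avoids the filename syntax problem noted in the first idea, at the cost of the loader having to strip that line before parsing messages.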

Google Groups formatting changed, unit test issues

TL;DR: No Google Groups ingestion currently because of changes to Google Groups, causing scraping code to fail.

Discovered while trying to update dependencies.

Zero topics

Monthly pipeline processing was showing 0 topics returned:

2022/11/01 08:01:32 GOOGLEGROUPS loading golang-checkins:
2022/11/01 08:01:32 All topics captured: total topics captured are 0.

Checking the Go code for how topic counts are captured, the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).

E.g. https://groups.google.com/g/golang-checkins shows 1–30 of 81553 (specifically, – is \u2013 EN DASH). The regex in getTotalTopics specifies - (\u002D HYPHEN-MINUS).

Because the topic counts are 0, this is affecting loops later on (in my estimation).
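
One way to make the count extraction tolerant of either dash is a character class that lists both. The sketch below is in Python for brevity, whereas getTotalTopics is Go code; the pattern is hypothetical and is not the regex used in the repository, which is not shown in this issue.

```python
import re
from typing import Optional

# Hypothetical pattern: the character class accepts both HYPHEN-MINUS (U+002D)
# and EN DASH (U+2013) between the two range numbers.
TOTAL_TOPICS_RE = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)\s+of\s+(\d+)")


def total_topics(page_text: str) -> Optional[int]:
    """Return the total from text such as '1–30 of 81553', or None if absent."""
    match = TOTAL_TOPICS_RE.search(page_text)
    return int(match.group(3)) if match else None


print(total_topics("1\u201330 of 81553"))  # 81553
print(total_topics("1-30 of 81553"))       # 81553
```

Returning None when nothing matches, rather than 0, would also let the caller tell "no topics" apart from "the page format changed", which is the failure described here.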

Nest unit tests

Additionally, when trying to run the unit tests, it appears that running just mailinglists/ doesn't run the tests in the nested mailing list directories, so the unit tests for googlegroups weren't being run (and are currently breaking).

Failing topic unit tests

Now running the unit tests:

=== RUN   TestTopicIDToRawMsgUrlMap/Pull_topic_ids_for_date
2022/11/15 22:40:43 No message ID found in topicId: 8sv65_WCOS4.
    googlegroups_data_test.go:300: Result response does not match.
         got: map[2018-09.txt:[]]
        want: map[2018-09.txt:[https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ]]

Infinite redirects

This URL is no longer a valid URL format, as trying to curl it gets stuck in an infinite 301 redirect loop:

$ curl https://groups.google.com/forum/message/raw\?msg\=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ
<HTML>
<HEAD>
<TITLE>Moved Permanently</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Permanently</H1>
The document has moved <A HREF="https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ">here</A>.
</BODY>
</HTML>

Summary

This is going to take some re-engineering to work out what's changed in the Google Groups format to bring this code back to working.

Improve To parsing in the mailing list data that is loaded to BQ

Expected Behavior

The mailing list To field should be populated with the person that the email is responding to.

Actual Behavior

In the python_mailinglist table there are some messages where the recipient shows up in the body but does not populate the To field.

Body: "B Zy < zy at gmail.com> wrote:

Hello
Help my code."

The To field is capturing the mailing list name instead.

Steps to Reproduce the Problem

  1. Review python mailing list examples
  2. Improve parsing in the extract_msgs script, probably a regex for the body
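
A body regex along the lines suggested in step 2 might look like the sketch below. The pattern and the function name `replied_to` are hypothetical and are not taken from the extract_msgs script; the pattern only covers attribution lines shaped like the example above ("Name <address> wrote:").

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for attribution lines such as
# "B Zy < zy at gmail.com> wrote:" at the start of a line in the body.
ATTRIBUTION_RE = re.compile(
    r"^(?P<name>[^<\n]+?)[ \t]*<[ \t]*(?P<addr>[^>\n]+?)[ \t]*>[ \t]*wrote:",
    re.MULTILINE,
)


def replied_to(body: str) -> Optional[Tuple[str, str]]:
    """Return (name, address) from the first attribution line, or None."""
    match = ATTRIBUTION_RE.search(body)
    if match is None:
        return None
    return match.group("name").strip(), match.group("addr")


body = "B Zy < zy at gmail.com> wrote:\n\nHello\nHelp my code."
print(replied_to(body))  # ('B Zy', 'zy at gmail.com')
```

Mail clients use many other attribution formats (for example "On <date>, <name> wrote:"), so a fallback to the existing To value would still be needed when nothing matches.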

List datasets in Repo

List datasets for the ecosystems that we have found and have interest in assessing.

Google Groups not loading all topics | Topic hidden because it was flagged

UPDATE: The messages that were not loading turned out to be the ones reported for abuse and hidden. A topic ID and message ID are found for these messages, but there is no date from which to generate the filename. The fix is to create an abuse.txt catch-all filename and add it to the table structure in BigQuery.
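
The catch-all idea can be sketched as a filename function with a fallback. This is written in Python for illustration, whereas the Google Groups capture code is Go. The function name is hypothetical, and the YYYY-MM.txt pattern is an assumption inferred from filenames such as 2018-09.txt in the project's test output.

```python
from datetime import date
from typing import Optional

ABUSE_FILENAME = "abuse.txt"  # catch-all name mentioned in the update above


def message_filename(posted: Optional[date]) -> str:
    """Return 'YYYY-MM.txt' for dated messages, or the catch-all when no date exists."""
    if posted is None:
        # Hidden/flagged topics expose a topic ID and message ID but no date.
        return ABUSE_FILENAME
    return f"{posted.year:04d}-{posted.month:02d}.txt"


print(message_filename(date(2018, 9, 14)))  # 2018-09.txt
print(message_filename(None))               # abuse.txt
```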

Expected Behavior

All topics should be captured from Google Groups

Actual Behavior

The capture comes up short by fewer than 100 topics. This is potentially an issue in the goroutine.

Steps to Reproduce the Problem

  1. Run capture for angular or nodejs google groups and look at the total expected vs actual reported

This was an issue before, and it had to do with the goroutine collapsing the content, but it's unclear where the miss is now.

Handle corrupted content in mailman mailing list data

Expected Behavior

Load all mailman mailing list text

Actual Behavior

The load errored out on a post from March 9th, 2002 in the python-dev list because it had diamond question marks in the content.

Steps to Reproduce the Problem

  1. Add code to catch unknown content and parse around it to pull the uncorrupted text
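
One possible shape for this step is to decode with a tolerant error handler, so that one bad byte sequence does not abort the whole load. This is a minimal sketch, assuming the failure is a decoding error and that the "diamond question marks" correspond to U+FFFD REPLACEMENT CHARACTER; the function name is hypothetical.

```python
from typing import Tuple

REPLACEMENT_CHAR = "\ufffd"  # usually rendered as a diamond with a question mark


def decode_tolerant(raw: bytes, encoding: str = "utf-8") -> Tuple[str, int]:
    """Decode bytes without raising.

    Returns the text and the number of U+FFFD characters in it. That count
    includes replacement characters that were already present in the source.
    """
    text = raw.decode(encoding, errors="replace")
    return text, text.count(REPLACEMENT_CHAR)


text, damaged = decode_tolerant(b"caf\xe9 ok")
print(repr(text), damaged)  # 'caf\ufffd ok' 1
```

Keeping the count makes it possible to flag affected messages in BQ instead of silently loading damaged text.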

Fix Python BQ script date timezone handling

Expected Behavior

The parse_datestring function will handle all timezones regardless of format and parse them correctly.

Actual Behavior

parse_datestring only handles timezones when a numeric offset is provided. It loses information when the timezone is given as letters or words, for example:

Wed, 25 Oct 2006 19:21:24 GMT

Test output should be 2006-10-25 21:21:24
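
For reference, the standard library's `email.utils.parsedate_to_datetime` already understands numeric offsets and a small set of zone abbreviations such as GMT, UT and EST. The sketch below is not the repository's parse_datestring; the name `parse_datestring_utc` is hypothetical and it normalizes everything to UTC. Under that assumption the GMT example yields 19:21:24, so the sketch does not attempt to reproduce the target zone implied by the expected output quoted above. Zone names outside that set are not covered.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_datestring_utc(value: str) -> datetime:
    """Parse an RFC 2822 style date string and normalize it to UTC."""
    parsed = parsedate_to_datetime(value)
    if parsed.tzinfo is None:
        # No usable zone information: treat the timestamp as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


print(parse_datestring_utc("Wed, 25 Oct 2006 19:21:24 GMT"))
print(parse_datestring_utc("Wed, 25 Oct 2006 19:21:24 +0200"))
```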

Add python tests to CI/CD

Expected Behavior

Python tests run in GitHub actions

Actual Behavior

Tests currently not automated to run
