TL;DR: Google Groups ingestion is currently broken: changes to Google Groups cause the scraping code to fail.
Discovered while trying to update dependencies.
Zero topics
Monthly pipeline processing was showing 0 topics returned:
2022/11/01 08:01:32 GOOGLEGROUPS loading golang-checkins:
2022/11/01 08:01:32 All topics captured: total topics captured are 0.
Checking the Go code that captures topic counts shows that the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).
E.g. https://groups.google.com/g/golang-checkins shows "1–30 of 81553" (specifically, the "–" is \u2013 EN DASH). The regex in getTotalTopics specifies "-" (\u002D HYPHEN-MINUS).
So because the topic count is 0, it is (in my estimation) affecting the loops later on.
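As a minimal sketch of one possible fix, the pattern could accept either dash character. The regex, the helper name parseTotalTopics, and the assumed "N–M of TOTAL" text shape below are illustrative assumptions, not the actual contents of getTotalTopics:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// totalTopicsRe is a hypothetical pattern (not the one in getTotalTopics).
// The character class accepts HYPHEN-MINUS (U+002D) or EN DASH (U+2013)
// between the two range bounds, then captures the total after " of ".
var totalTopicsRe = regexp.MustCompile(`(\d+)[-\x{2013}](\d+) of (\d+)`)

// parseTotalTopics extracts the total topic count from text such as
// "1–30 of 81553". It returns false when the text does not match.
func parseTotalTopics(s string) (int, bool) {
	m := totalTopicsRe.FindStringSubmatch(s)
	if m == nil {
		return 0, false
	}
	total, err := strconv.Atoi(m[3])
	if err != nil {
		return 0, false
	}
	return total, true
}

func main() {
	// Try both the en dash form seen in the current UI and the older hyphen form.
	for _, s := range []string{"1\u201330 of 81553", "1-30 of 81553"} {
		total, ok := parseTotalTopics(s)
		fmt.Println(s, total, ok)
	}
}
```

Returning an explicit "not matched" flag, rather than silently reporting 0, would also make a future UI change show up as an error instead of an empty ingestion run.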
Nested unit tests
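Additionally, when trying to run the unit tests, it appears that running just mailinglists/ doesn't run the nested mailing list packages, so the unit tests for googlegroups weren't being run (and are currently failing). With go test, a plain directory path covers only that one package; the ./mailinglists/... pattern is needed to include its subpackages.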
Failing topic unit tests
Now running the unit tests:
=== RUN TestTopicIDToRawMsgUrlMap/Pull_topic_ids_for_date
2022/11/15 22:40:43 No message ID found in topicId: 8sv65_WCOS4.
googlegroups_data_test.go:300: Result response does not match.
got: map[2018-09.txt:[]]
want: map[2018-09.txt:[https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ]]
Infinite redirects
This URL format is no longer valid: curling it returns a 301 whose target is the same URL, so following the redirect loops forever:
$ curl https://groups.google.com/forum/message/raw\?msg\=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ
<HTML>
<HEAD>
<TITLE>Moved Permanently</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Permanently</H1>
The document has moved <A HREF="https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ">here</A>.
</BODY>
</HTML>
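A scraper could detect this case up front instead of silently getting no data. The following is a rough sketch using net/http's CheckRedirect hook; the names errRedirectLoop, isRedirectLoop and newSelfRedirectServer are hypothetical, and the local httptest server is only a stand-in that imitates the self-redirect shown above, not the real Google Groups endpoint:

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// errRedirectLoop is a hypothetical sentinel error for this sketch.
var errRedirectLoop = errors.New("redirect loop: target already visited")

// newLoopCheckingClient returns a client that aborts as soon as a redirect
// points at a URL already requested earlier in the same chain.
func newLoopCheckingClient() *http.Client {
	return &http.Client{
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			for _, prev := range via {
				if prev.URL.String() == req.URL.String() {
					return errRedirectLoop
				}
			}
			// Keep a cap on long, non-looping chains as well.
			if len(via) >= 10 {
				return errors.New("stopped after 10 redirects")
			}
			return nil
		},
	}
}

// isRedirectLoop reports whether fetching rawURL runs into a redirect loop.
func isRedirectLoop(rawURL string) (bool, error) {
	resp, err := newLoopCheckingClient().Get(rawURL)
	if errors.Is(err, errRedirectLoop) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	resp.Body.Close()
	return false, nil
}

// newSelfRedirectServer starts a local stand-in server that answers every
// request with a 301 pointing back at the requested URL.
func newSelfRedirectServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, r.URL.String(), http.StatusMovedPermanently)
	}))
}

func main() {
	srv := newSelfRedirectServer()
	defer srv.Close()

	loop, err := isRedirectLoop(srv.URL + "/forum/message/raw?msg=example")
	fmt.Println("redirect loop:", loop, "error:", err)
}
```

Surfacing the loop as a distinct error would let the pipeline log the obsolete URL format explicitly rather than returning an empty result for the month.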
Summary
Getting this code working again will take some re-engineering to work out what has changed in the Google Groups format.