Comments (4)

peterjc commented on July 18, 2024

One approach is to give up on removing spam edits, and take the history as is - warts and all. This does at least credit the cleanup work fairly.

Another idea is to do a cleanup after parsing the XML into our intermediate SQLite3 database, where we can add a "hide" boolean. The idea would be that, on hitting an unwanted revision, we consider the next few revisions of that page and look for a revert back to the last clean version. If one is found, we hide all of those revisions of the page.

This would handle:

  • Spam from A
  • More spam from A or A2
  • Revert all from B

And also:

  • Spam from A
  • More spam from A or A2
  • Revert some by B
  • Revert rest by C

The problematic cases would be left in the history and require manual intervention to remove.
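
A minimal sketch of that cleanup pass, under stated assumptions: the table name `revisions` and its columns (`id`, `title`, `username`, `text`, `hide`) are hypothetical and not taken from the real intermediate database, unwanted revisions are identified purely by a set of flagged usernames, and a "revert" is detected by the full text matching the last clean revision exactly.

```python
import sqlite3


def hide_reverted_spam(conn, flagged_users, lookahead=5):
    """Set hide=1 on runs of unwanted revisions that are later fully reverted.

    Assumes a hypothetical table: revisions(id, title, username, text, hide),
    where ascending id gives the revision order within a page.
    """
    titles = [row[0] for row in conn.execute("SELECT DISTINCT title FROM revisions")]
    for title in titles:
        revs = conn.execute(
            "SELECT id, username, text FROM revisions WHERE title = ? ORDER BY id",
            (title,),
        ).fetchall()
        last_clean = None  # text of the most recent revision by an unflagged user
        i = 0
        while i < len(revs):
            rev_id, user, text = revs[i]
            if user not in flagged_users:
                last_clean = text
                i += 1
                continue
            # Unwanted revision: look at the next few revisions for one whose
            # text is identical to the last clean version (a complete revert).
            end = None
            if last_clean is not None:
                for j in range(i + 1, min(i + 1 + lookahead, len(revs))):
                    if revs[j][2] == last_clean:
                        end = j
                        break
            if end is None:
                # No revert found nearby: leave it visible for manual review.
                i += 1
                continue
            # Hide the spam, any partial reverts, and the final revert itself.
            conn.executemany(
                "UPDATE revisions SET hide = 1 WHERE id = ?",
                [(r[0],) for r in revs[i:end + 1]],
            )
            i = end + 1
    conn.commit()
```

Because the match is on the final text rather than on who reverted, this covers both patterns listed above: a single revert by B, and a partial revert by B followed by the rest from C. Anything not restored within `lookahead` revisions is left untouched.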

from mediawiki_to_git_md.

peterjc commented on July 18, 2024

Perhaps the best plan is to leave the unwanted commits in the git history, while giving them a clear comment (e.g. "UNWANTED COMMIT") and perhaps also indicating this in the author field.

They can then be removed fairly easily with git rebase -i ... (thanks to the clear comment) possibly with --empty=drop to automatically drop the simple reverts.
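
One way to avoid editing the rebase todo list by hand is sketched below. It is an illustration rather than part of the tool: the function name `drop_unwanted` and the script name used afterwards are hypothetical, and it assumes every unwanted commit has the marker text in its subject line.

```python
import sys


def drop_unwanted(todo_text, marker="UNWANTED COMMIT"):
    """Return rebase todo text with 'pick' changed to 'drop' on marked commits."""
    out = []
    for line in todo_text.splitlines(keepends=True):
        # Only touch real 'pick' instructions; comment lines start with '#'.
        if line.startswith("pick ") and marker in line:
            line = "drop " + line[len("pick "):]
        out.append(line)
    return "".join(out)


# When run with one argument (the todo file path that git passes to the
# sequence editor), rewrite that file in place.
if __name__ == "__main__" and len(sys.argv) == 2:
    path = sys.argv[1]
    with open(path) as handle:
        original = handle.read()
    with open(path, "w") as handle:
        handle.write(drop_unwanted(original))
```

If saved as, say, drop_unwanted.py, it could be plugged in through git's GIT_SEQUENCE_EDITOR environment variable, with the starting commit left for you to fill in:

$ GIT_SEQUENCE_EDITOR="python drop_unwanted.py" git rebase -i --empty=drop ...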

peterjc commented on July 18, 2024

The interactive rebase is working pretty well to remove spam, coupled with this to find a good preceding commit to start from:

$ git log --oneline | grep -C 1 UNWANTED

This assumes the spam accounts were all flagged before starting the XML dump to git conversion.
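
The same lookup can be scripted. The sketch below is an assumption-laden helper (the name `rebase_base` is hypothetical): given the lines of `git log --oneline`, which lists newest commits first, and assuming a linear history, it returns the abbreviated hash of the commit just before the oldest flagged one.

```python
def rebase_base(log_lines, marker="UNWANTED"):
    """Return the hash of the commit preceding the oldest marked commit.

    log_lines is `git log --oneline` output, newest first. Returns None if
    nothing is marked, or if the oldest marked commit is the root commit.
    """
    lines = [line for line in log_lines if line.strip()]
    flagged = [i for i, line in enumerate(lines) if marker in line]
    if not flagged:
        return None
    oldest = flagged[-1]  # last match in a newest-first listing
    if oldest + 1 >= len(lines):
        return None  # nothing older to start from
    return lines[oldest + 1].split()[0]
```

The returned hash is only a candidate starting point; it is still worth checking the surrounding commits by eye, as with the grep above.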

peterjc commented on July 18, 2024

The default comment from a MediaWiki reversion via the interface seems to start with "Reverted edits by ", so this is also worth reviewing:

$ git log --oneline --grep revert -i

I am finding the rebases quite slow, but it may be partly down to the filesystem (sadly I can't run this locally on a case-insensitive Mac).
