
openrefine.org's Introduction

OpenRefine


OpenRefine is a Java-based power tool that allows you to load data, understand it, clean it up, reconcile it, and augment it with data coming from the web. All from a web browser and the comfort and privacy of your own computer.

Official website: https://openrefine.org

Community forum: https://forum.openrefine.org

Download

Snapshot releases

You can download snapshots of the development version of OpenRefine. To do so, you need to be logged in to GitHub. Then, click on the first item with a green tick / check mark on this page and scroll down to the Artifacts section to find the version that matches your operating system.

Run from source

If you have cloned this repository to your computer, you can run OpenRefine with:

  • ./refine on Mac OS and Linux
  • refine.bat on Windows

This requires JDK 11 or newer, Apache Maven and NPM 16 or newer.

Documentation

Contributing to the project

Contact us

Licensing and legal issues

OpenRefine is open source software licensed under the BSD license, which can be found in LICENSE.txt. See the licenses folder for information on the open source libraries that OpenRefine depends on.

Credits

This software was created by Metaweb Technologies, Inc. and originally written and conceived by David Huynh. Metaweb Technologies, Inc. was acquired by Google, Inc. in July 2010 and the product was renamed Google Refine. In October 2012, it was renamed OpenRefine as it transitioned to a community-driven project.

Since 2020, OpenRefine has been fiscally sponsored by Code for Science and Society (CS&S).

See CONTRIBUTING.md for instructions on how to contribute yourself.

openrefine.org's People

Contributors

abbe98, allanaaa, amyra98, antoine2711, atescomp, cooperzoe, dependabot[bot], elebitzero, felixlohmeier, hdevine825, kushthedude, labiang, lydiaofficial, magdmartin, mareksuchanek, mkcor, mpparsley, nilswindisch, ostephens, prayagverma, rahasoleymanzadeh, robertgarrigos, rubenverborgh, tfmorris, thadguidry, trnstlntk, vdk, vitaly-zdanevich, vladimiralexiev, wetneb


openrefine.org's Issues

Create an openrefine-announce mailing list

The current OpenRefine blog is basically a giant black hole because there's no way to subscribe to it so posts which are important to the community, like the recent appointment of a new steering committee member, end up never getting seen. It would be worth considering establishing a low volume, post only, opt-in announcement mailing list that people can subscribe to.

Allow regular Wikimedians to edit Wikidata or Wikimedia accounts with their regular accounts, without requiring bot passwords

Proposal

@wetneb stated: IMHO doing away with the pop-up entirely, perhaps replacing it with some incentive to use bot passwords in the login form, would be a great move already.

@trnstlntk stated: I would suggest a new page on docs.openrefine.org dedicated to signing into Wikibase (merging the existing docs (1/2)) and replacing the two existing paragraphs with something pointing to the new page.
"Do you have trouble logging in?", "Safer login methods", or something combining those messages.

Original Comment

I notice during tests and demos that 'regular' Wikimedians (the kind of people who, like me, usually only do small batches and who don't write or operate bots) get confused when OpenRefine asks them to provide or obtain bot passwords, or to log in with an owner-only consumer.
Can we simplify this process, or make this even unnecessary somehow?

Additional context

I'm creating this issue because at some point I got the following feedback in my log:

19:20:19.823 [..kibaseapi.ApiConnection] API warning [main]: Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application. (2283ms)
19:20:19.824 [..kibaseapi.ApiConnection] API warning [login]: Main-account login via "action=login" is deprecated and may stop working without warning. To continue login with "action=login", see [[Special:BotPasswords]]. To safely continue using main-account login, see "action=clientlogin". (1ms)

and while I'm not well-versed in this stuff, it looks as if "action=clientlogin" is something worth looking into?

Linux Install Instructions Enhancement

There is a lack of clear instructions on installing and running OpenRefine on Linux. The current documentation for Linux installation only states that the user should download the tar.gz file and extract it with tar xzf openrefine-linux-3.4.tar.gz

Proposed solution

At the very least, the current documentation for Linux installation should be amended, as the command tar xzf openrefine-linux-3.4.tar.gz is outdated. Note that its naming format differs from the current convention: openrefine-linux-3.4.tar.gz instead of openrefine-3.6.2-linux.tar.gz. A solution would be to replace that command with tar xzf openrefine-3.6.2-linux.tar.gz.

Additionally, I believe that OpenRefine's users would benefit from instructions on how to run the program from the command line. This could take the form of the following:

Use the cd command to navigate to the directory housing OpenRefine

brendan@librarium:~$ cd Documents/openrefine-3.6.2/
brendan@librarium:~/Documents/openrefine-3.6.2$ 

Run OpenRefine by entering ./refine

brendan@librarium:~/Documents/openrefine-3.6.2$ ./refine

OpenRefine will start.

Document which operations need a lot of RAM in the new architecture

In the new architecture (4.0 branch), some operations or importers have been optimized so that they do not require a lot of RAM even if the dataset is large. However as a user it is not necessarily clear which operations can safely be used in pipelines meant to run on large datasets. In some cases it might be possible for the user to find another way to carry out a transformation (or even do it externally). For this to be doable, they must be able to understand which operations should be avoided.

This applies to operations, but also importers and exporters.
A given operation / importer / exporter can be efficient in some settings, and inefficient in others. For instance the CSV importer is efficient if the multiLine option is set to true but inefficient if multiLine is set to false and escaping is enabled.

Proposed solution

  • Document scalability of each of these components in the official manual
  • Consider adding some warnings to the UI before triggering an operation which might require a lot of RAM (perhaps not so easy?)

Alternatives considered

Not sure?

Use standard logo as header

The version of the OpenRefine logo that is used on this website (wireframe version of the diamond) is not used anywhere else. I think OpenRefine's official website should use OpenRefine's official logo! If it does not fit with the rest of the design, then the rest of the design is wrong.

Migrate content from MediaWiki bot page on wiki to user manual

The new user manual has details about creating a bot for Wikidata: https://docs.openrefine.org/manual/wikidata#manage-wikidata-account

There is a page on the wiki that gives more general information about creating a bot account for any Wikibase instance https://github.com/OpenRefine/OpenRefine/wiki/MediaWiki-Bot-Passwords

The more general information should be migrated across to the new user manual and then the MediaWiki-Bot-Passwords page in the wiki can be deprecated (replace the content with a link to the relevant page in the new user manual)

Documentation edit process with versioned docs

We have a problem with the way we edit our documentation.

Docusaurus includes links from the documentation to GitHub so that a given page can be edited directly. That is very convenient, however it does not tie in very well with versioning. By default, the docs that are visible at docs.openrefine.org are the docs for the latest stable version, and so the edit link points to the corresponding versioned page:

https://docs.openrefine.org/manual/starting links to https://github.com/OpenRefine/OpenRefine/edit/master/docs/versioned_docs/version-3.5/manual/starting.md

However, it is important that the "current" docs are updated too, since they will become the stable docs at the next release.
Not doing so means that the changes will disappear from docs.openrefine.org at the next release.

This is affecting the following PRs:

I will check if other PRs have been affected and backport the changes to the 3.5 and current docs.

This makes me wonder:

  • The technical reference probably does not need this versioning mechanism. Is there a way we could exclude it from the versioning mechanism?
  • Maybe the versioning we are doing now is a bit too aggressive: there are not so many user-visible changes between 3.4 and 3.5 (the main part is the Wikibase integration), perhaps we should rather just have separate docs for 3.x and 4.x?
  • Could we have some CI mechanism to flag changes which only edit versioned docs and not the current docs?
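The CI check suggested in the last point could look roughly like the sketch below. It is only an illustration, not an existing OpenRefine script: the function names are hypothetical, and it assumes the "current" docs live under docs/docs while released versions live under docs/versioned_docs/version-X.Y, as the edit link quoted above suggests. Given the list of files changed in a PR, it reports versioned pages whose counterpart in the current docs was not touched.

```python
from pathlib import PurePosixPath

# Assumed repository layout (not confirmed by the issue text):
#   docs/versioned_docs/version-<X.Y>/<page>  -> docs of a released version
#   docs/docs/<page>                          -> "current" (development) docs
VERSIONED_ROOT = PurePosixPath("docs/versioned_docs")
CURRENT_ROOT = PurePosixPath("docs/docs")


def current_counterpart(path):
    """Return the 'current' docs path matching a versioned docs path, or None."""
    try:
        rel = PurePosixPath(path).relative_to(VERSIONED_ROOT)
    except ValueError:
        return None  # not located under the versioned docs folder
    if len(rel.parts) < 2 or not rel.parts[0].startswith("version-"):
        return None
    # Drop the leading "version-X.Y" component and re-root under the current docs.
    return str(CURRENT_ROOT.joinpath(*rel.parts[1:]))


def versioned_only_changes(changed_paths):
    """List versioned pages that changed without their current counterpart."""
    changed = set(changed_paths)
    flagged = []
    for path in sorted(changed):
        counterpart = current_counterpart(path)
        if counterpart is not None and counterpart not in changed:
            flagged.append(path)
    return flagged
```

A CI job could feed this function the output of a changed-files listing for the PR and post a warning when the returned list is not empty.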

Remove Disqus comments

  • They are not actually used much
  • They use a third-party service (with all privacy implications this entails)
  • Disqus uses them to embed unrelated content
  • They do not look professional in my opinion

Migrate from Maruku to Kramdown

When updating the website I received the following email warning:

Your site is using Maruku, the default Markdown interpreter. Maruku is now obsolete and may cause builds to fail for sites with invalid Markdown or HTML. See https://help.github.com/articles/migrating-your-pages-site-from-maruku for more information on upgrading to a newer Markdown interpreter.

We should do the migration on a separate fork and test it before merging it back here.

Misalignment in book topics on website

I visited OpenRefine's website and saw this:

image

which probably only requires a simple fix. However, I will be going to bed soon so I thought I'd better post this here first before I forget. Feel free to fix this in the meantime.

Update Support/Sponsors and make it a list

We should make a separate div list element with logos and dates next to each for our sponsors.
Or just have a Sponsors page separately...dunno...

Something like this for now...

Sponsors:

  • 2020 Chan Zuckerberg Initiative
  • 2018 Google News Initiative

Refresh website

We would appreciate a PR to refresh the current website with new CSS and layout.
We would like to stick with GitHub Pages and Jekyll. Content remains the same (i.e. same pages, same text); we want to focus only on the style.

Quote from @thadguidry:

This would be an awesome portfolio opportunity for some teenager somewhere in the world :)

Unicode support on Regexp

Hello,

As a user, I want regex facets to match accented characters within words. Applying the regex modifier 'u' would be a solution that simplifies my life. Strangely, it is possible to work around this limitation by using Jython when creating columns, but that remains cumbersome because you have to code each transformation rather than use a simple regex modifier.

As a solution, I wonder whether the u modifier could be applied by default to projects based on UTF files. For the moment, this is very annoying.
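To illustrate the difference a Unicode-aware modifier makes, here is a small standalone Python 3 sketch. It does not show how OpenRefine's facets or its bundled Jython behave; it only contrasts Unicode-aware and ASCII-only matching of \w in Python's re module. The function names are made up for the example.

```python
import re


def words_unicode(text):
    """Split into words with Unicode-aware \\w (the Python 3 default for str)."""
    return re.findall(r"\w+", text)


def words_ascii(text):
    """Split into words with ASCII-only \\w, which breaks on accented letters."""
    return re.findall(r"\w+", text, flags=re.ASCII)


sample = "caf\u00e9 cr\u00e8me"  # "café crème", written with escapes
print(words_unicode(sample))  # accented letters stay inside the words
print(words_ascii(sample))    # words are cut at each accented letter
```

With ASCII-only matching, each accented letter acts as a word boundary, which is the kind of behavior the requested modifier is meant to avoid.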

Best regards,

Blog post asking feedback on moving OpenRefine's mailing lists and Gitter to Discourse

We are considering moving OpenRefine's mailing lists and Gitter to a web-based Discourse forum, and invite community feedback. A message will be sent to OpenRefine's mailing lists on September 20, 2022. But I also want to publish a blog post (with the same message) on that same day, so that we can tweet about it and generally invite feedback from those community members who are not on our mailing list.

Documentation: version only by major version

We currently offer separate documentation for each minor release of the tool (3.4, 3.5, …).
This has a few downsides:

  • when making improvements to the docs, we often need to edit multiple copies of the docs, which is cumbersome
  • not much is changing between each minor version anyway

At the same time, we do not have a good mechanism to publish the documentation of the 4.0 branch.
If we create a 4.0 version in Docusaurus, it will be the one displayed by default to the users (because it is the latest one), but since we have not published a stable release in that branch, this is undesirable (people should see 3.x first).

Proposed solution

Switch to only one set of documentation per major version instead. This means that at the moment, we would have a version for OpenRefine 3, and the development docs would be for OpenRefine 4.

Alternatives considered

I am not sure how else we could publish docs for 4.0 with the current system.

Additional context

Also, that solution would mean that we would need to edit the docs for OpenRefine 4 on the master branch, which might be a bit confusing. If we also want to migrate our openrefine.org website to Docusaurus, perhaps this is a sign we should be using a different repository after all?

Document versions of OpenRefine we support

We will need to support both 4.x and 3.x versions of OpenRefine in the future. That means being able to handle documentation and security patches, perhaps using GitHub's Security Policy doc as I mentioned in PR OpenRefine/OpenRefine#2048 (now closed), with granular semantic versions (major.minor.patch).

@wetneb had suggested previously in that PR that we should probably do the following:

Given our current workflows we are basically not able to release fixes for any previous version. We would need to use branches for the major versions we want to support on the long term. That's definitely something we can consider. Each security patch would be merged in all supported major release branches to create the corresponding versions. That would probably mean using more granular release numbers (such as 3.2.0).

  • Verify that our chosen documentation system has support for Semantic Versioning OpenRefine/OpenRefine#2273
  • Decide on supported versions and branching
  • Verify that GitHub's Security Policy doc works on those supported branches.
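As a rough illustration of the branching idea quoted above, the following Python sketch computes which patch release a security fix would produce on each supported major branch. It is not part of OpenRefine's release tooling: the function names, the version list and the set of supported majors are all assumptions made for the example, and it only handles plain major.minor.patch strings.

```python
def parse_version(text):
    """Parse a 'major.minor.patch' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)


def next_patch_releases(released, supported_majors):
    """Map each supported major version to the next patch release number.

    For every supported major, the latest released version is found and
    its patch component is incremented by one.
    """
    latest = {}
    for text in released:
        version = parse_version(text)
        major = version[0]
        if major in supported_majors and version > latest.get(major, (-1, -1, -1)):
            latest[major] = version
    return {
        major: f"{v[0]}.{v[1]}.{v[2] + 1}"
        for major, v in sorted(latest.items())
    }


# Hypothetical release history, used only to show the call.
print(next_patch_releases(["3.4.1", "3.5.0", "3.5.2", "4.0.0"], {3, 4}))
```

Unsupported majors are simply ignored, so deciding on the supported set (the second point above) is what determines how many branches each security patch has to be merged into.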

Document the workaround steps for running OpenRefine on OS's that block running unsigned apps

Until we solve OpenRefine/OpenRefine#3003 and OpenRefine/OpenRefine#4568 we should document workaround steps.

Some operating systems, such as MacOS and Windows, require users to take additional steps to acknowledge that they are installing applications downloaded from untrusted sources, or unsigned applications.

Proposed solution

Until OpenRefine has the capability to be considered an approved signed application for MacOS and Windows, we should:

  • provide workaround steps in our docs for running OpenRefine.
  • As a user, test on MacOS and Windows 10/11 that the workaround steps allow OpenRefine to be run.

Alternatives considered

Additional context

For example, here is a screenshot that could be added to our docs for Windows. When running OpenRefine 3.5.2 (openrefine.exe) for the first time on Windows 11, the user needs to click on More info and then click the Run anyway button that appears.

image

Docs: Deduplicate browser lists in manual/{installing,running}.md

Both files contain the same section about compatible browsers, see https://github.com/OpenRefine/OpenRefine/pull/5231/files for example.

Proposed solution

… have a page on browser compatibility which is linked from both locations?

Alternatives considered

Define either one to be the single source of truth, and replace the 2nd list with a link to the 1st.

Additional context

This is a follow-up to an open discussion in the above-mentioned PR.

Fix 2 deficiencies in Documentation

There are two features which as a new user I didn't really understand because they weren't on the "Documentation For Users" page.

  1. Row mode vs Record mode. I finally figured it out thanks to this article
  2. The difference between null and the empty string ("") -- I still don't understand this. After reading as much as I could stomach of OpenRefine/OpenRefine#820, OpenRefine/OpenRefine#1544, OpenRefine/OpenRefine#1571, I still really don't understand the fundamental difference in how they are represented and how they change the functionality of OpenRefine (e.g. during export). I have also found out that it seems easy enough to convert from null --> "" with the coalesce() function, but I don't know if it's possible to go from "" --> null or if I would ever even want to do so.

Thanks so much for an awesome tool! I have loved most of it so far and am super excited for its future!

I hope that someone can add a simple explanation of these two details to the Documentation For Users page, possibly in their own sections.

Enhance Download page experience (hard to find, unreadable, downplay non-embedded versions)

The download page should be much more prominently displayed in the menu.

On the download page itself, we want to show Windows, Mac and Linux download buttons much more prominently, for the latest stable release. Other releases can be listed elsewhere.

Downloading OpenRefine should be a complete no-brainer.

For instance:

Some people even tweet at us to find the download link! https://twitter.com/rebelsouly/status/1207044835452039168

FAQ: How much data can OpenRefine deal with?

I'm always frustrated when OpenRefine can not deal with my data.

Proposed solution

Based on the thread "Measuring scale/limits of OpenRefine" (1853 views) at https://groups.google.com/g/openrefine/c/-loChQe4CNg/m/eroRAq9_BwAJ?pli=1, add a FAQ item that explains where OpenRefine's limits come from, what sacrifices are required to lift them, what efforts are being made in this direction, and how to track, measure and join these efforts.

If a full answer would be too long, maybe it is possible to create a blog post that sums up how the CZI grant was spent with regard to the scaling objective, so that other funders can decide whether they want to support the initiative. It could then explain how https://github.com/OpenRefine/OpenRefine/projects?type=classic relates to the goal (how much improvement in measurements is expected) and how 4.0 changes the situation (which issues are most critical).

It would also be interesting to read about the technical details: how OpenRefine uses up memory, the fact that available memory is consumed not only by OpenRefine but also by the browser, and whether OpenRefine is able to detect when memory goes into swap and performance drops. It is extremely important to see the list of features that need to be ported to support the new Dataflow model: which features would need to be sacrificed, which will gain speed, which will lift memory limits, and which will suffer.

Alternatives considered

Would be glad to know any.

Additional context

We were discussing the absence of convenient interfaces for data wrangling and data preparation tools for big data, and whether it is easier to rewrite something from scratch than to try to enhance OpenRefine.

Document Wikibase schema format

OpenRefine lets users import and export the Wikibase schemas that they use in the tool, in the JSON format we use internally.
It could be useful to document this format which could be adopted by other tabular data integration tools.

Proposed solution

  • Document the overall structure of the schema in the developer docs
  • Offer a JSON schema

Alternatives considered

Perhaps there are reasons why we should instead want to keep this private? What sort of stability / versioning policy do we want to have here?

Additional context

Brought up in a discussion with @addshore

Expand Data Privacy page to include web site

The Data Privacy page seems to focus entirely on the OpenRefine tool, but such pages typically also cover things like web server log retention. It should probably be expanded to include at least cursory info or pointers to data privacy statements for:

  • openrefine.org web server
  • mailing lists
  • Discourse (if we end up going that direction)
  • Gitter

Even just a list of these things and statement that they have their own privacy statements would be a starting point.

Typos

From @mroswell reported at OpenRefine/OpenRefine#682

There are several typos on the OpenRefine.org home page.

activly -> actively
lastest -> latest

(Is this the proper place to report these? If not, please give guidance.)

Also:
http://openrefine.org/OpenRefine/community
writting tutorial -> writing a tutorial

"The event page also list previous presentations as catch up and source of inspiration for your presentation!"
could use a rewrite

And maybe replace
http://openrefine.org/OpenRefine/blog
with a twitter feed, since there's no blog content

New hideout theme - improve display for larger screen

We should also set up a layout for larger screens (mine is 1920x1080) so that the .container is allowed to flex out wider, and goes back to some minimum width if I shrink my browser window. It looks like the current default width of the .container box itself is too small on larger screens. Look at some of the examples and everything else on this page to use Flexbox more effectively: https://css-tricks.com/snippets/css/a-guide-to-flexbox/#flexbox-examples


initially reported here

Localization for openrefine.org

openrefine.org should be translated in various languages, just like the software is.

This might require moving out of GitHub pages if we want to auto-detect the language from browser settings.

Google Auth wiki page needs to be improved

Thanks to the work done by @afkbrb, we can now configure the G-Data credentials in refine.ini, but the wiki has not been updated to reflect this. As a result, the document is still ambiguous for users about how to put the G-Data credentials into OpenRefine.
The OAuth wiki page should therefore be updated, which may ease the OAuth process for users.

-i 0.0.0.0 not working since 3.5?

Using 3.4.1 (docker run -p 3333:3333 felixlohmeier/openrefine:3.4.1), curl localhost:3333 works.

Using 3.5.0, curl localhost:3333 hangs. Is it because of -i 0.0.0.0 not being properly handled in 3.5.1?

Revamp / cleanup of OpenRefine's Wikibase documentation

OpenRefine started off with Wikidata support, and is now also increasingly supporting arbitrary Wikibases and now also Wikimedia Commons (yay!)

This is great, but our Wikibase documentation is still structured as a remnant of the old Wikidata-centric situation. Wikibase documentation has been added over time in various places, and in OpenRefine/OpenRefine#5023 I shoehorned in some documentation focused on Wikimedia Commons. But as a whole, IMO the documentation has become a bit inconsistent and potentially confusing for end users. I think it would be great to dedicate some time to a future-proof re-edit of that entire section.

Proposed solution

Fresh reorganization of the Wikibase support section. Some of my thoughts here:

  • I think it's good to often explicitly mention all of Wikibase, Wikidata, and Wikimedia Commons. Wikidatans and Wikimedia Commons users may not be familiar with the term 'Wikibase' at all.
  • I have a hunch that we are mostly serving three distinct groups of users with this documentation: Wikibase users, Wikidatans, Wikimedia Commons editors. This may need a bit more research and testing, but I think it does make sense to provide various entry points if these are indeed large distinct groups (even if the back-end features, and the Wikibase extension as a whole, are considered as a holistic architecture from the technical side).

Additional context

Maybe we can fit this into the upcoming Wikibase-focused work (roughly July-December 2022) which will be funded by NFDI.
