openrefine / openrefine.org

Source website for openrefine.org

License: Other

JavaScript 42.10% TypeScript 46.94% CSS 10.96%

openrefine.org's Introduction

OpenRefine

OpenRefine is a Java-based power tool that allows you to load data, understand it, clean it up, reconcile it, and augment it with data coming from the web. All from a web browser and the comfort and privacy of your own computer.

Official website: https://openrefine.org

Community forum: https://forum.openrefine.org

Download

OpenRefine Releases

Snapshot releases

You can download snapshots of the development version of OpenRefine. To do so, you need to be logged in to GitHub. Then, click on the first item with a green tick / check mark on this page and scroll down to the Artifacts section to find the version that matches your operating system.

Run from source

If you have cloned this repository to your computer, you can run OpenRefine with:

./refine on Mac OS and Linux
refine.bat on Windows

This requires JDK 11 or newer, Apache Maven and NPM 16 or newer.
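As a rough illustration of the prerequisites above, the following sketch checks whether the three tools are reachable on the PATH. The executable names (java, mvn, npm) are assumptions about a typical setup, the helper name missing_tools is hypothetical, and the script does not verify version numbers.

```python
import shutil

# Assumed executable names for the tools listed above; adjust for your system.
REQUIRED_TOOLS = {"JDK": "java", "Apache Maven": "mvn", "NPM": "npm"}


def missing_tools(which=shutil.which):
    """Return the names of required tools whose executable is not on PATH.

    The lookup function is injectable so the check can be exercised
    without depending on what is installed locally.
    """
    return [name for name, exe in REQUIRED_TOOLS.items() if which(exe) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found on PATH (versions are not checked).")
```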

Documentation

Contributing to the project

Contact us

Licensing and legal issues

OpenRefine is open source software and is licensed under the BSD license located in the LICENSE.txt. See the folder licenses for information on open source libraries that OpenRefine depends on.

Credits

This software was created by Metaweb Technologies, Inc. and originally written and conceived by David Huynh. Metaweb Technologies, Inc. was acquired by Google, Inc. in July 2010 and the product was renamed Google Refine. In October 2012, it was renamed OpenRefine as it transitioned to a community-driven project.

Since 2020, OpenRefine has been fiscally sponsored by Code for Science and Society (CS&S).

See CONTRIBUTING.md for instructions on how to contribute yourself.

openrefine.org's Issues

Create an openrefine-announce mailing list

The current OpenRefine blog is basically a giant black hole: there is no way to subscribe to it, so posts that are important to the community, like the recent appointment of a new steering committee member, end up never being seen. It would be worth considering a low-volume, post-only, opt-in announcement mailing list that people can subscribe to.

Allow regular Wikimedians to edit Wikidata or Wikimedia accounts with their regular accounts, without requiring bot passwords

Proposal

@wetneb stated: IMHO doing away with the pop-up entirely, perhaps replacing it with some incentive to use bot passwords in the login form, would be a great move already.

@trnstlntk stated: I would suggest a new page on docs.openrefine.org dedicated to signing into Wikibase (merging the existing docs (1/2)) and replacing the two existing paragraphs with something pointing to the new page.
"Do you have trouble logging in?" "Safer login methods" or something combining those messages.

Original Comment

I notice during tests and demos that 'regular' Wikimedians (the kind of people who, like me, usually only do small batches and who don't write/operate bots) get confused when OpenRefine asks them to provide / get bot passwords or log in with an owner-only consumer.

Can we simplify this process, or make this even unnecessary somehow?

Additional context

I'm creating this issue because at some point I got the following feedback in my log:

19:20:19.823 [..kibaseapi.ApiConnection] API warning [main]: Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes. Use [[Special:ApiFeatureUsage]] to see usage of deprecated features by your application. (2283ms)
19:20:19.824 [..kibaseapi.ApiConnection] API warning [login]: Main-account login via "action=login" is deprecated and may stop working without warning. To continue login with "action=login", see [[Special:BotPasswords]]. To safely continue using main-account login, see "action=clientlogin". (1ms)

and while I'm not well-versed in this stuff, it looks as if "action=clientlogin" is something worth looking into?
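For readers unfamiliar with the flow the warning refers to, here is a minimal sketch of the request parameters involved in a MediaWiki "action=clientlogin" login: first fetch a login token, then post the credentials together with that token. This is not how OpenRefine or its Wikibase library performs login; the helper names are hypothetical, the return URL is a placeholder, and no network call is made.

```python
def login_token_params():
    """Parameters for the request that fetches a login token."""
    return {"action": "query", "meta": "tokens", "type": "login", "format": "json"}


def clientlogin_params(username, password, login_token, return_url):
    """Parameters for the POST request that performs the interactive login.

    return_url is required by the API for the first step of the flow;
    the value passed in here is only a placeholder supplied by the caller.
    """
    return {
        "action": "clientlogin",
        "username": username,
        "password": password,
        "logintoken": login_token,
        "loginreturnurl": return_url,
        "format": "json",
    }


def login_succeeded(response):
    """Return True if a decoded clientlogin response reports status PASS."""
    return response.get("clientlogin", {}).get("status") == "PASS"
```

In a real client, both parameter sets would be sent to the wiki's api.php endpoint within the same cookie-preserving session, with the second sent as a POST body.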

Linux Install Instructions Enhancement

There is a lack of clear instructions on installing and running OpenRefine on Linux. The current documentation for Linux installation only states that the user should download the tar.gz archive and extract it with tar xzf openrefine-linux-3.4.tar.gz

Proposed solution

At the very least, the current documentation for Linux installation should be amended, as the command tar xzf openrefine-linux-3.4.tar.gz is outdated. Note that its naming format differs from the current convention: openrefine-linux-3.4.tar.gz instead of openrefine-3.6.2-linux.tar.gz. A solution would be to replace that command suggestion with tar xzf openrefine-3.6.2-linux.tar.gz.
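To make the naming convention concrete, here is a small sketch that builds the archive name for a given version and extracts it, roughly what tar xzf does. The function names are hypothetical, and the pattern is inferred only from the 3.6.2 example above, so it may not hold for every release.

```python
import tarfile
from pathlib import Path


def linux_archive_name(version):
    """Build the archive file name following the openrefine-<version>-linux.tar.gz pattern."""
    return f"openrefine-{version}-linux.tar.gz"


def extract_archive(archive_path, destination):
    """Extract a gzip-compressed tar archive into destination and return that path."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
    return Path(destination)
```

For example, linux_archive_name("3.6.2") yields the file name used in the suggested command.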

Additionally, I believe that OpenRefine's users would benefit from instructions on how to run the program from the command line. This could take the form of the following:

Use the cd command to navigate to the directory housing OpenRefine

brendan@librarium:~$ cd Documents/openrefine-3.6.2/
brendan@librarium:~/Documents/openrefine-3.6.2$

Run OpenRefine by entering ./refine

brendan@librarium:~/Documents/openrefine-3.6.2$ ./refine

OpenRefine will start.

Blog post with a summary of OpenRefine's 2022 user survey

We have just concluded our two-yearly user survey, and a blog post with a summary of the results is due.

Algolia search results are returned twice

Now that OpenRefine/OpenRefine#5127 is fixed, I can search e.g. for "jython".
But every hit is returned twice, and the second match goes to an invalid (incomplete) URL: https://technical-reference/writing-extensions#server-side-scripting-languages
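One way to reason about the symptom is a filter that drops hits whose URL has no plausible host and removes exact duplicates. This is only an illustrative sketch, not a fix for the Algolia or Docusaurus configuration; the function names are hypothetical, and the "host must contain a dot" rule is a simplifying assumption that would also reject hosts such as localhost.

```python
from urllib.parse import urlparse


def has_plausible_host(url):
    """Return True if the URL's host contains a dot, e.g. docs.openrefine.org."""
    host = urlparse(url).hostname or ""
    return "." in host


def dedupe_hits(urls):
    """Keep the first occurrence of each well-formed URL, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if not has_plausible_host(url):
            continue  # e.g. a URL whose host is just "technical-reference"
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```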

Document which operations need a lot of RAM in the new architecture

In the new architecture (4.0 branch), some operations or importers have been optimized so that they do not require a lot of RAM even if the dataset is large. However as a user it is not necessarily clear which operations can safely be used in pipelines meant to run on large datasets. In some cases it might be possible for the user to find another way to carry out a transformation (or even do it externally). For this to be doable, they must be able to understand which operations should be avoided.

This applies to operations, but also importers and exporters.
A given operation / importer / exporter can be efficient in some settings and inefficient in others. For instance, the CSV importer is efficient if the multiLine option is set to true but inefficient if multiLine is set to false and escaping is enabled.

Proposed solution

Document scalability of each of these components in the official manual
Consider adding some warnings to the UI before triggering an operation which might require a lot of RAM (perhaps not so easy?)

Alternatives considered

Not sure?

Use standard logo as header

The version of the OpenRefine logo that is used on this website (wireframe version of the diamond) is not used anywhere else. I think OpenRefine's official website should use OpenRefine's official logo! If it does not fit with the rest of the design, then the rest of the design is wrong.

Document the creation of MediaWiki tag in Wikibase docs

Our docs about the Wikibase integration (https://docs.openrefine.org/next/manual/wikibase/configuration) do not currently mention that Wikibase admins need to create a tag (openrefine-3.x) to enable the integration.

We should ideally let users disable the need for a tag in their manifest (maybe that's another issue).

Update Website with link to bountysource

OpenRefine Bounty Source is taking off (thanks @thadguidry for setting this up!) We have regular backers and already awarded two bounties in 2017!

I think the public website deserves a link encouraging our user to donate.

Pull Request Welcome

Two-week test period for Discourse forum

It seems that our community has no major objections against moving OpenRefine's mailing lists to Discourse; see thread on our user mailing list and a larger document with considerations around this move.

So I suggest we go ahead with a two-week test period. Let's use this issue for discussion as long as we can't use the forum yet.

Migrate content from MediaWiki bot page on wiki to user manual

The new user manual has details about creating a bot for Wikidata: https://docs.openrefine.org/manual/wikidata#manage-wikidata-account

There is a page on the wiki that gives more general information about creating a bot account for any Wikibase instance https://github.com/OpenRefine/OpenRefine/wiki/MediaWiki-Bot-Passwords

The more general information should be migrated across to the new user manual and then the MediaWiki-Bot-Passwords page in the wiki can be deprecated (replace the content with a link to the relevant page in the new user manual)

Documentation edit process with versioned docs

We have a problem with the way we edit our documentation.

Docusaurus includes links from the documentation to GitHub so that a given page can be edited directly. That is very convenient, however it does not tie in very well with versioning. By default, the docs that are visible at docs.openrefine.org are the docs for the latest stable version, and so the edit link points to the corresponding versioned page:

https://docs.openrefine.org/manual/starting links to https://github.com/OpenRefine/OpenRefine/edit/master/docs/versioned_docs/version-3.5/manual/starting.md

However, it is important that the "current" docs are updated too, since they will become the stable docs at the next release.
Not doing so means that the changes will disappear from docs.openrefine.org at the next release.

This is affecting the following PRs:

I will check if other PRs have been affected and backport the changes to the 3.5 and current docs.

This makes me wonder:

The technical reference probably does not need this versioning mechanism. Is there a way we could exclude it from the versioning mechanism?
Maybe the versioning we are doing now is a bit too aggressive: there are not so many user-visible changes between 3.4 and 3.5 (the main part is the Wikibase integration), so perhaps we should just have separate docs for 3.x and 4.x?
Could we have some CI mechanism to flag changes which only edit versioned docs and not the current docs?
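A CI check along the lines of the last question could be sketched as follows: given the list of files changed in a pull request, flag every versioned page whose counterpart in the current docs was not changed too. The versioned path layout comes from the edit link quoted above; the location of the current docs (docs/docs/) and the function name are assumptions.

```python
import re

# Layout taken from the edit link above: docs/versioned_docs/version-3.5/manual/starting.md
VERSIONED = re.compile(r"^docs/versioned_docs/version-[^/]+/(?P<page>.+)$")

# Assumed location of the "current" (unreleased) docs; adjust if the repository differs.
CURRENT_PREFIX = "docs/docs/"


def versioned_only_changes(changed_paths):
    """Return versioned doc pages edited without the matching current page."""
    changed = set(changed_paths)
    flagged = []
    for path in sorted(changed):
        match = VERSIONED.match(path)
        if match and CURRENT_PREFIX + match.group("page") not in changed:
            flagged.append(path)
    return flagged
```

A CI job could feed this the output of a diff against the base branch and fail, or simply leave a comment, when the returned list is non-empty.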

Remove Disqus comments

They are not actually used much
They use a third-party service (with all privacy implications this entails)
Disqus uses it to embed unrelated content
They do not look professional in my opinion

migrate from Maruku to Kramdown

When updating the website I received the following email warning:

Your site is using Maruku, the default Markdown interpreter. Maruku is now obsolete and may cause builds to fail for sites with invalid Markdown or HTML. See https://help.github.com/articles/migrating-your-pages-site-from-maruku for more information on upgrading to a newer Markdown interpreter.

We should do the migration on a separate fork and test it before merging it back here.

Misalignment in book topics on website

I visited the OpenRefine's website, and saw this:

which probably only requires a simple fix. However, I will be going to bed soon so I thought I'd better post this here first before I forget. Feel free to fix this in the meantime.

Fix favicon to use new logo without text

The favicon currently includes some text - it should only be the blue diamond itself, as the current version is ugly and unreadable.

Update Website to indicate different language supported

We should indicate on the website that OpenRefine is available in English, Spanish, Italian, French, Chinese and Japanese.

Pull Request Welcome

Advertise Debian / Ubuntu package in our docs

Our documentation does not yet mention that users of Debian derivatives (such as Ubuntu) can install OpenRefine with APT (apt install openrefine). This is much more convenient than downloading the archive, so it ought to be documented! I guess that could be on our website too.

https://packages.debian.org/bookworm/openrefine
https://launchpad.net/ubuntu/+source/openrefine/3.5.2-1

<title> is openrefine.github.com

The title of the website should not be openrefine.github.com… It should be "OpenRefine"!

Update Support/Sponsors and make it a list

We should make a separate div list element with logos and dates next to each for our sponsors.
Or just have a Sponsors page separately...dunno...

Something like this for now...

Sponsors:

2020 Chan Zuckerberg Initiative
2018 Google News Initiative

Refresh website

We would appreciate a PR to refresh the current website with new CSS and layout.
We would like to stick with GitHub Pages and Jekyll. Content remains the same (i.e. same pages, same text); we want to focus only on the style.

Quote from @thadguidry:

This would be an awesome portfolio opportunity for some teenager somewhere in the world :)

Unicode support on Regexp

Hello,

As a user, I want regex facets to handle accented characters within words. Applying the regex modifier 'u' would simplify my life. Strangely, it is possible to bypass this limitation by using Jython when creating columns, but it remains difficult because you have to code each transformation rather than use a simple regex modifier.

As a solution, I wonder if the modifier u could be applied by default to projects based on UTF files. For the moment, this is very annoying.

Best regards,

Migrate to new Twitter feed before July 27, 2018

The Twitter widget we are using will be deprecated on July 27, 2018. See announcement.

Twitter offers no out-of-the-box widget to show results from a specific query. The easy options are either:

List tweets from @OpenRefine
List tweets with the tag #OpenRefine

If we want to keep the same feed we need to write our own widget using Twitter API.

Blog post asking feedback on moving OpenRefine's mailing lists and Gitter to Discourse

We are considering moving OpenRefine's mailing lists and Gitter to a web-based Discourse forum, and invite community feedback. A message will be sent to OpenRefine's mailing lists on September 20, 2022. But I also want to publish a blog post (with the same message) on that same day, so that we can tweet about it and generally invite feedback from those community members who are not on our mailing list.

HTTPS for openrefine.org

Following remarks from @nilswindisch on Twitter, we should support HTTPS on openrefine.org.

Documentation: version only by major version

We currently offer separate documentation for each minor release of the tool (3.4, 3.5, …).
This has a few downsides:

when making improvements to the docs, we often need to edit multiple copies of the docs, which is cumbersome
not much is changing between each minor version anyway

At the same time, we do not have a good mechanism to publish the documentation of the 4.0 branch.
If we create a 4.0 version in Docusaurus, it will be the one displayed by default to the users (because it is the latest one), but since we have not published a stable release in that branch, this is undesirable (people should see 3.x first).

Proposed solution

Switch to only one set of documentation per major version instead. This means that at the moment, we would have a version for OpenRefine 3, and the development docs would be for OpenRefine 4.

Alternatives considered

I am not sure how else we could publish docs for 4.0 with the current system.

Additional context

Also, that solution would mean that we would need to edit the docs for OpenRefine 4 on the master branch, which might be a bit confusing. If we also want to migrate our openrefine.org website to Docusaurus, perhaps this is a sign we should be using a different repository after all?

privacy.md is not listed in the sidebar

Document versions of OpenRefine we support

We will need to support both 4.x and 3.x versions of OpenRefine in the future and be able to handle documentation and security patches, perhaps using GitHub's Security Policy doc, as I mentioned in PR OpenRefine/OpenRefine#2048 (now closed), with granular semantic versions (major.minor.patch).

@wetneb had suggested previously in that PR that we should probably do the following:

Given our current workflows we are basically not able to release fixes for any previous version. We would need to use branches for the major versions we want to support on the long term. That's definitely something we can consider. Each security patch would be merged in all supported major release branches to create the corresponding versions. That would probably mean using more granular release numbers (such as 3.2.0).

Verify that our chosen documentation system has support for Semantic Versioning OpenRefine/OpenRefine#2273
Decide on supported versions and branching
Verify that GitHub's Security Policy doc works on those supported branches.
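To illustrate what granular major.minor.patch numbers would make possible, here is a small sketch that parses a version string, checks whether its major line is supported, and computes the next patch release. The set of supported majors (3 and 4) is an assumption based on the text above, and all names are hypothetical.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


# Assumption: the 3.x and 4.x lines are the supported ones, as discussed above.
SUPPORTED_MAJORS = {3, 4}


def parse_version(text):
    """Parse a strict major.minor.patch string such as '3.2.0'."""
    parts = text.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected major.minor.patch, got {text!r}")
    return Version(*(int(part) for part in parts))


def is_supported(text):
    """Return True if the version belongs to a supported major line."""
    return parse_version(text).major in SUPPORTED_MAJORS


def next_patch(text):
    """Return the version string a security patch release would get."""
    version = parse_version(text)
    return f"{version.major}.{version.minor}.{version.patch + 1}"
```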

Document that OOXML Strict files are not supported

As mentioned in OpenRefine/OpenRefine#2221, until that feature request is implemented, we should document that we don't support Strict mode OOXML files.

Document the workaround steps for running OpenRefine on OS's that block running unsigned apps

Until we solve OpenRefine/OpenRefine#3003 and OpenRefine/OpenRefine#4568 we should document workaround steps.

Some operating systems, such as macOS and Windows, require additional steps in which users acknowledge that they are installing applications downloaded from untrusted sources, or unsigned applications.

Proposed solution

Until OpenRefine has the capability to be considered an approved signed application for MacOS and Windows, we should:

provide workaround steps in our docs for running OpenRefine.
As a user, test on MacOS and Windows 10/11 that the workaround steps allow OpenRefine to be run.

Alternatives considered

Additional context

For example, here is a screenshot that could be added to our docs for Windows. It shows OpenRefine 3.5.2 (openrefine.exe) being run for the first time on Windows 11, where the user needs to click More info and then click the Run anyway button that appears.

Document which OR versions extensions are compatible with

In the table that lists extensions, we should list the OpenRefine versions they are compatible with.

Docs: Deduplicate browser lists in manual/{installing,running}.md

Both files contain the same section about compatible browsers, see https://github.com/OpenRefine/OpenRefine/pull/5231/files for example.

Proposed solution

… have a page on browser compatibility which is linked from both locations?

Alternatives considered

Define either one to be the single source of truth, and replace the 2nd list with a link to the 1st.

Additional context

This is a follow-up to an open discussion in the above-mentioned PR.

Fix 2 deficiencies in Documentation

There are two features which as a new user I didn't really understand because they weren't on the "Documentation For Users" page.

Row mode vs Record mode. I finally figured it out thanks to this article
The difference between null and the empty string ("") -- I still don't understand this. After reading as much as I could stomach of OpenRefine/OpenRefine#820, OpenRefine/OpenRefine#1544, OpenRefine/OpenRefine#1571, I still really don't understand the fundamental difference in how they are represented and how they change the functionality of OpenRefine (e.g. during export). I have also found out that it seems easy enough to convert from null --> "" with the coalesce() function, but I don't know if it's possible to go from "" --> null or if I would ever even want to do so.

Thanks so much for an awesome tool! I have loved most of it so far and am super excited for its future!

I hope that someone can add a simple explanation of these two details to the Documentation For Users page, possibly in their own sections.

Enhance Download page experience (hard to find, unreadable, downplay non-embedded versions)

The download page should be much more prominently displayed in the menu.

On the download page itself, we want to show Windows, Mac and Linux download buttons much more prominently, for the latest stable release. Other releases can be listed elsewhere.

Downloading OpenRefine should be a complete no-brainer.

For instance:

Some people even tweet at us to find the download link! https://twitter.com/rebelsouly/status/1207044835452039168

Blog post to announce our May 2022 Outreachy interns

Publish blog post to announce our 2022 Outreachy interns.

FAQ: How much data can OpenRefine deal with?

I'm always frustrated when OpenRefine cannot deal with my data.

Proposed solution

Based on the thread called "Measuring scale/limits of OpenRefine", with 1853 views, at https://groups.google.com/g/openrefine/c/-loChQe4CNg/m/eroRAq9_BwAJ?pli=1, add a FAQ item that answers where OpenRefine's limits come from, what sacrifices are required to lift them, what efforts are being made in this direction, and how to track, measure and join these efforts.

If the full answer turns out to be too long, maybe it is possible to create a blog post that sums up how the CZI grant was spent with regard to the scaling objective, so that other funders can decide whether they want to join in supporting the initiative. It could then explain how https://github.com/OpenRefine/OpenRefine/projects?type=classic relates to the goal (how much improvement in measurements is expected) and how 4.0 changes the situation (which issues are most critical).

It would also be interesting to read about technical details: how OpenRefine uses up memory, the fact that available memory is consumed not only by OpenRefine but also by the browser, and whether OpenRefine is able to track when memory goes into swap and performance drops. It is extremely important to see the list of features that need to be ported to support the new Dataflow model: which features would need to be sacrificed, which will gain speed, which will lift memory limits, and which will suffer.

Alternatives considered

Would be glad to know any.

Additional context

We were discussing the absence of convenient interfaces for data wrangling and data preparation tools for big data, and whether it is easier to rewrite something from scratch than to try to enhance OpenRefine.

Blog post about OpenRefine's governance transitions

Publish a blog post that announces OpenRefine's governance transitions (creating an 'ambassador council', other changes), the expected transitions in the next months, + asking for feedback.

Make license clearer

OpenRefine's license should be made clearer on the website. See this thread on our user mailing list where someone could not find our terms of use:
https://groups.google.com/g/openrefine/c/XPZs7UNIUs0

Document Wikibase schema format

OpenRefine lets users import and export the Wikibase schemas that they use in the tool, in the JSON format we use internally.
It could be useful to document this format which could be adopted by other tabular data integration tools.

Proposed solution

Document the overall structure of the schema in the developer docs
Offer a JSON schema

Alternatives considered

Perhaps there are reasons why we should instead want to keep this private? What sort of stability / versioning policy do we want to have here?

Additional context

Brought up in a discussion with @addshore

Expand Data Privacy page to include web site

The Data Privacy page seems to focus entirely on the OpenRefine tool, but such pages typically also cover things like web server log retention. It should probably be expanded to include at least cursory info or pointers to data privacy statements for:

openrefine.org web server
mailing lists
Discourse (if we end up going that direction)
Gitter

Even just a list of these things and statement that they have their own privacy statements would be a starting point.

CiteAs metadata

We should add the relevant metadata as required by CiteAs.org to suggest a canonical way to cite OpenRefine in scientific articles.

http://citeas.org/cite/OpenRefine

We already have a Zenodo DOI, but it has not been updated with recent versions for some reason:
https://zenodo.org/record/1059001
zenodo/zenodo#1708

Typos

From @mroswell reported at OpenRefine/OpenRefine#682

There are several typos on the OpenRefine.org home page.

activly -> actively
lastest -> latest

(Is this the proper place to report these? If not, please give guidance.)

Also:
http://openrefine.org/OpenRefine/community
writting tutorial -> writing a tutorial

"The event page also list previous presentations as catch up and source of inspiration for your presentation!"
could use a rewrite

And maybe replace
http://openrefine.org/OpenRefine/blog
with a twitter feed, since there's no blog content

Announcement for packaging projects

Let's write a blog post to announce that we are looking for people to help with Apple and Windows packaging.

New hideout theme - improve display for larger screen

We should also set up a layout for larger screens (mine is 1920x1080) so that the .container is allowed to flex out wider, and goes back to some minimum width if I shrink my browser window. It looks like the current default width of the .container box itself is too small on larger screens. Look at some of the examples and everything else on this page to use Flexbox more effectively: https://css-tricks.com/snippets/css/a-guide-to-flexbox/#flexbox-examples

initially reported here

Website theme uses <iframe> for the menu

We should use a better theme: using <iframe> for this sort of thing is not great practice, as it hinders natural scaling.

Localization for openrefine.org

openrefine.org should be translated in various languages, just like the software is.

This might require moving out of GitHub pages if we want to auto-detect the language from browser settings.

Google Auth wiki page needs to be improved

Thanks to work by @afkbrb, we can now configure the G-Data credentials from refine.ini, but the wiki has not been updated to reflect this. The document is therefore still ambiguous about how users should put the G-Data credentials into OpenRefine.
The OAuth wiki page should be updated, which may ease the OAuth process for users.

Remove `my category` for blog article

Currently we need to tag each blog article with

---
  category: My Category
---

It is reflected in the URL of each article: http://openrefine.org/my%20category/

I think this is a Jekyll setting that needs to be changed.

-i 0.0.0.0 not working since 3.5?

Using 3.4.1 (docker run -p 3333:3333 felixlohmeier/openrefine:3.4.1), curl localhost:333 works.

Using 3.5.0, curl localhost:333 hangs. Is it because of -i 0.0.0.0 not being properly handled in 3.5.1??
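When narrowing down a report like this, it can help to separate "nothing is listening on the port" from "something accepts the connection but never answers". The following sketch only tests the first condition with a plain TCP connection. The function name is hypothetical, and a successful connection does not prove that OpenRefine itself is responding, since a proxy in front of a container may accept connections on its behalf.

```python
import socket


def accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    # 3333 is the port published by the docker command quoted above.
    print(accepts_connections("localhost", 3333))
```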

Revamp / cleanup of OpenRefine's Wikibase documentation

OpenRefine started off with Wikidata support, and is now also increasingly supporting arbitrary Wikibases and now also Wikimedia Commons (yay!)

This is great, but our Wikibase documentation is still structured as a remnant of the old Wikidata-centric situation. Wikibase documentation has been added over time in various places, and in OpenRefine/OpenRefine#5023 I shoehorned in some documentation focused on Wikimedia Commons. But as a whole, IMO the documentation has become a bit inconsistent and potentially confusing for end users. I think it would be great to dedicate some time to a future-proof re-edit of that entire section.

Proposed solution

Fresh reorganization of the Wikibase support section. Some of my thoughts here:

I think it's good to explicitly and frequently mention all three of Wikibase, Wikidata, and Wikimedia Commons. Wikidatans and Wikimedia Commons users may not be familiar with the term 'Wikibase' at all.
I have a hunch that we are mostly serving three distinct groups of users with this documentation: Wikibase users, Wikidatans, Wikimedia Commons editors. This may need a bit more research and testing, but I think it does make sense to provide various entry points if these are indeed large distinct groups (even if the back-end features, and the Wikibase extension as a whole, are considered as a holistic architecture from the technical side).

Additional context

Maybe we can fit this into the upcoming Wikibase-focused work (roughly July-December 2022) which will be funded by NFDI.
