Comments (15)

ValentineHerr avatar ValentineHerr commented on August 28, 2024

Errors when loading again, as input, the dataset and profile just returned by the app

Did they use the "App's profile" in Headers and Units instead of their profile (as returned by the app after the first interaction with it)?

from dataharmonization.

gabrielareto avatar gabrielareto commented on August 28, 2024

I did not record that.

Shouldn't it work both ways?

ValentineHerr avatar ValentineHerr commented on August 28, 2024

No.
The user's profile maps user data to standard data.
The app's profile maps standard data to standard data (no conversion).

If the user inputs standard data and uses their own profile, the app will look for the user's columns (the original ones) in a file that does not have them, which causes all kinds of issues.
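
A minimal sketch of that distinction, assuming a profile can be modelled as a plain mapping from standard column names to the column names expected in the input (the real profiles are .rds objects that hold more than this; every name below is hypothetical):

```python
# Hypothetical, simplified model: a profile maps each standard column
# name to the column name it expects to find in the input table.
USER_PROFILE = {"tree_id": "TreeTag", "dbh": "Diameter_cm"}
APP_PROFILE = {"tree_id": "tree_id", "dbh": "dbh"}  # identity: no conversion


def missing_columns(table_columns, profile):
    """Return the input columns the profile expects but the table lacks."""
    return sorted(src for src in profile.values() if src not in table_columns)


original_columns = {"TreeTag", "Diameter_cm"}  # the user's raw data
standard_columns = {"tree_id", "dbh"}          # data returned by the app

print(missing_columns(original_columns, USER_PROFILE))  # []
print(missing_columns(standard_columns, USER_PROFILE))  # ['Diameter_cm', 'TreeTag']
print(missing_columns(standard_columns, APP_PROFILE))   # []
```

Only the second combination (standard data with the user's own profile) reports missing columns, which is the failing case described above.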

gabrielareto avatar gabrielareto commented on August 28, 2024

right, thanks!

It is a bit confusing that the output is [data in format 2] and [profile that describes format 1].

It is also a vulnerable point, because the user can forget (or lose) the format of their output.

I think it would be easier for the user if we work with pairs of {data in format X, description of format X}. For example if the output is structured into subfolders:

  • original/data/ : all the original tables as provided by the user
  • original/metadata/ : with the profile that describes the original data of the user, the metadata table, ...
  • processed/data/ : the processed data, the current output
  • processed/metadata/ : with the profile that describes the output data (either the app's default, or any of the defaults, or as provided by the user as their desired format for the output).

would this be a problem?
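
To make the proposed layout concrete, here is a small sketch that only builds the four folder paths; the root folder name and the function name are placeholders, not anything the app defines:

```python
from pathlib import PurePosixPath


def output_layout(root):
    """Build the four proposed output folders, keyed by (stage, kind)."""
    root = PurePosixPath(root)
    return {
        (stage, kind): root / stage / kind
        for stage in ("original", "processed")
        for kind in ("data", "metadata")
    }


# "harmonization_output" is a hypothetical root folder name.
layout = output_layout("harmonization_output")
for path in layout.values():
    print(path)
```

Each data folder then sits next to the metadata folder that describes it, so a {data, description} pair always travels together.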

ValentineHerr avatar ValentineHerr commented on August 28, 2024

I will try to do that

gabrielareto avatar gabrielareto commented on August 28, 2024

related to this idea of pairs of {data in format X, description of format X}:

It may be clearer for the user, and easier to understand how the app works, if they declare the input profile and the output profile at the same time rather than in different tabs. Something like:

1- load your data
2- profiles etc.

do you have a description of your input? Pick from the list, or provide a rds object that blah blah blah.

  • if in the list, pick it.
  • if not on the list: load it. It can be an incomplete profile, you can update it later.
  • if you do not have it: you will create it in tab "headers and units".

do you have a description of your desired output? Pick from the list, or provide a rds object that blah blah blah. This is typically a consensus format in collaborations, and may be provided to you by the person(s) responsible for data aggregation. It could also be the format of a network or repository in which you want to integrate your data.

  • if in the list, pick it.
  • if not on the list: load it.
  • if you do not have it: the app will return data in a default standard format.

3- headers and units. In case you want to create/update the description of the input data.

etc.

are there reasons why this could be more difficult for us or the users than the current design?

ValentineHerr avatar ValentineHerr commented on August 28, 2024

here is what the files are in the output now (for a case where I uploaded 3 tables to stack and merge, and gave them the name you see):

[image: listing of the output files]

gabrielareto avatar gabrielareto commented on August 28, 2024

I think this helps, thanks. Can you try to pass through the app again using output_data.csv and outputProfile.rds as your new input? For example, as if you were to run the corrections on a second phase. It should always work, right?

ValentineHerr avatar ValentineHerr commented on August 28, 2024

I tried. The app works but it does show a big warning that the profile doesn't seem to apply to the data.

I relaxed the error alert, so the error only shows when the match between data and profile is really bad. I also added some info to the profile to record when the uploaded profile was originally the app's profile.

For more context, the app's profile has all possible entries filled out, but I only give the relevant columns in the processed data to the user, so when the user uploads that and uses the app's profile as input, there are a lot of columns "missing" compared to what that profile expects (other custom profiles would have most entries as "none"). I had a threshold of 20 columns missing, over which it could really be that the data is not in the app's profile format. But since I have been adding a lot of column options over time, I think 20 became too restrictive. I changed it to 50.
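
A rough sketch of that kind of check, assuming the match is judged by counting the profile's columns that are absent from the data. The thresholds 20 and 50 come from the description above; the function name, the counting rule, and the column counts are illustrative assumptions, not the app's actual code:

```python
def check_profile_match(data_columns, profile_columns, max_missing=50):
    """Return (looks_mismatched, missing) for a table and a profile.

    A table is flagged only when more than `max_missing` of the columns
    the profile expects are absent from the data.
    """
    missing = sorted(set(profile_columns) - set(data_columns))
    return len(missing) > max_missing, missing


# Hypothetical numbers: the app's profile describes 60 columns, but the
# processed data handed back to the user contains only 15 of them.
profile_columns = [f"col_{i}" for i in range(60)]
data_columns = profile_columns[:15]

flag_old, missing = check_profile_match(data_columns, profile_columns, max_missing=20)
flag_new, _ = check_profile_match(data_columns, profile_columns, max_missing=50)
print(len(missing), flag_old, flag_new)  # 45 True False
```

With 45 columns missing, the old threshold of 20 flags the data as not matching the profile, while the threshold of 50 lets it through.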

ValentineHerr avatar ValentineHerr commented on August 28, 2024

let me know if you get more feedback on this issue. I'll close it for now.

gabrielareto avatar gabrielareto commented on August 28, 2024

I understand, thanks.

I think it is ok as it is, as long as the warning just means that. The data federation will require proper merging, not just stacking. That is a good practice.

Some thoughts for the record:

Under the approach of having pairs of {data in format X, description of format X}, it would be desirable to have a perfect match between the data and its description. That would require subsetting the columns in the same way both in the data and in the profile.

Proper merging would work exactly the same, but a test of "is profile 1 = profile 2" would return FALSE. Is there any need for profiles that refer to different subsets of columns to be exactly the same? Can we foresee any step in a real data federation project in which the match between profiles (.rds objects) can play a role?

One alternative is to never delete empty columns, and return sparse datasets that (1) match perfectly the complete version of the profile and, therefore, (2) could be merged by simply stacking. This seems a more general solution. For example, teams could use the stacking step as a way to merge their datasets within the app, before running the corrections on the aggregated data. How important is saving space for us?
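
A sketch of this alternative, assuming tables are lists of row dictionaries and the complete profile is reduced to a fixed list of column names (both are simplifications, and the column names are made up):

```python
# Hypothetical complete set of standard columns.
FULL_PROFILE_COLUMNS = ["tree_id", "dbh", "height", "species"]


def pad_to_profile(rows, columns=FULL_PROFILE_COLUMNS):
    """Give every row all profile columns, filling absent ones with None."""
    return [{col: row.get(col) for col in columns} for row in rows]


def stack(*tables):
    """Merge tables by simple row-wise stacking."""
    return [row for table in tables for row in table]


team_a = [{"tree_id": "A1", "dbh": 12.5}]
team_b = [{"tree_id": "B7", "height": 8.0, "species": "Quercus alba"}]

merged = stack(pad_to_profile(team_a), pad_to_profile(team_b))
for row in merged:
    print(row)
```

Because every padded table has the same columns in the same order, stacking alone produces a consistent merged table, at the cost of carrying empty cells.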

ValentineHerr avatar ValentineHerr commented on August 28, 2024

"That would require subsetting the columns in the same way both in the data and in the profile."

I thought about that but profiles hold more than column names. They also have units info, date format info etc...

ValentineHerr avatar ValentineHerr commented on August 28, 2024

getting a whole bunch of empty columns is not pleasant.... I originally had that but got negative feedback on it so I removed them....

We could have it as an option so people can use it as you say... but in that case we would need to get rid of all the "_original" columns, which will be dataset specific...

gabrielareto avatar gabrielareto commented on August 28, 2024

ok, let's keep it as it is now. Thanks!
