
Comments (22)

vukosim commented on July 18, 2024

Fantastic @anelda Will check this afternoon.

from covid19africa.

anelda commented on July 18, 2024

@vukosim I can take some time this morning to write a script for the conversion from the BU data to CSV line lists per country. Can give feedback by 10am to confirm progress.


vukosim commented on July 18, 2024

@anelda Hey thank you.


anelda commented on July 18, 2024

The country line lists are in https://github.com/anelda/africa_covid - you'll find them in the data/country_line_lists folder.

  • I used OpenRefine 3.2 and R to clean it up and separate into different files.
  • The columns that are present in the BU data but not in the line_list templates are all merged into the notes_for_discussion column with column_name=column_value.
  • If you cast a quick eye over this, I can look at merging with existing data over the weekend or later
  • If someone else wants to take a stab at merging with existing data, that would be great!
  • I can create a pull request with all countries' CSVs by making a sub-folder bu-linelists in the data folder https://github.com/dsfsi/covid19africa/tree/master/data if that would make it easier for now. Once the data is merged, we can delete that subfolder again?

Let me know if you find errors or want anything to change.
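For readers following along, the folding of extra columns into notes_for_discussion described above can be sketched in Python. This is not the OpenRefine/R workflow used for the actual files; the template column subset, the "; " separator, and the sample values below are assumptions made purely for illustration.

```python
# Hypothetical subset of the template columns; the real template has more.
TEMPLATE_COLUMNS = ["date", "country", "date_confirmation", "notes_for_discussion"]


def to_template_row(source_row, template_columns=TEMPLATE_COLUMNS):
    """Map one source row (a dict) onto the template columns.

    Columns the template lacks are folded into notes_for_discussion as
    column_name=column_value pairs, joined with "; " (an assumed separator).
    """
    row = {col: source_row.get(col, "") for col in template_columns}
    extras = [
        f"{name}={value}"
        for name, value in source_row.items()
        if name not in template_columns and value not in ("", None)
    ]
    existing = row["notes_for_discussion"]
    row["notes_for_discussion"] = "; ".join(([existing] if existing else []) + extras)
    return row


# Made-up sample row, not taken from the BU data.
example = to_template_row({"country": "Kenya", "Travel History": "UK", "Source": "MoH"})
print(example["notes_for_discussion"])  # Travel History=UK; Source=MoH
```

Keeping the original column name in each pair means the information can be pulled back out later if the template gains a matching column.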


vukosim commented on July 18, 2024

@ensoesie Please take a look, and see if you have some hands at BU who can double-check before we fuse the datasets with the repo.


ensoesie commented on July 18, 2024

The data does not seem to have been extracted correctly. For example, all the cases are missing date of confirmation.


anelda commented on July 18, 2024

> The data does not seem to have been extracted correctly. For example, all the cases are missing date of confirmation.

Hi @ensoesie, thanks for checking the data integrity.

There are several columns for dates in the template - the first column is called date and I don't know what date is supposed to go in there. @vukosim maybe you can clarify for people who are putting together linelists what this date column is referring to please?

Later in the file there are date columns called:

  • date_onset_symptoms
  • date_admission_hospital
  • date_confirmation

These columns are completed if the original file contained the data in an obvious way. Some of the info may still be hidden in the notes_for_discussion column.

I've checked a few countries' files and they all contain the data for date of confirmation.
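A check like this can also be scripted. The sketch below lists the rows of a line list CSV whose date_confirmation is blank and splits a notes_for_discussion value back into name/value pairs so that any date hidden there can be inspected. The "; " separator and the sample rows are assumptions for illustration, not the actual file contents.

```python
import csv
import io


def missing_confirmation_rows(csv_text):
    """Return 1-based data-row numbers whose date_confirmation is blank."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        i
        for i, row in enumerate(reader, start=1)
        if not (row.get("date_confirmation") or "").strip()
    ]


def parse_notes(notes, sep="; "):
    """Split a string of name=value pairs into a dict (separator is assumed)."""
    pairs = (item.split("=", 1) for item in notes.split(sep) if "=" in item)
    return {name.strip(): value.strip() for name, value in pairs}


# Made-up sample with one complete row and one row lacking a confirmation date.
sample = (
    "date,date_confirmation,notes_for_discussion\n"
    ",2020-03-13,Source=MoH\n"
    ",,Source=MoH; Status=imported\n"
)
missing = missing_confirmation_rows(sample)
notes = parse_notes("Source=MoH; Status=imported")
print(missing)  # [2]
print(notes)    # {'Source': 'MoH', 'Status': 'imported'}
```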


esube commented on July 18, 2024


esube commented on July 18, 2024

Also, we shouldn't directly merge that data with linelists right now. How about keeping them as separate linelists in different folders so we can allow folks writing analytics scripts and notebooks to load from these multiple sources to merge during analytics time? This will help avoid many conflicts in the linelists


anelda commented on July 18, 2024

> • the first date is the date the cases are reported by the health departments; this is usually the same date as the confirmation, but not necessarily in all cases.

Thanks @esube! So the raw BU data has a column called Date Reported/Confirmed which in your explanation fits with both the date and the date_confirmation columns in the line list template but we wouldn't know which one. I've decided to write all those dates to the date_confirmation column but we can change that. What do you think? It's easy enough to change in the script and upload the new data.

There is a lot of data missing based on the differences between the line list template and the BU data format but I guess what is there can always be updated from other sources. It's a good start?


anelda commented on July 18, 2024

> Also, we shouldn't directly merge that data with linelists right now. How about keeping them as separate linelists in different folders so we can allow folks writing analytics scripts and notebooks to load from these multiple sources to merge during analytics time? This will help avoid many conflicts in the linelists

Maybe it's easy enough to merge while the number of cases is still low for most of the African countries and it will just add a layer of complexity as time goes by? I can take a stab at it before the weekend. It's only 13 countries that will need merging (there are only line lists for 13 at the moment including SA and I don't think we have to update SA as we can get that from the covid19sa repo which is probably the most trustworthy source for SA data at the moment?)


anelda commented on July 18, 2024

On the other hand, as time goes by, various lists may emerge and maybe it is a good idea for developers to include the option to merge lists as part of their workflow right from the start? I don't know... What do you think?


ensoesie commented on July 18, 2024

@esube is correct. If the date of report is not listed, use the date of confirmation.


ensoesie commented on July 18, 2024

For countries that currently have no line list, this file could be used to start data collection for those countries.


esube commented on July 18, 2024

@anelda there are two ways we can do this:

  1. We can track this data separately, as a separate line list in a different folder. I am trying to restructure the repo to accommodate multiple data sources that are tracked separately and can be merged once they have matured. Instead of using separate dev branches, I am restructuring master. I will push the changes today, and hopefully that will make it easier to track multiple sources for the line list. As @ensoesie said above, with this approach we can still start from the BU data if a country doesn't exist in the current line lists.

  2. We can merge the BU data directly to the existing line lists.

My preference is the first.

Re: I am also locally standardizing the column names and adding to the README what each column means. I will update that today and tomorrow


esube commented on July 18, 2024

@anelda I have restructured the data folder to make it more intuitive and to start standardizing formats, headers, documentation, etc. Now, there is a folder under data/line_lists/bu_line_lists. You can use that folder to have line lists that may be duplicate with country line lists that exist in data/line_lists/current.

  • For countries that don't currently have line lists, feel free to move the BU data to the current folder.
  • See data/time_series/ and data/time_series/africa_cdc for an example of redundant information. I am extracting daily time-series data from the Africa CDC Twitter page using scripts/update.py. Other contributors are also maintaining a separate set of time series that they update from different sources. Merging these would create a huge mess, as one is updated automatically and the others are probably updated manually. Instead, the two sets of time series can complement each other, each filling in data missing from the other. It is very easy to load both sets in a Python script or notebook and merge the data on the fly during analysis. I strongly believe it is beneficial to keep these separate as data sources so that missing data is clearly visible and gaps can be filled in a complementary fashion.

However, this is all just a suggestion; feel free to organize the BU data in whatever way you see as most appropriate.
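The merge-at-analysis-time idea can be illustrated with a small Python sketch. It assumes each source has already been loaded into a {date: count} mapping with ISO-formatted date strings; the function name, the variable names, and the numbers are made up for illustration and do not come from the repo.

```python
def fill_gaps(primary, secondary):
    """Combine two {date: count} series, preferring values from primary.

    A date is taken from secondary only when primary has no value for it
    (missing key or None), so neither source is modified.
    """
    merged = {}
    for day in sorted(set(primary) | set(secondary)):
        value = primary.get(day)
        merged[day] = value if value is not None else secondary.get(day)
    return merged


# Hypothetical series: one scraped automatically, one maintained by hand.
auto_series = {"2020-03-20": 12, "2020-03-21": None, "2020-03-22": 20}
manual_series = {"2020-03-21": 15, "2020-03-23": 25}

merged = fill_gaps(auto_series, manual_series)
print(merged)
# {'2020-03-20': 12, '2020-03-21': 15, '2020-03-22': 20, '2020-03-23': 25}
```

Because the combination happens in memory, the files on disk stay untouched and each source's gaps remain visible.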


anelda commented on July 18, 2024

> @esube is correct. If the date of report is not listed, use the date of confirmation.

@esube @ensoesie just to confirm what is required here. The BU data contains a single column called Date Reported/Confirmed. There is therefore no way to discriminate between date reported and date confirmed from the original data to have information in both target columns in the template (i.e. date and date_confirmation). Should I copy the same date from the original dataset in the Date Reported/Confirmed column to both the target columns in the template?

If you prefer this, I can update the files this morning and add the line lists here.


anelda commented on July 18, 2024

> @anelda I have restructured the data folder to make it more intuitive and to start standardizing formats, headers, documentation, etc. Now, there is a folder under data/line_lists/bu_line_lists. You can use that folder to have line lists that may be duplicate with country line lists that exist in data/line_lists/current.

Thanks @esube for explaining your workflows and for updating the folder structure. This looks very tidy now and I think newcomers to the repo would also find it easy to understand. I will gladly add the BU data to your bu_line_lists folder. Shall I add the original Excel file there as well for reference?

I can also add the line lists from this source to the parent line_lists folder where a line list for a country does not yet exist.
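Finding the countries that still lack a line list could be sketched as a comparison of the two folders. The snippet assumes that a country's CSV has the same file name in both folders, which the thread does not confirm; the demo runs on temporary directories with made-up file names rather than the real repo.

```python
import tempfile
from pathlib import Path


def files_missing_from(bu_dir, current_dir):
    """Return sorted CSV file names present in bu_dir but absent from current_dir."""
    bu_names = {p.name for p in Path(bu_dir).glob("*.csv")}
    current_names = {p.name for p in Path(current_dir).glob("*.csv")}
    return sorted(bu_names - current_names)


# Demo on throwaway folders; file names are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    bu = Path(tmp, "bu_line_lists")
    current = Path(tmp, "current")
    bu.mkdir()
    current.mkdir()
    for name in ("ghana.csv", "kenya.csv"):
        (bu / name).touch()
    (current / "kenya.csv").touch()
    new_countries = files_missing_from(bu, current)

print(new_countries)  # ['ghana.csv']
```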

I'll wait to hear back from you about my previous question before proceeding.


esube commented on July 18, 2024


esube commented on July 18, 2024

> > @esube is correct. If the date of report is not listed, use the date of confirmation.

> @esube @ensoesie just to confirm what is required here. The BU data contains a single column called Date Reported/Confirmed. There is therefore no way to discriminate between date reported and date confirmed from the original data to have information in both target columns in the template (i.e. date and date_confirmation). Should I copy the same date from the original dataset in the Date Reported/Confirmed column to both the target columns in the template?

Yes, please. In some linelists, we use the same date for those. So, feel free to copy and fill the two columns in the target with same date.
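The agreed mapping is simple enough to show in a few lines of Python. This is only a sketch: the source column name Date Reported/Confirmed and the target names date and date_confirmation come from the thread, while the function name and the sample value are hypothetical.

```python
SOURCE_DATE_COLUMN = "Date Reported/Confirmed"


def copy_report_date(source_row):
    """Fill both template date fields from the single BU date column."""
    reported = (source_row.get(SOURCE_DATE_COLUMN) or "").strip()
    return {"date": reported, "date_confirmation": reported}


# Made-up sample value, not taken from the BU data.
dates = copy_report_date({"Date Reported/Confirmed": " 2020-03-14 "})
print(dates)  # {'date': '2020-03-14', 'date_confirmation': '2020-03-14'}
```

If a later source distinguishes the two dates, only the date column needs to be overwritten.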


vukosim commented on July 18, 2024

MVP @anelda. @ensoesie Thank you and do suggest next steps.


vukosim commented on July 18, 2024

Thanks all. Closing this.

