Comments (22)
Fantastic, @anelda. Will check this afternoon.
from covid19africa.
@vukosim I can take some time this morning to write a script for the conversion from the BU data to CSV line lists per country. Can give feedback by 10am to confirm progress.
@anelda Hey thank you.
The country line lists are in https://github.com/anelda/africa_covid - you'll find them in the data/country_line_lists folder.
- I used OpenRefine 3.2 and R to clean it up and separate into different files.
- The columns that are present in the BU data but not in the line_list templates are all merged into the notes_for_discussion column with column_name=column_value.
- If you cast a quick eye over this, I can look at merging with existing data over the weekend or later.
- If someone else wants to take a stab at merging with existing data, that would be great!
- I can create a pull request with all countries' CSVs by making a sub-folder bu-linelists in the data folder https://github.com/dsfsi/covid19africa/tree/master/data if that would make it easier for now. Once the data is merged, we can delete that subfolder again?
Let me know if you find errors or want anything to change.
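For anyone who wants to reproduce the column_name=column_value step, here is a minimal Python sketch of the idea. It is not the OpenRefine/R workflow used above. The TEMPLATE_COLUMNS list is a hypothetical subset of the template, and the "; " separator between pairs is an assumption.

```python
# Hypothetical subset of the line list template columns (assumption for illustration).
TEMPLATE_COLUMNS = ["case_id", "country", "date_confirmation", "notes_for_discussion"]


def pack_extra_columns(row, template_columns=TEMPLATE_COLUMNS):
    """Return a template-shaped row; non-template fields go into notes_for_discussion."""
    # Collect every non-empty field that the template has no column for.
    extras = [
        f"{name}={value}"
        for name, value in row.items()
        if name not in template_columns and value not in ("", None)
    ]
    # Keep only template columns, defaulting to an empty string when absent.
    packed = {name: row.get(name, "") for name in template_columns}
    existing = packed.get("notes_for_discussion", "")
    # The "; " separator is an assumption, not taken from the real files.
    packed["notes_for_discussion"] = "; ".join(filter(None, [existing, *extras]))
    return packed
```

Empty source fields are skipped so the notes column only carries values that actually exist in the BU data.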
@ensoesie Please take a look, and if you have some hands at BU, could they double-check before we fuse the datasets with the repo?
The data does not seem to have been extracted correctly. For example, all the cases are missing date of confirmation.
The data does not seem to have been extracted correctly. For example, all the cases are missing date of confirmation.
Hi @ensoesie, thanks for checking the data integrity.
There are several columns for dates in the template - the first column is called date and I don't know what date is supposed to go in there. @vukosim maybe you can clarify for people who are putting together linelists what this date column is referring to please?
Later in the file there are date columns called:
date_onset_symptoms
date_admission_hospital
date_confirmation
These columns are completed if the original file contained the data in an obvious way. Some of the info may still be hidden in the notes_for_discussion column.
I've checked a few countries' files and they all contain the data for date of confirmation.
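A quick way for others to repeat this check is to count empty date_confirmation cells per file. The sketch below is only an illustration: count_missing is a hypothetical helper, and the sample rows and their date format are made up rather than taken from the real line lists.

```python
import csv
import io


def count_missing(csv_text, column="date_confirmation"):
    """Return (missing, total) for one column of a line list given as CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = total = 0
    for row in reader:
        total += 1
        # Treat an absent column, an empty cell, or whitespace as missing.
        if not (row.get(column) or "").strip():
            missing += 1
    return missing, total


# Made-up sample data for illustration only.
sample = (
    "case_id,date,date_confirmation\n"
    "1,,13.03.2020\n"
    "2,,\n"
)
missing, total = count_missing(sample)
print(f"{missing} of {total} rows lack date_confirmation")
```

To run it on a real file, read the file's text first and pass it in; a result where missing equals total would match the problem reported above.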
Also, we shouldn't directly merge that data with linelists right now. How about keeping them as separate linelists in different folders, so that folks writing analytics scripts and notebooks can load from these multiple sources and merge at analysis time? This will help avoid many conflicts in the linelists.
- The first date is the date the cases are reported by the health departments; this is usually the same date as the confirmation, but not necessarily.
Thanks @esube! So the raw BU data has a column called Date Reported/Confirmed which in your explanation fits with both the date and the date_confirmation columns in the line list template but we wouldn't know which one. I've decided to write all those dates to the date_confirmation column but we can change that. What do you think? It's easy enough to change in the script and upload the new data.
There is a lot of data missing based on the differences between the line list template and the BU data format but I guess what is there can always be updated from other sources. It's a good start?
Also, we shouldn't directly merge that data with linelists right now. How about keeping them as separate linelists in different folders, so that folks writing analytics scripts and notebooks can load from these multiple sources and merge at analysis time? This will help avoid many conflicts in the linelists.
Maybe it's easy enough to merge while the number of cases is still low for most of the African countries and it will just add a layer of complexity as time goes by? I can take a stab at it before the weekend. It's only 13 countries that will need merging (there are only line lists for 13 at the moment including SA and I don't think we have to update SA as we can get that from the covid19sa repo which is probably the most trustworthy source for SA data at the moment?)
On the other hand, as time goes by, various lists may emerge and maybe it is a good idea for developers to include the option to merge lists as part of their workflow right from the start? I don't know... What do you think?
@esube is correct. If the date of report is not listed, use the date of confirmation.
For countries that currently have no line list, this file could be used to start data collection for those countries.
@anelda there are two ways we can do this:
- We can track this data separately, as a separate line list in a different folder. I am trying to restructure the repo to accommodate multiple data sources that are tracked separately and can be merged once they have matured. Instead of separate dev branches, I am restructuring the master. I will push the changes today, and hopefully that will make it easier to track multiple sources for the line list. As @ensoesie said above, with this approach, we can still start from the BU data if a country doesn't exist in the current line lists.
- We can merge the BU data directly into the existing line lists.
My preference is the first.
Re: I am also locally standardizing the column names and adding to the README what each column means. I will update that today and tomorrow.
@anelda I have restructured the data folder to make it more intuitive and to start standardizing formats, headers, documentation, etc. Now, there is a folder under data/line_lists/bu_line_lists. You can use that folder to have line lists that may be duplicate with country line lists that exist in data/line_lists/current.
- For countries that don't currently have line lists, feel free to move the BU data to the current folder.
- See data/time_series/ and data/time_series/africa_cdc for an example of redundant information. I am extracting daily time-series data from the Africa CDC Twitter page using scripts/update.py. Other contributors are also maintaining separate sets of time series that they update from different sources. Merging these would create a huge mess, as one is updated automatically and the others are probably updated manually. Instead, the two sets of time series could be complementary to each other, each filling in data missing from the other. It is very easy to load both sets using a Python script or notebook and merge the data on the fly during analysis. I strongly believe it is beneficial to keep these separate as data sources for analysis, so that missing data is clearly visible and gaps can be filled in a complementary fashion.
However, this is all just a suggestion; feel free to place the BU data in whatever way you see as more appropriate.
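As a rough illustration of the merge-at-analysis-time idea, the sketch below combines two series and lets one fill the gaps in the other. The fill_gaps helper, the (country, date) keys, and all the numbers are made up for this example; the real files in data/time_series/ may be shaped differently.

```python
def fill_gaps(primary, secondary):
    """Merge two {(country, date): count} series, preferring primary values.

    A value of None marks a missing observation. Neither input is modified.
    """
    merged = dict(secondary)
    for key, value in primary.items():
        # Use the primary value unless it is missing and secondary has the key.
        if value is not None or key not in merged:
            merged[key] = value
    return merged


# Made-up numbers for illustration only.
africa_cdc = {("Kenya", "2020-03-20"): 7, ("Kenya", "2020-03-21"): None}
manual = {("Kenya", "2020-03-21"): 7, ("Kenya", "2020-03-22"): 15}
combined = fill_gaps(africa_cdc, manual)
```

Because both inputs stay untouched on disk and in memory, it remains visible which source supplied which value, which is the point of keeping the folders separate.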
@esube is correct. If the date of report is not listed, use the date of confirmation.
@esube @ensoesie just to confirm what is required here. The BU data contains a single column called Date Reported/Confirmed. There is therefore no way to discriminate between date reported and date confirmed from the original data to have information in both target columns in the template (i.e. date and date_confirmation). Should I copy the same date from the original dataset in the Date Reported/Confirmed column to both the target columns in the template?
If you prefer this, I can update the files this morning and add the line lists here.
@anelda I have restructured the data folder to make it more intuitive and to start standardizing formats, headers, documentation, etc. Now, there is a folder under data/line_lists/bu_line_lists. You can use that folder to have line lists that may be duplicate with country line lists that exist in data/line_lists/current.
Thanks @esube for explaining your workflows and for updating the folder structure. This looks very tidy now and I think newcomers to the repo would also find it easy to understand. I will gladly add the BU data to your folder bu_line_lists. Shall I add the original Excel file there as well for reference?
I can also add the line lists from this source to the parent line_lists folder where a line list for a country does not yet exist.
I'll wait to hear back from you about my previous question before proceeding.
@esube is correct. If the date of report is not listed, use the date of confirmation.
@esube @ensoesie just to confirm what is required here. The BU data contains a single column called Date Reported/Confirmed. There is therefore no way to discriminate between date reported and date confirmed from the original data to have information in both target columns in the template (i.e. date and date_confirmation). Should I copy the same date from the original dataset in the Date Reported/Confirmed column to both the target columns in the template?
Yes, please. In some linelists, we use the same date for those. So, feel free to copy and fill the two columns in the target with same date.
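The agreed rule, one source date copied into both target columns, could look roughly like this in Python. The map_bu_dates helper is hypothetical and is not the conversion script used for the BU data; only the three column names come from this thread.

```python
def map_bu_dates(bu_row, source_column="Date Reported/Confirmed"):
    """Copy the single BU date into both template date columns."""
    # Strip stray whitespace; an absent or empty cell becomes an empty string.
    reported = (bu_row.get(source_column) or "").strip()
    return {"date": reported, "date_confirmation": reported}
```

The returned dictionary can be merged into the rest of a template row, so a later decision to treat the two columns differently only needs a change in this one place.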
MVP @anelda. @ensoesie Thank you and do suggest next steps.
Thanks all. Closing this.