usepa / flowsa

Library that attributes resource use, waste, emissions, and loss to economic sectors

License: MIT License

Python 100.00%

flowsa's Introduction

flowsa

flowsa is a data processing library attributing the flows of resources (environmental, monetary, and human), wastes, emissions, and losses to sectors, typically NAICS codes. flowsa aggregates, combines, and allocates data from a variety of sources. The sources can be found in the GitHub wiki under "Flow-By-Activity Datasets".

flowsa helps support USEEIO as part of the USEEIO modeling framework. The USEEIO models estimate potential impacts of goods and services in the US economy. The Flow-By-Sector datasets created in FLOWSA are the environmental inputs to useeior.

Usage

Flow-By-Activity (FBA) Datasets

Flow-By-Activity datasets are formatted tables from a variety of sources. They are largely unchanged from the original data source, except for formatting. A list of available FBA datasets can be found in the Wiki.

import flowsa
# Return list of all available FBA datasets, including years
flowsa.seeAvailableFlowByModels('FBA')
# Generate and return pandas dataframe for 2014 Energy Information Administration (EIA) Manufacturing Energy Consumption Survey (MECS) land use
fba = flowsa.getFlowByActivity(datasource="EIA_MECS_Land", year=2014)

Flow-By-Sector (FBS) Datasets

Flow-By-Sector datasets are tables of environmental and other data attributed to sectors. A list of available FBS datasets can be found in the Wiki.

import flowsa
# Return list of all available FBS datasets
flowsa.seeAvailableFlowByModels('FBS')
# Generate and return pandas dataframe for national water withdrawals attributed to 6-digit sectors.
# Download all required FBA datasets from Data Commons.
fbs = flowsa.getFlowBySector('Water_national_2015_m1', download_FBAs_if_missing=True)

Examples

Additional example code can be found in the examples folder.

Installation

pip install git+https://github.com/USEPA/flowsa@vX.X.X#egg=flowsa

where vX.X.X can be replaced with the version you wish to install under Releases.

Additional Information on Installation, Examples, Detailed Documentation

For more information on flowsa see the wiki.

Accessing datasets output by FLOWSA

FBA and FBS datasets can be accessed on EPA's Data Commons without running the Python code.

Disclaimer

The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.


flowsa's Issues

esupy 'dependency conflict'

Howdy,
I'm receiving the following error when following the installation instructions.

Collecting esupy@ git+git://github.com/USEPA/esupy@v0.1.7#egg=esupy
  Cloning git://github.com/USEPA/esupy (to revision v0.1.7) to /tmp/pip-install-v0p7lwqy/esupy_0e824dc63ad04367b1ae3e24b3bdde5d
  Running command git clone -q git://github.com/USEPA/esupy /tmp/pip-install-v0p7lwqy/esupy_0e824dc63ad04367b1ae3e24b3bdde5d
  Running command git checkout -q c04efa5aefc82a317776a2b2b1b20fae01b5fce7
INFO: pip is looking at multiple versions of fedelemflowlist to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of esupy to determine which version is compatible with other requirements. This could take a while.
INFO: pip is looking at multiple versions of flowsa to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install flowsa and flowsa==0.2.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    flowsa 0.2.1 depends on esupy 0.1.7 (from git+https://github.com/USEPA/esupy@v0.1.7#egg=esupy)
    stewi 0.9.9 depends on esupy 0.1.7 (from git+git://github.com/USEPA/esupy@v0.1.7#egg=esupy)

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

steps to recreate

$ mkdir flowsa-test
$ python3 -m venv flowsa-venv
$ source flowsa-venv/bin/activate
$ pip install git+https://github.com/USEPA/flowsa

pip version 21.0.1
python version 3.9.6
Following the installation instructions outside of a virtual environment was also unsuccessful.

I was able to install a fork successfully by changing 'git' to 'https' in setup.py, but then I ran into more import problems and wasn't sure whether I was causing more problems than I was fixing.

Need warning when attempting to obtain data for unavailable year

fba = flowsa.getFlowByActivity(datasource="EIA_MECS_Energy", year=2017)
returns a bad zip file error because 2017 doesn't exist for EIA_MECS

Perhaps flowbyactivity could check against the list of years in the FBA yaml. This would also help prevent errors when new data are released (e.g. 2018 for EIA_MECS) but not yet tested
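
The check suggested above could look roughly like the sketch below. It assumes the years listed in the FBA yaml have already been loaded into a plain list; `check_year_available` and the example years are hypothetical and are not part of flowsa.

```python
import warnings


def check_year_available(datasource, year, available_years):
    """Return True if `year` is listed for `datasource`, else warn and return False.

    `available_years` stands in for the list of years read from the FBA yaml.
    """
    if year in available_years:
        return True
    warnings.warn(
        f"{year} is not listed for {datasource}; "
        f"available years: {sorted(available_years)}"
    )
    return False


# Illustrative values only, not the actual contents of the EIA_MECS yaml
mecs_years = [2010, 2014]
if check_year_available("EIA_MECS_Energy", 2014, mecs_years):
    print("safe to request the 2014 dataset")
```

Warning before the download is attempted would replace the bad zip file error with a message that names the years that can actually be requested.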

consider moving methods to separate folder

I find the methods can get a little bit lost within the data folder next to many of the other subfolders. Would it make sense to move flowbysectormethods and flowbyactivitymethods up a level into a folder called methods?

flowbysector: Handle cases when BLS employees are missing for a sector-geography but there are flows

Underutilization of `.yaml`s

Raising this as an issue now for awareness and discussion. I think we are currently hardcoding too many things in the data source script .py files and not making enough use of the .yaml files. For example, information on how to modify the NEI data url for different years, or information on how columns should be renamed (again, I'm thinking about the NEI data source in particular at the moment) and how the renaming depends on the year, could be included as dicts in the EPA_NEI.yaml files.

One advantage of offloading this information to the .yaml files is that it should simplify the process of adding new data years (e.g. the 2020 NEI, or 2018 and 2019 EQUATES data).

I'm working on the FBA methods and scripts for the EQUATES data right now, and I'll link to them once I'm finished, by way of an example of what I'm thinking.

requirements.txt compatibility error between flowsa and esupy

Differences in package version requirements between flowsa and esupy throw errors
See:
https://github.com/USEPA/esupy/blob/main/requirements.txt
and
https://github.com/USEPA/flowsa/blob/master/requirements.txt

@WesIngwersen

URL concatenation missing '&' when forming Quickstats

When indirectly calling on the code to generate the FBA for the CoA Cropland, it creates this URL where the xxxx is a valid API key

2022-01-11 10:50:10 INFO     Calling https://quickstats.nass.usda.gov/api/api_GET/?key=xxxxEsource_desc=CENSUS&sector_desc=ECONOMICS&statisticcat_desc=AREA%26statisticcat_desc%3DAREA+OPERATED&commodity_desc=AG+LAND%26commodity_desc%3DFARM+OPERATIONS&unit_desc=ACRES%26unit_desc%3DOPERATIONS&agg_level_desc=NATIONAL&year=2017
ERROR Error in URL request!

I can see that after the key value there is no ampersand to indicate the next URL parameter
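
One way to avoid this class of bug is to let `urllib.parse.urlencode` join the parameters, since it inserts the `&` separators itself. The sketch below uses a few parameter names from the logged URL; `build_quickstats_url` is a hypothetical helper, not the function flowsa uses.

```python
from urllib.parse import urlencode

BASE_URL = "https://quickstats.nass.usda.gov/api/api_GET/"


def build_quickstats_url(api_key, params):
    """Join the key and the remaining parameters with '&' via urlencode."""
    query = urlencode({"key": api_key, **params})
    return f"{BASE_URL}?{query}"


url = build_quickstats_url(
    "xxxx",
    {"source_desc": "CENSUS", "agg_level_desc": "NATIONAL", "year": 2017},
)
print(url)
```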

Error when all flows are 0 for an activity

For a particular activity, all flows are reported as zero. When that flow_subset is passed to agg_by_geoscale (here), all flows that are 0 get dropped after aggregating. This generates an error later when sectors are added to that flow_subset.

Location for EIA_MECS_Land

Location is hard coded but needs to use the variable instead of string
https://github.com/USEPA/flowsa/blob/master/flowsa/EIA_MECS.py#L157

Check that NAICS coming in from NAICS like sources is present in our NAICS list for the given year

This is an issue discovered by @bl-young with a NAICS code coming in from RCRA via stewi that was not a valid 2012 NAICS code.
Related issue is here USEPA/useeior#83

In any NAICS like sources we need to check that NAICS codes are in our NAICS code list. If they are not present, we probably need to check if they are present in an older NAICS schema and see if we can apply a mapping to get it into the current NAICS schema (2012 at the moment)
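
A minimal sketch of the first step, flagging incoming codes that are missing from the NAICS list. The helper name and the sample codes are made up for illustration; mapping from an older NAICS schema is not shown.

```python
def find_invalid_naics(codes, valid_codes):
    """Return the codes, in input order, that are absent from the valid list."""
    valid = set(valid_codes)
    return [code for code in codes if code not in valid]


# Made-up values for illustration only
valid_list = ["111110", "111120", "221100"]
incoming = ["111110", "999999", "221100"]
print(find_invalid_naics(incoming, valid_list))
```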

log files saved with _None in filename

If no git hash is identified, the log files are stored as e.g.,:
CRHW_national_2017_v0.3_None.log

When saving files in esupy there is a check first for this parameter before appending to the name.
https://github.com/USEPA/esupy/blob/main/esupy/processed_data_mgmt.py#L242-L243
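
A sketch of the same kind of guard applied to log file names: the hash is appended only when one was found. `build_log_name` is a hypothetical helper, and the name layout is inferred from the example file name above.

```python
def build_log_name(method, version, git_hash=None):
    """Join the name parts with '_' and skip the hash when it is missing."""
    parts = [method, f"v{version}"]
    if git_hash:  # None or an empty string is left out
        parts.append(git_hash)
    return "_".join(parts) + ".log"


print(build_log_name("CRHW_national_2017", "0.3"))
print(build_log_name("CRHW_national_2017", "0.3", "abc1234"))
```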

Air transportation emissions (NEI) misaligned

The NEI captures emissions from landings and takeoffs (LTO) which are assigned to airports in the NEI point dataset. In most cases, these would end up assigned to NAICS that would get assigned to 48A000 - Scenic and sightseeing transportation and support activities for transportation. Instead they should be assigned to Air Transportation (481000)

See NEI TSD: https://www.epa.gov/sites/production/files/2018-06/documents/nei2014v2_tsd_09may2018.pdf; section 3.2

Emissions from LTO are noted by specific SCCs, primarily 2275020000

pandas.np deprecation

Starting here and used in this function is pd.np.where

flowsa/flowsa/USGS_NWIS_WU.py

Line 105 in 72b7b18

 df.loc[:, 'FlowName'] = pd.np.where(df.Description.str.contains("fresh"), "fresh", 

It's throwing a warning of future deprecation.
FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead

datachecks FBA/FBS comparison throwing error

I'm getting an "error occurred" message in this try/except here

module not found error

I believe the nested structure of the .py files leads to import issues
see: https://github.com/USEPA/flowsa/runs/3907368775?check_suite_focus=true

I think this can be resolved by including an __init__.py in each subdirectory (e.g. data_source_scripts)

https://sweetcode.io/python-file-importation-multi-level-directory-modules-packages/

USDA_CoA_Cropland has mixed flow classes

The OPERATIONS flows are not a unit of Land, but rather of Other, so these should be stored in a separate parquet file. At this point flowsa doesn't support storing more than one class together in a parquet. If that needs to change, then we need to change the source catalog to hold a list for class instead of a string.

materialflowlist import error

Running "CRHW_national_2017" returns error:

import materialflowlist as mfl

ModuleNotFoundError: No module named 'materialflowlist'

@bl-young We need to add materialflowlist to package requirements/setup. Do you have a specific version? Or did you want to make this package an optional requirement?

Missing function disaggregate_usda_coa_cropland_naics

flowsa/flowsa/flowbysector.py

Line 38 in 9a22cb0

from flowsa.USDA_CoA_Cropland_NAICS import disaggregate_usda_coa_cropland_naics

from flowsa.USDA_CoA_Cropland_NAICS import disaggregate_usda_coa_cropland_naics
ImportError: cannot import name 'disaggregate_usda_coa_cropland_naics' from 'flowsa.USDA_CoA_Cropland_NAICS'

Consider only allowing getFlowByActivity to take a single year and use an int

As of now, this function takes a list of years, which is messy when we don't have any use cases for multiple years.
It would be just as easy for the user to call the function multiple times for multiple years (like in a loop) and concatenate.

If it takes a single year it will also be more consistent with getFlowBySector.

get fba subset issue

When using USDA crop data as an allocation source, we need to aggregate down from 7 --> 6 digit NAICS for cases when the crosswalk is already based on a six-digit NAICS.

Similarly, we need to aggregate up from 5 --> 6.

That is, an activity that is split between two NAICS in the crosswalk, e.g. 111140 and 111920, will get missed in get_fba_allocation_subset because it only appears as 11192.
Likewise, a crosswalk entry for 111120 will get missed because it only appears as 111120A.

NAICS in stewicombo highly sensitive to FRS data

Assigning stewicombo data to sectors uses FRS NAICS assignments. In some cases, multiple NAICS are listed for an individual dataset. The first listed NAICS for a specific inventory in the FRS system is used.

In some cases the available NAICS are wildly different.

flowsa/flowsa/data_source_scripts/stewiFBS.py

Lines 86 to 94 in c31cd44

# use NAICS from facility matcher so drop them here
facility_mapping.drop(columns=['NAICS'], inplace=True)
# merge dataframes to assign facility information based on facility IDs
df = pd.merge(df, facility_mapping, how='left',
              on='FacilityID')

all_NAICS = obtain_NAICS_from_facility_matcher(inventory_list)
df = pd.merge(df, all_NAICS, how='left', on=['FRS_ID', 'Source'])

version is outdated

The version number in setup still shows v0.0.1 even though there is a tag/release for a previous version, v0.0.2

flowsa/setup.py

Line 5 in c6032b8

version='0.0.1',

Consider when to convert units and names in flowbysector

We are still trying to decide the best way to handle both the unit conversions of flows and the mapping to the FEDEFL in the flowbysector logic

@catherinebirney

Transportation Satellite Account FBA not generating properly

I've verified that the TSA data frame is correctly parsed by the tsa_parse() function from BTS_TSA.py, but then the FBA is empty. So somewhere between the data frame being parsed and the final FBA being written, all the rows are dropped. I've spent some time on it, but I may need some help figuring out this issue.

flownames in EPA GHG fbas

FlowNames stored in the GHG FBAs need to be corrected.
FlowNames in the GHGI are generally CO2, CH4, and N2O for the main gases.

Examples reported from the stored FBA files:
in T_2_1: Recent_Trends... this must be replaced with the gas name, as above
in T_3_10: CH4 Emissions from Stationary Combustion should be CH4

These all need to be reviewed and fixed.
fyi @catherinebirney

Missing county-level FIPS code

"02270" "46113" "51515" exist in BEA State Employment but not in /flowsa/data/FIPS.csv.
@WesIngwersen I will work with @catherinebirney to investigate this.

XLRD issue.

File "C:\Users\MelissaC\Envs\flowsa\lib\site-packages\xlrd\__init__.py", line 170, in open_workbook
raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

Occurred when running --year 2012 --source EIA_CBECS_Land.

Collapse SectorsProducedBy and ConsumedBy into single sector as dictated by the method

For the common use case of satellite tables in USEEIO, the need is generally to have one sector associated with a flow record.

Remove class param from flowbyactivity

The class parameter is in a list format that is awkward and perhaps not necessary to have in FBA. We could easily remove the class requirement, and users could apply a simple subset command if they wanted to get only one class.

dynamically_import_fxn breaking for stewi

Accessing the FBS_outside_flowsa functions through the dynamic import is failing for stewi-related data. I think the YAMLs need to be revised

ModuleNotFoundError: No module named 'flowsa.data_source_scripts.stewicombo'

Allow for more than 2 FBAs in allocation of an FBS activity set

The current method only allows "allocation_source" and "helper_source" FBAs for FBS activity set allocation. Modify the FBS yaml to allow an unlimited number of FBAs to be called for allocation. This change would make the methodology more transparent and limit the number of FBAs hardcoded into cleaning functions.

Import error: No module named 'stewi.globals'; 'stewi' is not a package

After installing with pip, attempting to access from flowsa:
...
File "C:\Users\cbirney\git_projects\flowsa\flowsa\stewi.py", line 31, in stewicombo_to_sector
import stewicombo
File "C:\Users\cbirney\AppData\Local\Programs\Python\Python37\lib\site-packages\stewicombo\__init__.py", line 5, in <module>
from stewicombo.overlaphandler import aggregate_and_remove_overlap
File "C:\Users\cbirney\AppData\Local\Programs\Python\Python37\lib\site-packages\stewicombo\overlaphandler.py", line 3, in <module>
from stewi.globals import log
ModuleNotFoundError: No module named 'stewi.globals'; 'stewi' is not a package

It appears that duplicate globals.py could be causing an issue here?
cc: @catherinebirney

Add stewi requirement to setup and requirements

Since stewi is called it needs to be added. See the fedelemflowlist example for how to add a github package. Make sure the specific release is specified and not a >=.

Turn off logging from http requests

export log to txt

Given the amount of information in the logger, it would be nice to save this output after generating an FBS

FBS: some NAICS dropping

When expanding the NAICS list, more detailed sectors will get dropped if another entry in the mapping already maps to those sectors. Line 119-120

With NEI data, some activities are more specific than others. E.g. an SCC might apply to all Agriculture (NAICS: 111) while another SCC will apply to specific crops (NAICS: 11114). When both are present, the first SCC won't get assigned to 11114 because it drops out from the mapping.

My preference would be to exclude this step, but I don't know how that would impact other mappings. @catherinebirney

Possible infinite loop in common.py

In common.find_true_file_path(), if the directory, filename, and extension do not combine into a valid path, and if removing chunks of the filename following _ does not lead to a valid path (e.g. due to a typo, or an error in the directory or extension), the function can get stuck in an infinite loop.
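
A sketch of a search that is guaranteed to terminate, written from the description above rather than from the actual code in common.py. `find_existing_path` is a hypothetical name; the loop either shortens the name on every pass or stops.

```python
import os


def find_existing_path(directory, filename, extension):
    """Drop trailing '_'-separated chunks until a file exists, else return None."""
    name = filename
    while name:
        candidate = os.path.join(directory, name + extension)
        if os.path.isfile(candidate):
            return candidate
        if "_" not in name:
            break  # nothing left to strip, so stop instead of looping forever
        name = name.rsplit("_", 1)[0]
    return None
```

Returning None leaves it to the caller to decide whether a missing file is an error, which also surfaces typos in the directory or extension.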

flowbysector: FlowType needs to be correctly assigned

The Water_national_2015_m1 method is returning None for this field

FlowType needs to be one of the acceptable values; see the format specs

Water use m2 flowbysector yielding multiple NAICS levels even though target_sector_level is NAICS_6

includes for instance 2211 and 1125

applies to https://github.com/USEPA/flowsa/blob/master/flowsa/data/flowbysectormethods/Water_national_2010_m2.yaml
and
https://github.com/USEPA/flowsa/blob/master/flowsa/data/flowbysectormethods/Water_national_2015_m2.yaml

Add table of available flowbysector files to README

Need a table like that for flowbyactivity of all available flowbysector

Permit user passed FBS YAML

Allow flow-by-sector YAMLs to be kept outside of the package and passed for processing in a getFlowBySector() call. This will allow development of FBS methods outside of the main flowsa repo

pyarrow dependency conflicts

The pyarrow dependency conflicts with some other EPA packages. I'm not even sure it's needed anymore with esupy

parquet version number based on user's egg-info

The version number attached to an FBA or FBS parquet is based on the version in a user's directory flowsa.egg-info/PKG-INFO. The PKG-INFO is created when a user runs pip install -e flowsa.

To update egg-info, a user can run: python setup.py bdist_egg

Consider making an updateable version parameter that is called on when naming parquets?

@WesIngwersen @bl-young

parquet not readable across OS

Reported by @MoLi7
Read in of flowbyactivity parquet files failing with footer error when created by another user

Datachecks expecting files in output folder for comparison

Error occurs here

flowsa/flowsa/datachecks.py

Line 398 in c6032b8

 df_merge.to_csv(outputpath + "FlowBySectorMethodAnalysis/" + method_name + '_' + source_name + 

Flowbysector calls this function and it fails. A user would not be expected to have the file it looks for in the output folder

add exception for missing url

What is the case where the url_list is going to be empty?

Should this throw some kind of exception?

This check may be better off in assemble_urls or in main()

flowsa/flowsa/flowbyactivity.py

Line 124 in e81ecbe

if url_list[0] is not None:
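
One possible shape for such a check, as a sketch only: `require_urls` is a hypothetical helper and the choice of `ValueError` is an assumption. Note that `url_list[0]` on an empty list raises an `IndexError` before the `is not None` test is reached.

```python
def require_urls(url_list):
    """Return the usable URLs or raise if there are none."""
    urls = [url for url in (url_list or []) if url is not None]
    if not urls:
        raise ValueError("no URLs were assembled for this data source")
    return urls


print(require_urls(["https://example.com/data.csv", None]))
```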

Joblib cache arbitrarily located, can fail

If the joblib cache is located in '.cache', then it attempts to create the '.cache' directory in the current working directory, wherever that may be. If the user does not have permissions there, then importing flowsa fails.
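
A sketch of one way to choose the cache location defensively: try a preferred directory and fall back to a temporary one if it cannot be created or written to. `pick_cache_dir` and the `flowsa_cache_` prefix are hypothetical, and how the result is handed to joblib is not shown.

```python
import os
import tempfile


def pick_cache_dir(preferred):
    """Return `preferred` if it can be created and written to, else a temp directory."""
    try:
        os.makedirs(preferred, exist_ok=True)
        if os.access(preferred, os.W_OK):
            return preferred
    except OSError:
        pass  # e.g. no permission to create the directory
    return tempfile.mkdtemp(prefix="flowsa_cache_")
```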

Set requirements to specific versions

Per @a-w-beck's recommendation, set the packages in requirements.txt and setup.py to specific version numbers, rather than using ">=" or "<" to prevent flowsa code from breaking

Inconsistent use of source names

Ideally the sources need to be consistent across these uses:
in fba data 'SourceName' column
in the file names of the parquet
in the Crosswalk file names
in the Source Catalog

The major issue is that we generally have a provider, an inventory/report, and a specific table or flow type.

Make FBA cleaning functions optional in yaml

Currently these two parameters need to be set to None; if they are not present a KeyError is raised. As is done for some other parameters, the presence of these specs could be tested first. This would reduce unnecessary information in the FBS method files.

https://github.com/USEPA/flowsa/tree/master/flowsa/data/flowbysectormethods#source-specifications-in-fba-or-fbs-format
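
Testing for the keys first can be as simple as `dict.get`, which returns None when a key is absent instead of raising a KeyError. In the sketch below the two key names are assumptions used for illustration, and `get_cleaning_fxn_names` is a hypothetical helper.

```python
def get_cleaning_fxn_names(source_spec):
    """Read the optional cleaning-function keys, defaulting to None when absent."""
    return (
        source_spec.get("clean_fba_df_fxn"),
        source_spec.get("clean_fba_w_sec_df_fxn"),
    )


# A method entry that omits both keys no longer raises a KeyError
print(get_cleaning_fxn_names({"year": 2015}))
```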
