Hi! I'm facing multiple memory issues when using CausalForestDML cla

High memory footprint for big dataframes in CausalForest model (econml issue, open)

gabrieldaiha commented on July 28, 2024

High memory footprint for big dataframes in CausalForest model

Comments (3)

kbattocchi commented on July 28, 2024

One simple comparison that would be useful is how the memory consumption of standard sklearn RandomForests compares on dataframes of the same size, since much of the EconML tree code was forked from sklearn (version 0.24, I believe).

Although 180GB does seem excessive, I don't think it is really exponential - if your input has 40M floating point values, the raw data for that alone is 320MB, so this is ~560 times the size of your dataset. Certainly if we can easily optimize things to bring this down we should, but it's not even quadratic in the number of elements.
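For reference, the arithmetic behind those figures can be checked in a few lines of Python. This is only a back-of-the-envelope sketch. It assumes 8-byte float64 values and decimal units (1 GB = 1e9 bytes), neither of which is stated explicitly above.

```python
BYTES_PER_FLOAT64 = 8            # assumption: the data is stored as float64

n_values = 40_000_000            # "40M floating point values" from the comment
observed_bytes = 180e9           # the reported 180GB peak, in decimal GB

raw_bytes = n_values * BYTES_PER_FLOAT64
raw_mb = raw_bytes / 1e6
ratio = observed_bytes / raw_bytes

# For comparison: one float64 per PAIR of input values (quadratic growth)
quadratic_bytes = n_values ** 2 * BYTES_PER_FLOAT64
quadratic_tb = quadratic_bytes / 1e12

print(f"raw data:          {raw_mb:.0f} MB")
print(f"observed / raw:    {ratio:.1f}x")
print(f"quadratic storage: {quadratic_tb:.0f} TB")
```

The observed peak is a large constant multiple of the raw data, but it is orders of magnitude below what quadratic growth in the number of elements would require.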

You mention that memory is high for both fit and effect: do you mean that while running those methods memory usage spikes but then comes back down to a more reasonable amount when the method calls complete?
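One way to answer this kind of question is to record both the peak and the memory still held after the call returns. Below is a minimal sketch using the standard-library tracemalloc module. The helper name measure_memory and the stand-in function spiky are hypothetical and are not part of EconML. Note that tracemalloc only counts allocations made in the current process, so memory used in separate worker processes (if a process-based joblib backend were in use) would not show up.

```python
import tracemalloc


def measure_memory(fn, *args, **kwargs):
    """Run fn and return (result, peak_bytes, retained_bytes).

    peak_bytes is the highest traced allocation while fn ran;
    retained_bytes is what was still allocated right after it returned.
    """
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        retained, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak, retained


def spiky(n):
    # Hypothetical stand-in for fit/effect: large temporary, small result.
    temp = [float(i) for i in range(n)]
    return sum(temp)


result, peak, retained = measure_memory(spiky, 200_000)
print(f"peak: {peak / 1e6:.1f} MB, retained: {retained / 1e6:.3f} MB")
```

With a real model, the same helper could be called as measure_memory(est.effect, X), where est and X are your own fitted estimator and data. A peak far above the retained value corresponds to the "spikes, then comes back down" pattern.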

gdaiha commented on July 28, 2024

> You mention that memory is high for both fit and effect: do you mean that while running those methods memory usage spikes but then comes back down to a more reasonable amount when the method calls complete?

Yes, memory usage spikes, but then comes back down.

I'm still investigating fit, but in predict_point_and_var I found that the memory spike comes after the second Parallel call inside the var condition, so I think the spike probably originates in these lines:

EconML/econml/grf/_base_grf.py

Lines 703 to 763 in db1e254

        moment = np.mean(moment_bags, axis=0)

        trans_moment_bags = np.moveaxis(moment_bags, 0, -1)
        sq_between = np.einsum('tij,tjk->tik', trans_moment_bags,
                               np.transpose(trans_moment_bags, (0, 2, 1))) / len(slices)
        moment_sq = np.einsum('tij,tjk->tik',
                              moment.reshape(moment.shape + (1,)),
                              moment.reshape(moment.shape[:-1] + (1, moment.shape[-1])))
        var_between = sq_between - moment_sq
        pred_cov = np.einsum('ijk,ikm->ijm', invjac,
                             np.einsum('ijk,ikm->ijm', var_between, np.transpose(invjac, (0, 2, 1))))

        if project:
            pred_var = np.einsum('ijk,ikm->ijm', projector.reshape((-1, 1, projector.shape[1])),
                                 np.einsum('ijk,ikm->ijm', pred_cov,
                                           projector.reshape((-1, projector.shape[1], 1))))[:, 0, 0]
        else:
            pred_var = np.diagonal(pred_cov, axis1=1, axis2=2)

        #####################
        # Variance correction
        #####################
        # Subtract the average within bag variance. This ends up being equal to the
        # overall (E_{all trees}[moment^2] - E_bags[ E[mean_bag_moment]^2 ]) / sizeof(bag).
        # The negative part is just sq_between.
        var_total = np.mean(moment_var_bags, axis=0)
        correction = (var_total - sq_between) / (len(slices[0]) - 1)
        pred_cov_correction = np.einsum('ijk,ikm->ijm', invjac,
                                        np.einsum('ijk,ikm->ijm', correction, np.transpose(invjac, (0, 2, 1))))
        if project:
            pred_var_correction = np.einsum('ijk,ikm->ijm', projector.reshape((-1, 1, projector.shape[1])),
                                            np.einsum('ijk,ikm->ijm', pred_cov_correction,
                                                      projector.reshape((-1, projector.shape[1], 1))))[:, 0, 0]
        else:
            pred_var_correction = np.diagonal(pred_cov_correction, axis1=1, axis2=2)
        # Objective bayes debiasing for the diagonals where we know a-prior they are positive
        # The off diagonals we have no objective prior, so no correction is applied.
        naive_estimate = pred_var - pred_var_correction
        se = np.maximum(pred_var, pred_var_correction) * np.sqrt(2.0 / len(slices))
        zstat = naive_estimate / np.clip(se, 1e-10, np.inf)
        numerator = np.exp(- (zstat**2) / 2) / np.sqrt(2.0 * np.pi)
        denominator = 0.5 * erfc(-zstat / np.sqrt(2.0))
        pred_var_corrected = naive_estimate + se * numerator / denominator

        # Finally correcting the pred_cov or pred_var
        if project:
            pred_var = pred_var_corrected
        else:
            pred_cov = pred_cov - pred_cov_correction
            for t in range(self.n_outputs_):
                pred_cov[:, t, t] = pred_var_corrected[:, t]

    if project:
        if point:
            pred = np.sum(parameter * projector, axis=1)
            if var:
                return pred, pred_var
            else:
                return pred
        else:
            return pred_var

gdaiha commented on July 28, 2024

Another important detail: I was using a treatment dataframe with a featurizer, which gave me 6 columns in T. While inspecting the code, I noticed that many steps compute a cross product of T with itself. I think this is contributing to the memory spike too.
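To get a feel for how much the number of treatment columns could matter, here is a rough size estimate. It assumes an intermediate array that holds one d x d block per row per bag, which is what the shapes in the quoted snippet suggest (moment_var_bags is averaged over its first axis and then combined with the per-row d x d sq_between). The row count and bag count below are hypothetical, and array_gb is an illustrative helper, not an EconML function.

```python
from math import prod


def array_gb(shape, bytes_per_item=8):
    """Size in decimal GB of a dense array with this shape (float64 by default)."""
    return prod(shape) * bytes_per_item / 1e9


n_rows = 1_000_000   # hypothetical number of rows passed to effect
n_bags = 25          # hypothetical number of bags (sub-forests)

sizes = {}
for d in (1, 6):     # 1 treatment column vs. the 6 featurized columns
    sizes[d] = array_gb((n_bags, n_rows, d, d))
    print(f"d={d}: per-bag d x d blocks take about {sizes[d]:.1f} GB")

print(f"growth factor from d=1 to d=6: {sizes[6] / sizes[1]:.0f}x")
```

Under these assumptions, going from 1 to 6 treatment columns multiplies the size of every such array by 36, since the blocks grow with the square of the number of columns.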
