sergiocorreia / ftools
Fast Stata commands for large datasets
License: MIT License
parallel_map frequently runs into supposed "syntax" errors that crash the program. Changing the sleep time in line 348 from 50 to, e.g., 500 solves this problem. This matters for legacy servers that have slow drives for the temporary files folder but still have many cores that are useful for speeding up simulations.
I propose adding a [sleep(integer 50)] option in the syntax, with default 50 as is now. Should I do a pull request?
Stata 17, MP4.
Windows Server 2019.
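A minimal sketch of how such an option could be declared with Stata's syntax command. The program name demo_wait is hypothetical and is not part of parallel_map; only the [sleep(integer 50)] declaration comes from the proposal above.

```stata
* Hypothetical wrapper, not the actual parallel_map source
capture program drop demo_wait
program define demo_wait
    * Optional integer option; defaults to 50 when omitted
    syntax , [sleep(integer 50)]
    * Pause for the requested number of milliseconds
    sleep `sleep'
    display as text "waited `sleep' ms"
end

* Uses the default of 50 ms
demo_wait
* Slower temporary drive: wait longer between checks
demo_wait, sleep(500)
```

Keeping 50 as the default leaves existing calls unchanged, while users on slow drives can opt into a longer wait.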
Minor note, unlike Stata's default behavior, fmerge will overwrite the xtset of the master dataset with the xtset of the using dataset.
Thanks for this excellent set of programs. When I run the command fsort twice, I get an error like this:
. sysuse auto,clear
(1978 Automobile Data)
. fsort turn
. fsort turn
st_store(): 3200 conformability error
: - function returned error
r(3200);
Why does this happen?
With the following code
set seed 42
clear
set obs 20
gen x = _n
gen r = runiform()
drop if r > 0.5
sort r
drop r
tempfile x
save "`x'"
clear
set obs 20
gen y = _n
drop if runiform() > 0.5
sort y
join, by(y=x) from("`x'")
desc
disp "`:sortedby'"
l
the output reports that the data is sorted by y. However, this is only the case for master/matched observations, which appear first and sorted. Unmatched using observations appear last and unsorted. Ideally, sortedby would be cleared after join is run if the results will not be sorted.
which join gives *! version 2.36.1 13feb2019
which ftools gives *! version 2.42.0 28dec2020
ftools xyz is treated as ftools compile instead of raising an error
ftools check stays silent, but ftools check, v (nonexisting option) ends up as ftools compile
abcreg source code has a more robust variant of the ftools.ado code
ftools check on the help file
Consider
clear
set obs 10
gen x = _n
gen y = 1
replace y = . if mod(_n, 3) == 0
gen g = mod(_n, 5)
Missing weights should be dropped, but instead all the results are missing
. which fcollapse
/home/mauricio/ado/plus/f/fcollapse.ado
*! version 2.24.1 15jan2018
. preserve
. fcollapse x [fw = y], by(g)
. l
+-------+
| g x |
|-------|
1. | 0 . |
2. | 1 . |
3. | 2 . |
4. | 3 . |
5. | 4 . |
+-------+
. restore
. collapse x [fw = y], by(g)
. l
+---------+
| g x |
|---------|
1. | 0 7.5 |
2. | 1 1 |
3. | 2 4.5 |
4. | 3 8 |
5. | 4 4 |
+---------+
This would speed up F.sort() calls when factors are sorted in the dataset; particularly useful if we run this method a lot (e.g. reghdfe)
First, create .is_sorted
Then, intercept this loop and replace (not tested):
p[index[level] = index[level] + 1] = obs
with
p[idx = index[level] = index[level] + 1] = obs
if (is_sorted) {
if (idx < last_idx) is_sorted = 0 // set is_sorted = 1 before the loop
last_idx = idx // initially set last_idx = 0
}
Also benchmark it to see if the slowdown is high (in which case we make the sort check optional and unroll the loop)
Finally, sort() and _sort() should add a line like if (is_sorted) return(data)
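For comparison, the same order check can be sketched as a standalone Mata function. This is only an illustration of the test being proposed, not the Factor class code: the name is_sorted_sketch is hypothetical, and it scans a finished vector in a separate pass instead of tracking order inside the existing loop.

```stata
* Drop any earlier definition so the block can be re-run
capture mata: mata drop is_sorted_sketch()

mata:
// Returns 1 if x is in nondecreasing order, 0 otherwise
real scalar is_sorted_sketch(real colvector x)
{
    real scalar i

    for (i = 2; i <= rows(x); i++) {
        // Stop at the first element that is smaller than its predecessor
        if (x[i] < x[i - 1]) return(0)
    }
    return(1)
}

// A sorted vector (ties allowed) and an unsorted one
is_sorted_sketch((1 \ 2 \ 2 \ 5))
is_sorted_sketch((1 \ 3 \ 2))
end
```

A separate pass costs one extra scan of the data, which is why the proposal above folds the comparison into a loop that already visits every observation.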
. sysuse auto, clear
(1978 Automobile Data)
. version 14
. fcollapse price, by(make)
st_varvaluelabel(): 181 Stata returned error
Factor::store_keys(): - function returned error
f_collapse(): - function returned error
<istmt>: - function returned error
r(181);
. which ftools
/home/asheph/ado/plus/f/ftools.ado
*! version 2.24.0 11jan2018
Stata's own collapse command will promote the storage type of variables. So, if I have
set obs 1000
gen byte x = 1
collapse (sum) x
then the sum x will be promoted to double, with a value of 1000 returned.
With fcollapse, the storage type of the returned variable x is not promoted from byte, in which case a missing value is returned.
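Until the storage type is promoted automatically, one possible workaround is to widen the variable by hand before collapsing. This is a sketch under the assumption that fcollapse keeps the storage type it is given; it requires ftools to be installed.

```stata
clear
set obs 1000
gen byte x = 1

* A byte holds at most 100, so widen the variable before summing
recast double x

* Assumption: fcollapse keeps the (now wider) storage type of x
fcollapse (sum) x
list
```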
The command -join- appears to fail with the error message 3598 when the master dataset is a tempfile. See reproducible example below. I am running Stata 14.0 on mac OS X 10.13.6.
sysuse auto , clear
generate id = _n
preserve
keep id make price mpg
tempfile tmp
save "`tmp'"
restore
keep id foreign
join , into("`tmp'") by(id)
Would it be possible to add many-to-many functionality to the join command, so that it could mimic the joinby command?
Firstly, thanks for this excellent set of programs.
. which ftools
/home/anon/ado/plus/f/ftools.ado
*! version 2.9.2 06apr2017
Suppose we have the data set:
hhid  pid
1     1
2     1
2     1
2     .
3     1
3     2
4     .
4     .
Using egen group will return a missing group value if any of the grouping variables is missing. Using fegen group does not distinguish between missing and non-missing values. So, running,
fegen group1 = group(hhid pid)
egen group2 = group(hhid pid)
produces the data set:
hhid  pid  group1  group2
1     1    1       1
2     1    2       2
2     1    2       2
2     .    3       .
3     1    4       3
3     2    5       4
4     .    6       .
4     .    6       .
which may cause problems if the user's program expects the same behaviour as the Stata command.
Best, Andrew
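As a stopgap, the group id can be blanked out afterwards wherever a grouping variable is missing. This sketch requires ftools and only mirrors how egen treats missing values: the remaining ids are not renumbered, so they identify the same groups as egen but with different numbers.

```stata
clear
input hhid pid
1 1
2 1
2 1
2 .
3 1
3 2
4 .
4 .
end

fegen group1 = group(hhid pid)

* Mimic egen: no group id when any grouping variable is missing
replace group1 = . if missing(hhid, pid)
list
```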
When copying labels/notes from -using- to -master-, avoid copying _dta[...], in particular _TStvar _TSpanel _TSdelta _TSitrvl tis iis (or maybe no _dta at all?)
Otherwise, running -join- messes up -xtset-
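Until that is changed, one way to protect the panel settings is to store them before the merge and declare them again afterwards. This is a sketch with a toy panel; the variable names id and t are illustrative, and the merge step itself is left as a placeholder comment.

```stata
* Toy panel; the variable names id and t are illustrative only
clear
set obs 6
gen id = ceil(_n / 3)
gen t  = mod(_n - 1, 3) + 1
xtset id t

* Remember the current panel settings before merging
local panelvar `r(panelvar)'
local timevar  `r(timevar)'

* ... run join or fmerge here ...

* Declare the panel again in case the merge replaced the settings
xtset `panelvar' `timevar'
```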
EG:
key variable id1 is str5 in master but byte in using data
key variable id3 is str12 in master but int in using data
Each key variable -- the variables on which observations are matched -- must be of the same generic type
in the master and using datasets. Same generic type means both numeric or both string.
Would it be possible to extend fmerge to allow for update or replace?
Hello,
A few times I have accidentally performed a 1:1 merge using fmerge when I should have performed a m:1 merge. I then get error 3498. However, rather than failing to merge (as would happen, I think, with the standard merge command) fmerge seems to perform a correct m:1 merge anyway. Am I correct that this happens? I wonder if this is a feature or a bug?
Thanks for your help!
Hi Sergio,
I found a rather weird bug in the fmerge/join command:
Let's say I have a numeric ID ranging from 1 to X where X>100. In the Master data, the ID is always <100 but in the Using data the ID can be > 100. The fmerge/join works; however, it will change all IDs > 100 to missing in the final data. This behavior occurs in all join into/from combinations.
I attached an example code and data.
Best,
Chris
(ftools header: *! version 2.37.0 16aug2019,
join header: *! version 2.36.1 13feb2019,
fmerge header: *! version 2.10.0 3apr2017,
stata version 15 mp,
mac osx
)
In testing hashsort, I found that fsort sometimes did not give me an identical data set compared to sort, stable. I cannot replicate this from a blank session very easily, so here is the data that gives the issue:
local addr https://raw.githubusercontent.com/mcaceresb/stata-gtools
local path develop/src/github-issues/
use `addr'/`path'/fsort_share.dta
sort int1, stable
tempfile cmp
save `cmp'
use `addr'/`path'/fsort_share.dta
fsort int1
cf * using `cmp'
The result is
. cf * using `cmp'
rsort: 1 mismatch
r(9);
I believe the issue is with Andrew Maurer's trick to clear : sortedby. I got around this by setting obs to =_N + 1, manipulating the last observation, and dropping it. This way the original data is never altered.
fmerge id using "/some path with spaces/somefile"
will fail due to insufficient quotes (we should quote everything that touches the -filename- local, including instances of -use-)
Hi!
Join will throw an error if the filepath for your tempfiles contains spaces. This can be fixed by enclosing "from", "into" and "filename" on lines 24 and 65 of join.ado in additional quotes.
Thanks so much for creating this great set of tools - it already saved me hours of time!
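The quoting problem can be reproduced with built-in commands alone. This is a minimal sketch of the general pattern and not the join.ado code; the file name is made up for the example and is written to the current working directory.

```stata
* Hypothetical file name containing spaces
local filename "file with spaces.dta"

sysuse auto, clear
save "`filename'", replace

* Without the surrounding quotes, the expanded name would be split
* at the spaces and the command would fail
use "`filename'", clear

* Clean up the example file
erase "`filename'"
```

Wrapping every expansion of the local in double quotes is harmless for paths without spaces, so it can be applied unconditionally.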
Hi Sergio,
First - thanks very much for writing the gtools and ftools packages. Incredibly useful and a great public service!!!
I have found that with fcollapse, the results of the collapse are likely to be incorrect if the identifier is a very long double (in my example, census block identifiers). Another reminder that one should always use strings for identifiers, but I wanted to flag this issue. This issue does not show up using collapse or gcollapse.
Thanks -- Nate
clear
set obs 1
gen x = 1
gen y = -1
preserve
fcollapse (p30) p30 = x (p50) p50 = x (p70) p70 = x [fw = y]
restore, preserve
fcollapse (p30) p30 = x (p50) p50 = x (p70) p70 = x [pw = y]
restore, preserve
fcollapse (p30) p30 = x (p50) p50 = x (p70) p70 = x [aw = y]
restore
With collapse, all those instances throw errors. Further,
set seed 42
clear
set obs 10
gen x = _n
gen y = int(10 * rnormal())
l
preserve
fcollapse (p10) p10 = x (p50) p50 = x (p90) p90 = x [iw = y]
l
restore, preserve
collapse (p10) p10 = x (p50) p50 = x (p90) p90 = x [iw = y]
l
restore
Produces
+-----------------+
| p10 p50 p90 |
|-----------------|
1. | 10 1 1 |
+-----------------+
vs
+-----------------+
| p10 p50 p90 |
|-----------------|
1. | 1 5 10 |
+-----------------+
I think you are running into the same issue I did, and are forgetting to normalize the weights. For percentiles, collapse multiplies the weights by (number of non-missing observations) / (sum of weights) before computing them. This gives the right answer:
qui sum y, meanonly
gen ynorm = y / `r(sum)'
fcollapse (p10) p10 = x (p50) p50 = x (p90) p90 = x [iw = ynorm]
l
Line 13 in aaeba12
Solution: only drop as needed
Hi,
first of all, thanks a lot for both ftools and reghdfe! These are superb tools...
I have an issue using Stata 14 and using fmerge: I get an error r(199) stating "command join is unrecognised".
I have ftools also running on a Stata 13 version and there I have no issues.
All the best,
Glenn
EG:
join , from("using_data") by(country_id=id_country)
If using has labels in the country id, Mata will search the wrong variable name (the one in using and not master?) and thus give the error
st_varvaluelabel(): 3500 invalid Stata variable name
Hi Sergio,
I'm running into a "dict size exceeds Mata limits" error running reghdfe on a pretty large dataset, and found this error built into the ftools .ado file.
According to the Stata documentation for the release of Stata 16, though, "Mata matrices remain limited only by memory," so I was wondering if there's a remaining reason for the hard-coded limit in place here?
Thanks!
I just received the following error when using fmerge. I suspect I hit some sort of memory limit.
fmerge m:1 perm using "`permutations'", assert(match) nogen
join(): 3900 unable to allocate real <tmp>[403200000,9]
<istmt>: - function returned error
r(3900);
Here's version number:
which ftools
/home/wtownsend/ado/plus/f/ftools.ado
SMP Mon Jun 25 08:07:07 PDT
*! version 2.24.3 24jan2018
Here's the dofile, for context
https://pastebin.com/NixsYupi
The ones of general interest are
ms_get_version
ms_compile_mata
The others are:
ms_fvstrip
ms_fvunab
ms_parse_absvars
ms_parse_varlist
ms_parse_vce
set obs 100
gen a = string(_n)
tempfile temp
save `temp'
fmerge m:1 a using `temp'
returns
assert_is_id(): 3498 <a> do not uniquely identify obs. in the using data
join(): - function returned error
<istmt>: - function returned error
Hi, I am using Stata 13.1 MP and get the following error when running fmerge:
is_integers_only(): 3253 pk[325815,1] found where real required
join(): - function returned error
<istmt>: - function returned error