sergiocorreia / ftools

Fast Stata commands for large datasets

License: MIT License

Stata 99.91% TeX 0.09%

stata data-manipulation factor mata egen collapse merge

ftools's Issues

parallel_map crashes on computers with slow temp folder

parallel_map frequently hits spurious "syntax" errors that crash the program. Changing the sleep time in line 348 from 50 to e.g. 500 solves the problem. This matters for legacy servers that have slow drives for the temporary files folder but still have many cores that are useful for speeding up simulations.

I propose adding a [sleep(integer 50)] option in the syntax, with default 50 as is now. Should I do a pull request?

Stata 17, MP4.
Windows Server 2019.

fmerge overwrites master dataset's xtset

Minor note: unlike Stata's default behavior, fmerge overwrites the xtset of the master dataset with the xtset of the using dataset.

error when running fsort twice

Thanks for this excellent set of programs. However, when I run the command fsort twice, I get the following error:
. sysuse auto,clear
(1978 Automobile Data)

. fsort turn

. fsort turn
st_store(): 3200 conformability error
: - function returned error
r(3200);
Why does this happen?

join does not clear sortedby macro

With the following code

set seed 42
clear
set obs 20
gen x = _n
gen r = runiform()
drop if r > 0.5
sort r
drop r
tempfile x
save "`x'"

clear
set obs 20
gen y = _n
drop if runiform() > 0.5
sort y
join, by(y=x) from("`x'")
desc
disp "`:sortedby'"
l

shows the data is sorted by y. However, this is only the case for master/matched observations, which appear first and sorted. Unmatched using observations appear last and unsorted. Ideally sortedby would get cleared after join is run if the results will not be sorted.

which join gives *! version 2.36.1 13feb2019
which ftools gives *! version 2.42.0 28dec2020

improve 'ftools compile'

ftools xyz is treated as ftools compile instead of raising an error
similarly, ftools check stays silent, but ftools check, v (nonexisting option) ends up as ftools compile
if the ado/plus/l folder does not exist, we should try to create it before saving to it
the abcreg source code has a more robust variant of the ftools.ado code
expose ftools check on the help file

fcollapse with any missing weights returns all missing

Consider

clear
set obs 10
gen x = _n
gen y = 1
replace y = . if mod(_n, 3) == 0
gen g = mod(_n, 5)

Missing weights should be dropped, but instead all the results are missing

. which fcollapse
/home/mauricio/ado/plus/f/fcollapse.ado
*! version 2.24.1 15jan2018

.     preserve

.         fcollapse x [fw = y], by(g)

.         l

     +-------+
     | g   x |
     |-------|
  1. | 0   . |
  2. | 1   . |
  3. | 2   . |
  4. | 3   . |
  5. | 4   . |
     +-------+

.     restore

.     collapse x [fw = y], by(g)

.     l

     +---------+
     | g     x |
     |---------|
  1. | 0   7.5 |
  2. | 1     1 |
  3. | 2   4.5 |
  4. | 3     8 |
  5. | 4     4 |
     +---------+

Switch compile to parsetools

https://github.com/sergiocorreia/parsetools

Add check in panelsum() when factors are sorted

This would speed up F.sort() calls when factors are sorted in the dataset; particularly useful if we run this method a lot (e.g. reghdfe)

First, create .is_sorted

Then, intercept this loop and replace (not tested):

p[index[level] = index[level] + 1] = obs

with

p[idx = index[level] = index[level] + 1] = obs
if (is_sorted) {
    if (idx < last_idx) is_sorted = 0 // set is_sorted = 1 before the loop
    last_idx = idx // initially set last_idx = 0
}

Also benchmark it to see if the slowdown is high (in which case we make the sort check optional and unroll the loop)

Finally, sort() and _sort() should add a line like if (is_sorted) return(data)
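
The idea above can be sketched outside Mata. The following Python snippet is only an illustration, not ftools code: the function name `counting_sort_permutation`, the 1-based `levels` input, and the 0-based observation numbers are assumptions made for the example. It builds the permutation with a counting sort and tracks, inside the same loop, whether the observations were already in sorted order.

```python
def counting_sort_permutation(levels, num_levels):
    """Return (p, is_sorted) for 1-based factor levels.

    p[i] is the (0-based) observation placed at sorted position i.
    is_sorted is True when the input levels were already non-decreasing.
    """
    # Count how many observations fall in each level.
    counts = [0] * (num_levels + 1)
    for level in levels:
        counts[level] += 1

    # index[level] holds the next free slot for that level.
    index = [0] * (num_levels + 1)
    running = 0
    for level in range(1, num_levels + 1):
        index[level] = running
        running += counts[level]

    p = [0] * len(levels)
    is_sorted = True   # set before the loop, as in the proposal
    last_idx = -1      # smaller than any valid slot
    for obs, level in enumerate(levels):
        idx = index[level]
        index[level] += 1
        p[idx] = obs
        if is_sorted:
            # A slot lower than the previous one means the data were not sorted.
            if idx < last_idx:
                is_sorted = False
            last_idx = idx
    return p, is_sorted


if __name__ == "__main__":
    print(counting_sort_permutation([1, 1, 2, 3], 3))
    print(counting_sort_permutation([2, 1, 2], 2))
```

Once `is_sorted` turns false the extra comparison is skipped, so the cost of the check is concentrated in the already-sorted case, which is the case the benchmark would need to cover.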

fcollapse fails with string by variables (Stata 14/MP)

. sysuse auto, clear
(1978 Automobile Data)

. version 14

. fcollapse price, by(make)
      st_varvaluelabel():   181  Stata returned error
    Factor::store_keys():     -  function returned error
            f_collapse():     -  function returned error
                 <istmt>:     -  function returned error
r(181);

data type following fcollapse

. which ftools
/home/asheph/ado/plus/f/ftools.ado
*! version 2.24.0 11jan2018

Stata's own collapse command will promote the storage type of variables. So, if I have

set obs 1000
gen byte x = 1
collapse (sum) x

then the sum x will be promoted to double with a value of 1000 returned.

With fcollapse the storage type of the returned variable x is not promoted from byte, in which case a missing value is returned.
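
To see why the sum cannot survive in a byte, here is a small Python sketch (an illustration only, not ftools or Stata code). The ranges are Stata's documented limits for its integer storage types; the helper names `smallest_integer_type` and `store` are hypothetical, and `store` simply mimics the reported symptom by returning a missing value (`None`) when the number does not fit.

```python
# Documented ranges of Stata's integer storage types.
STATA_INT_RANGES = {
    "byte": (-127, 100),
    "int": (-32767, 32740),
    "long": (-2147483647, 2147483620),
}


def smallest_integer_type(value):
    """Return the narrowest storage type that can hold an integer value."""
    for name in ("byte", "int", "long"):
        low, high = STATA_INT_RANGES[name]
        if low <= value <= high:
            return name
    return "double"


def store(value, type_name):
    """Mimic writing into a fixed type: out-of-range values become missing."""
    if type_name == "double":
        return float(value)
    low, high = STATA_INT_RANGES[type_name]
    return value if low <= value <= high else None


if __name__ == "__main__":
    total = sum([1] * 1000)                 # (sum) x over 1000 observations
    print(store(total, "byte"))             # does not fit in a byte
    print(smallest_integer_type(total))     # narrowest type that would fit
    print(store(total, "double"))           # what collapse's promotion preserves
```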

error with join

The command -join- appears to fail with the error message 3598 when the master dataset is a tempfile. See reproducible example below. I am running Stata 14.0 on mac OS X 10.13.6.

sysuse auto , clear
generate id = _n
preserve
keep id make price mpg
tempfile tmp
save "`tmp'"
restore
keep id foreign
join , into("`tmp'") by(id)

Support many to many join

Would it be possible to add many to many functionality in the join command that could mimic the joinby command?

fegen group behaviour with missing grouping values

Firstly, thanks for this excellent set of programs.

. which ftools
/home/anon/ado/plus/f/ftools.ado
*! version 2.9.2 06apr2017

Suppose we have the data set:

Using egen group will return a missing group value if the grouping values have a missing value. Using fegen group does not distinguish between missing and non-missing values. So, running,

fegen group1 = group(hhid pid)
egen group2 = group(hhid pid)

produces the data set:

hhid	pid	group1	group2
1	1	1	1
2	1	2	2
2	1	2	2
2		3	
3	1	4	3
3	2	5	4
4		6	
4		6

which may cause problems if the user's program expects the same behaviour as the Stata command.

Best, Andrew
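
The two behaviours in the table can be reproduced with a short Python sketch. This is an illustration of the difference only, not the implementation of either command: the function name `group_ids` and the `keep_missing` flag are hypothetical, and missing values are represented by `None`.

```python
def group_ids(rows, keep_missing=False):
    """Number distinct key combinations 1, 2, ... in sorted order.

    keep_missing=False mirrors the egen result in the table: any row with
                       a missing key gets a missing group.
    keep_missing=True  mirrors the fegen result in the table: missing is
                       treated as just another key value.
    """
    def sort_key(row):
        # Missing values sort after every non-missing value, as in Stata.
        return tuple((value is None, 0 if value is None else value) for value in row)

    eligible = {row for row in rows if keep_missing or None not in row}
    ids = {row: number for number, row in enumerate(sorted(eligible, key=sort_key), start=1)}
    return [ids.get(row) for row in rows]


if __name__ == "__main__":
    data = [(1, 1), (2, 1), (2, 1), (2, None), (3, 1), (3, 2), (4, None), (4, None)]
    print(group_ids(data, keep_missing=True))   # group1 column
    print(group_ids(data, keep_missing=False))  # group2 column
```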

join: do not copy certain chars

When copying labels/notes from -using- to -master-, avoid copying _dta[...], in particular _TStvar _TSpanel _TSdelta _TSitrvl tis iis (or maybe no _dta at all?)

Otherwise, running -join- messes up -xtset-

join: assert that by() vars have same general type (str vs num)

EG:


key variable id1 is str5 in master but byte in using data
key variable id3 is str12 in master but int in using data
    Each key variable -- the variables on which observations are matched -- must be of the same generic type
    in the master and using datasets.  Same generic type means both numeric or both string.
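
A check of this kind is simple to express. The Python sketch below is an illustration only (the function names `generic_type` and `check_key_types` are hypothetical, and join itself is written in Stata/Mata): it classifies Stata storage types as string or numeric and reports every key whose generic type differs between the two datasets, using the wording of the message above.

```python
def generic_type(stata_type):
    """Map a Stata storage type (byte, int, long, float, double, str#, strL)
    to its generic type."""
    return "string" if stata_type.startswith("str") else "numeric"


def check_key_types(master_types, using_types):
    """Return one message per key variable whose generic type differs.

    Both arguments map key variable names to Stata storage types.
    """
    problems = []
    for name, master_type in master_types.items():
        using_type = using_types[name]
        if generic_type(master_type) != generic_type(using_type):
            problems.append(
                f"key variable {name} is {master_type} in master "
                f"but {using_type} in using data"
            )
    return problems


if __name__ == "__main__":
    master = {"id1": "str5", "id2": "long", "id3": "str12"}
    using = {"id1": "byte", "id2": "int", "id3": "int"}
    for message in check_key_types(master, using):
        print(message)
```

Keys that differ only in width, such as long versus int, pass the check because both are numeric.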

Adding update / replace to fmerge

Would it be possible to extend fmerge to allow for update or replace?

fmerge: 1:1 merge, error 3498, <id1 id2> do not uniquely identify obs. in the master data

Hello,

A few times I have accidentally performed a 1:1 merge using fmerge when I should have performed a m:1 merge. I then get error 3498. However, rather than failing to merge (as would happen, I think, with the standard merge command) fmerge seems to perform a correct m:1 merge anyway. Am I correct that this happens? I wonder if this is a feature or a bug?

Thanks for your help!

fmerge / join changes using-keys>100 to missing

Hi Sergio,

I found a rather weird bug in the fmerge/join command:

Let's say I have a numeric ID ranging from 1 to X, where X > 100. In the master data the ID is always < 100, but in the using data the ID can be > 100. The fmerge/join works; however, it changes all IDs > 100 to missing in the final data. This happens in all join into/from combinations.

I attached an example code and data.

Best,

Chris

bugexample.zip

(ftools header: *! version 2.37.0 16aug2019,
join header: *! version 2.36.1 13feb2019,
fmerge header: *! version 2.10.0 3apr2017,
stata version 15 mp,
mac osx
)

Possible issue with fsort when clearing sort variable

In testing hashsort, I found that fsort sometimes did not give me an identical data set compared to sort, stable. I cannot replicate this from a blank session very easily, so here is the data that gives the issue:

local addr https://raw.githubusercontent.com/mcaceresb/stata-gtools
local path develop/src/github-issues/

use `addr'/`path'/fsort_share.dta
sort int1, stable
tempfile cmp
save `cmp'

use `addr'/`path'/fsort_share.dta
fsort int1
cf * using `cmp'

The result is

. cf * using `cmp'
           rsort:  1 mismatch
r(9);

I believe the issue is with Andrew Maurer's trick to clear `: sortedby'. I got around this by setting obs to _N + 1, manipulating the last observation, and dropping it. This way the original data is never altered.

fmerge fails if there is a space in the path

fmerge id using "/some path with spaces/somefile" will fail due to insufficient quotes. We should quote everything that touches the -filename- local, including instances of -use-.

Ftools might not autocompile the mlib after a new version

join: problems with spaces in filepath

Hi!
Join will throw an error if the filepath for your tempfiles contains spaces. This can be fixed by enclosing "from", "into" and "filename" on lines 24 and 65 of join.ado in additional quotes.

Thanks so much for creating this great set of tools - it already saved me hours of time!

Treat negative values in verbose(#) as if they were zero instead of positive

fcollapse issue with double identifier

Hi Sergio,

First - thanks very much for writing the gtools and ftools packages. Incredibly useful and a great public service!!!

I have found that with fcollapse, the results of the collapse are likely to be incorrect if the identifier is a very long double (in my example, census block identifiers). Another reminder that one should always use strings for identifiers, but I wanted to flag this issue. This issue does not show up using collapse or gcollapse.

Thanks -- Nate

fcollapse incorrectly parses negative weights

clear
set obs 1
gen x = 1
gen y = -1
preserve
    fcollapse (p30) p30 = x (p50) p50 = x (p70) p70 = x [fw = y]
restore, preserve
    fcollapse (p30) p30 = x (p50) p50 = x (p70) p70 = x [pw = y]
restore, preserve
    fcollapse (p30) p30 = x (p50) p50 = x (p70) p70 = x [aw = y]
restore

With collapse, all those instances throw errors. Further,

set seed 42
clear
set obs 10
gen x = _n
gen y = int(10 * rnormal())
l
preserve
    fcollapse (p10) p10 = x (p50) p50 = x (p90) p90 = x [iw = y]
    l
restore, preserve
    collapse (p10) p10 = x (p50) p50 = x (p90) p90 = x [iw = y]
    l
restore

Produces

     +-----------------+
     | p10   p50   p90 |
     |-----------------|
  1. |  10     1     1 |
     +-----------------+

     +-----------------+
     | p10   p50   p90 |
     |-----------------|
  1. |   1     5    10 |
     +-----------------+

I think you are running into the same issue I did, and are forgetting to normalize the weights. For percentiles, collapse multiplies the weights by (number of non-missing observations) / (sum of weights) before computing them. This gives the right answer:

 qui sum y, meanonly
 gen ynorm = y / `r(sum)'
 fcollapse (p10) p10 = x (p50) p50 = x (p90) p90 = x [iw = ynorm]
 l
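
The normalization described in this reply can be written out as a small Python sketch. It is an illustration of the stated rule only, not code from collapse or fcollapse: the function name `normalize_weights` is hypothetical and missing weights are represented by `None`.

```python
import math


def normalize_weights(weights):
    """Rescale weights by (number of non-missing weights) / (sum of weights).

    Missing weights (None) stay missing. After rescaling, the non-missing
    weights sum to the number of non-missing weights.
    Raises ZeroDivisionError if the weights sum to zero.
    """
    present = [w for w in weights if w is not None]
    scale = len(present) / sum(present)
    return [None if w is None else w * scale for w in weights]


if __name__ == "__main__":
    normalized = normalize_weights([1, 2, None, 5])
    print(normalized)
    total = sum(w for w in normalized if w is not None)
    print(math.isclose(total, 3.0))
```

Note that when the weights sum to a negative number, as can happen with the int(10 * rnormal()) weights in the example, the rescaling also flips the sign of every weight.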

Running ftools commands such as `join` can clear mata objects unrelated to ftools

ftools/src/ftools.mata

Line 13 in aaeba12

mata: mata clear

Solution: only drop as needed

Command join is unrecognised

Hi,

first of all, thanks a lot for both ftools and reghdfe! These are superb tools...

I have an issue using Stata 14 and using fmerge: I get an error r(199) stating "command join is unrecognised".
I have ftools also running on a Stata 13 version and there I have no issues.
All the best,
Glenn

join: error when using labels and key has different name in using

EG:

join  , from("using_data") by(country_id=id_country)

If using has labels in the country id, Mata will search the wrong variable name (the one in using and not master?) and thus give the error

st_varvaluelabel():  3500  invalid Stata variable name

Dict size exceeds Mata limits?

Hi Sergio,
I'm running into a dict size exceeds Mata limits error running reghdfe on a pretty large dataset, and found this error built into the ftools .ado file.

According to the Stata documentation for the release of Stata 16, though, "Mata matrices remain limited only by memory," so I was wondering if there's a remaining reason for the hard-coded limit in place here?

Thanks!

fmerge error

I just received the following error when using fmerge. I suspect I hit some sort of memory limit.

fmerge m:1 perm using "`permutations'", assert(match) nogen
                  join():  3900  unable to allocate real <tmp>[403200000,9]
                 <istmt>:     -  function returned error
r(3900);

Here's version number:

which ftools             
/home/wtownsend/ado/plus/f/ftools.ado
SMP Mon Jun 25 08:07:07 PDT 
*! version 2.24.3 24jan2018

Here's the dofile, for context
https://pastebin.com/NixsYupi

Add help file for the extras ported from moresyntax

The ones of general interest are

ms_get_version
ms_compile_mata

The others are:

ms_fvstrip
ms_fvunab
ms_parse_absvars
ms_parse_varlist
ms_parse_vce

Error when merging m:1 on string variable

set obs 100
gen a = string(_n)
tempfile temp
save `temp'
fmerge m:1 a using `temp'

returns

assert_is_id():  3498  <a> do not uniquely identify obs. in the using data
                  join():     -  function returned error
                 <istmt>:     -  function returned error

fmerge error

Hi, I am using Stata 13.1 MP and get the following error when running fmerge:

  is_integers_only():  3253  pk[325815,1] found where real required

              join():     -  function returned error

             <istmt>:     -  function returned error
