
Comments (7)

mikaylagawarecki commented on September 24, 2024

cc @clee2000 since you added 3b7d60b


clee2000 commented on September 24, 2024

I can move the option to save XML into an argument. The paths to the XMLs are generated by run_test.py though, is that ok?


Flamefire commented on September 24, 2024

I can move the option to save XML into an argument. The paths to the XMLs are generated by run_test.py though, is that ok?

You mean this code here:

def get_report_path(argv=UNITTEST_ARGS, pytest=False):
    test_filename = sanitize_test_filename(argv[0])
    test_report_path = TEST_SAVE_XML + LOG_SUFFIX
    test_report_path = os.path.join(test_report_path, test_filename)
    if pytest:
        test_report_path = test_report_path.replace('python-unittest', 'python-pytest')
        os.makedirs(test_report_path, exist_ok=True)
        test_report_path = os.path.join(test_report_path, f"{test_filename}-{os.urandom(8).hex()}.xml")
        return test_report_path
    os.makedirs(test_report_path, exist_ok=True)
    return test_report_path
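
Tracing that by hand with made-up values (these values are assumptions on my side, not output from an actual run, and I assume sanitize_test_filename just strips the ".py" suffix here):

# Assumed: --save-xml=/tmp/test-reports/report, --log-suffix=-shard-1, running test_torch.py under pytest
import os
TEST_SAVE_XML, LOG_SUFFIX = "/tmp/test-reports/report", "-shard-1"
test_filename = "test_torch"  # sanitize_test_filename("test_torch.py")
report_dir = os.path.join(TEST_SAVE_XML + LOG_SUFFIX, test_filename)
# the 'python-unittest' -> 'python-pytest' replacement is a no-op for this assumed path
report_path = os.path.join(report_dir, f"{test_filename}-{os.urandom(8).hex()}.xml")
# -> /tmp/test-reports/report-shard-1/test_torch/test_torch-<16 hex chars>.xml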

To me it looks like all XML files go into the folder passed via --save-xml, though they may be in subfolders and may have random names, unless --log-suffix is used (e.g.

command = [sys.executable] + argv + [f'--log-suffix=-shard-{i + 1}'] + test_batches[i]

), in which case a string is appended to the folder name. Still usable though: with --save-xml=/tmp/test-reports/report all XMLs end up somewhere under /tmp/test-reports.

The main question is: is that even a feasible approach? I.e. does it reliably get all failed tests, especially with the retry logic, where I'm not sure whether it will skip the "consistently failing" test and still run all the others (before and after that failing one).

As those XMLs don't seem to be used (otherwise it would already have been noticed that they got broken), is there any other/better approach?
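
For reference, the kind of XML parsing I have in mind is roughly this minimal sketch (the report directory and the JUnit attribute layout are assumptions on my part, not verified against actual CI output):

import glob
import os
import xml.etree.ElementTree as ET

def collect_failed_tests(report_dir="/tmp/test-reports"):
    """Collect unique failed test ids from all XMLs under the --save-xml folder."""
    failed = set()
    for xml_file in glob.glob(os.path.join(report_dir, "**", "*.xml"), recursive=True):
        for case in ET.parse(xml_file).getroot().iter("testcase"):
            # A testcase with a <failure> or <error> child counts as failed;
            # the set dedups reruns of the same test across report files.
            if case.find("failure") is not None or case.find("error") is not None:
                failed.add(f"{case.get('classname')}.{case.get('name')}")
    return sorted(failed)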


clee2000 commented on September 24, 2024

I'm thinking something more along the lines of #126690. Ignore my previous comment as well.

Have you tried using --save-xml as an additional unittest arg?

Our CI makes XMLs and saves them to S3; you can download them on HUD to see what format and file paths they use. There are already a couple of scripts in the repository that parse the XMLs to get certain information, so you can probably copy a bunch of code from those.

It doesn't get all failed tests; there are some cases when XMLs don't get made (#123882), but I think the log parsing method you used previously would have also not been able to handle this. It will handle most retries correctly, but you will have to dedup based on test name.

I guess it's also worth noting that you could parse the logs since all the test names should be there if you use the verbose option?


Flamefire commented on September 24, 2024

Our CI makes XMLs and saves them to S3

Where does that happen, i.e. where are the xmls created?

It doesn't get all failed tests; there are some cases when XMLs don't get made (#123882)

There seem to be two issues: force-terminated tests and reruns. I'd expect the XML files to also contain the rerun tests even currently, since the file names get randomized.

but I think the log parsing method you used previously would have also not been able to handle this

IIRC when plain pytest-rerunfailures was used (prior to 3b7d60b) there was still a full summary line for each test file, like "2 failed, 128 passed, 2 rerun", which we could use.
For force-terminated test files we treated it as a hard failure or excluded the test(s).
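
For illustration, parsing such a summary line is straightforward (a minimal sketch, not our actual code):

import re

summary = "2 failed, 128 passed, 2 rerun"
counts = {kind: int(num) for num, kind in re.findall(r"(\d+) (\w+)", summary)}
assert counts == {"failed": 2, "passed": 128, "rerun": 2}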

I guess it's also worth noting that you could parse the logs since all the test names should be there if you use the verbose option?

We actually do that already, to be able to print individual failed tests in addition to the summaries (per test file and total). It even allowed us to validate that we got all individual failed tests by checking against the summary. But that already turned out to be error-prone due to the different formats. Some examples:

    # === FAIL: test_add_scalar_relu (quantization.core.test_quantized_op.TestQuantizedOps) ===
    # --- ERROR: test_all_to_all_group_cuda (__main__.TestDistBackendWithSpawn) ---
    # FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_grad_linalg_det_singular_cpu_complex128 - [snip]
    # FAILED [22.8699s] test_sparse_csr.py::TestSparseCompressedCPU::test_invalid_input_csr_large_cpu - [snip]
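
A rough sketch of the kind of patterns this requires (simplified, and only covering the example formats above; our actual parsing has to handle more variants):

import re

# Simplified patterns for the example formats above; the real logs have more variants.
UNITTEST_RE = re.compile(r"(?:FAIL|ERROR): (\S+) \(([\w.]+)\)")       # unittest-style lines
PYTEST_RE = re.compile(r"FAILED (?:\[[^\]]+\] )?(\S+?\.py)::(\S+)")   # pytest-style lines

def failed_tests_from_log(lines):
    found = set()
    for line in lines:
        if m := UNITTEST_RE.search(line):
            found.add(f"{m.group(2)}.{m.group(1)}")
        elif m := PYTEST_RE.search(line):
            found.add(f"{m.group(1)}::{m.group(2)}")
    return found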


clee2000 commented on September 24, 2024

The save-xml argument is set by default by the CI environment variable, which is true in our CI. I guess the other option would be for you to set it as well, but it also turns on a couple of other things, so you'd have to be careful.

I'm not sure what a force-terminated test is, but the case where it doesn't generate XML is when a test segfaults, which leads to the XML not getting created and also the === x failed, y skipped === line not getting printed. Then the rerun starts from the segfaulted test, so all information about the tests before the segfaulted test gets lost. I can't remember what the previous behavior was, but depending on how far back it was, I'm pretty sure the previous test information still gets lost. Other than this, the reruns should get XML; it's just that it'll be in a different file, so it's a bit more parsing.

I edited my comment because I realized I had understood the usage of the save-xml argument incorrectly, so I'm not sure if you saw this, but have you tried using --save-xml as an additional unittest arg? nvm, this also doesn't work the way I think it does


Flamefire commented on September 24, 2024

The save-xml argument is set by default by the CI environment variable, which is true in our CI. I guess the other option would be for you to set it as well, but it also turns on a couple of other things, so you'd have to be careful.

Is that export CI=true set by e.g. GitHub? I found

pytorch/test/run_test.py, lines 993 to 997 in b36e018:

if IS_CI:
    # Add the option to generate XML test report here as C++ tests
    # won't go into common_utils
    test_report_path = get_report_path(pytest=True)
    pytest_args.extend(["--junit-xml-reruns", test_report_path])
This sets --junit-xml-reruns in run_test.py, but the same flag is set in common_utils only when --save-xml is passed (https://github.com/pytorch/pytorch/blob/b36e01801b89a516f4271f796773d5f4b43f1186/torch/testing/_internal/common_utils.py#L1108-L1111), and the former only applies to tests run with pytest. So it looks incomplete to me.

I'm not sure what a force-terminated test is, but the case where it doesn't generate XML is when a test segfaults, which leads to the XML not getting created

Yes, I meant test (files) terminated by a signal like SIGSEGV, but also SIGIOT (which we have observed).

all information about the tests before the segfaulted test gets lost. I can't remember what the previous behavior was, but depending on how far back it was, I'm pretty sure the previous test information still gets lost.

Yes, it was the same here: most of the information was lost (e.g. that summary line), but some individual tests could be found by parsing the log. That is fragile though, see my previous comment.

Other than this, the reruns should get XML; it's just that it'll be in a different file, so it's a bit more parsing

Just to be sure I understood this correctly: for segfaulted tests there won't be any XMLs even with the new rerun system?

