spacemeshos / post

Spacemesh POST protocol implementation

License: MIT License

Go 96.70% Makefile 1.74% Pascal 1.56%

post's People

tzdybal sudachen cocoonventures sevskii111 dshulyak garrettian beck-8 reythia zhiqiangxu fourierism 6block minerdao schinzelh hunjixin leveleven xchdata1

post's Issues

Build and produce artifacts for postcli

There is a big demand for postcli. We should include postcli as a binary so users don't have to compile it by hand.

We need a way to arbitrarily define the start point for POST initialisation

When, say, one is renting GPUs for initialisation or NOT using storage on the same device, it is very inconvenient to have to hope that the process never restarts.

We need a way to specify --startFrom xyz, where xyz is file,labels

Rewrite the proving algorithm in rust

The current implementation is limited by what Golang makes possible (#99).
We have done a PoC version in Rust.

We need to follow up with a full Rust implementation of the algorithm and then integrate it into the Golang codebase.

Besides the implementation, we need to make sure we remember all the needed improvements, including (but not limited to):

if a proof is not found, iterate again with other nonces
invalid disk reads #101

Fsync directory and a file after proof is written to disk

To guarantee that the proof is persisted we need to:

fsync the file (to sync inode and data)
fsync the directory (to sync the file entry in the directory)

Otherwise, if the node crashes, the proof might be lost (depending on whether the kernel managed to sync it in the background).

Support stand-alone execution mode with a GRPC API

Support building a stand-alone system process. In this mode, the process provides a GRPC interface for creating post init jobs which should include the single provider and multiple providers job. The CI should build this stand-alone process for all supported platforms / architectures (currently 3 targets).

@moshababo

inconsistent representation of commitment ATX and node ID

Problem statement

Node ID and commitment ATX ID are formatted in base64 both in postdata_metadata.json and in post-rs CLI initializer. However, postcli accepts them in hex. They are also formatted in hex when logged in go-spacemesh. It is inconsistent and causes extra work to convert.

DoD

We should format ATXes in the same way everywhere (either hex or base64; which one is still to be decided).
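Until the formats are unified, the manual conversion that users currently have to do can be sketched as below. The helper names are hypothetical, and the sketch assumes the standard padded base64 alphabet; whether the metadata file uses that exact variant is an assumption.

```go
package main

import (
	"encoding/base64"
	"encoding/hex"
)

// base64ToHex converts an ID from standard base64 (assumed to be the variant
// used in postdata_metadata.json) to the hex form that postcli accepts.
func base64ToHex(s string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(raw), nil
}

// hexToBase64 converts an ID from hex back to standard base64.
func hexToBase64(s string) (string, error) {
	raw, err := hex.DecodeString(s)
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}
```

For example, the two bytes 0x00 0xff are "00ff" in hex and "AP8=" in base64.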

Add the POW to the AES invocations so running proving with multiple nonces is not that cheap.

Description

We need to make using multiple (100s of thousands) nonces in proving harder. Therefore we need to add POW to the initial phase of AES.

Acceptance criteria

Define POW algorithm together with the research team
- The goal is to prevent an adversary from grinding on challenges for keys that would allow them to generate a valid PoST proof without having the labels stored on disk.
Limit maximum nonce for a challenge to 40
- This does not mean that the reference implementation of PoST uses 40 nonces to look for indices that satisfy the difficulty threshold, just that when verifying a proof by another identity, nonces >= 40 will be considered invalid

Remove go-spacemesh dependency

log: we can implement similar solution to PoET's log (based on btclog)
filesystem: looks like similar use-case as this. We can create a dedicated spacemesh util repo, or copy the code. @zalmen

post k2pow: change hashing algo and N

Description

quoting Iddo:

the scrypt+blake3 hash for k2pow will require that either all miners use GPU 
for proof generation, or lousy k1,k2 params and large ATX size for no good reason. 
The scrypt+sha3 that we use for labels is the worst thing that we can do because 
it has two extra disadvantages (efficient adversarial method for label computation 
will also break k2pow for free, and sha3 is inferior in general and we shouldn't use 
it in software ever).

We should just link to standard randomx repo (both rust crate and golang package 
are available),  it's easy to do in 5 minutes because it's totally unrelated to the GPU 
code, it's probably even easier than having two separate scrypt+sha3 functions 
and definitely easier than scrypt+blake3.

Proof generation / verification: allow cancellation

Description

Proof generation and verification should be able to be cancelled for multiple reasons:

Cycle gap of PoET passed and the node hasn't finished generating a proof. A late proof cannot be used anyway so generation should stop when the window passes.
Verification of proofs: incoming ATXs are processed in batches. Cancellation is needed, among other things, to abort when shutting down a node.

Acceptance criteria

VerifyVRF, Verify and Generate can be passed a context that, when cancelled, aborts verification / generation of the proof.
- Signal to cancel is forwarded to Rust code via FFI
Generate is passed a context with a timeout at the end of the cycle gap.
Verify and VerifyVRF are passed the "App-Context" that is cancelled when the node is shut down.

Implementation hints

For VerifyVRF, checking a passed context.Context before calling the oracle is sufficient.
For Verify, checking a passed context.Context before calling the underlying Rust code is sufficient as well.
Generate already receives a context.Context that is not evaluated at the moment. A simple goroutine that signals the Rust code to stop when needed should be enough:

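// generator is a wrapper for an object from Rust;
// the FFI it wraps still needs to be defined.
generator := postrs.Generate()
defer generator.Stop()

done := make(chan struct{})
var eg errgroup.Group
eg.Go(func() error {
	select {
	case <-done:
		// generation finished on its own
		return nil
	case <-ctx.Done():
		// context cancelled: signal the Rust code to stop
		return generator.Stop()
	}
})

proof, err := generator.Start()
close(done)
// wait for the watcher goroutine so it does not outlive this call
if stopErr := eg.Wait(); err == nil {
	err = stopErr
}
return proof, err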

Remove confusing logs during nipost validation

> 2023-06-14T12:27:43.666+0100    INFO    02682.post      proving: generated proof        {"node_id": "02682d8d0eefae9596c65a9d9c4fac73576a6ecaaae236e167b4e7329512c7b9", "module": "post"}
2023-06-14T12:27:43.666+0100    DEBUG   02682.post      proving: generated proof        {"node_id": "02682d8d0eefae9596c65a9d9c4fac73576a6ecaaae236e167b4e7329512c7b9", "module": "post", "Nonce": 8, "Indices": "de76780728c2a9103883264000533cf1a0497ef9e2bd4fe5466fd12b0e912dd71cf693ec3fa0caff833896b621f12388464276b9bdc450c77d0a301a7f1945785236f105cd7742a90d0949c8a813ba4a455b74bb75e0a8f560ea116bbb54a7442cfe4f1117e7acce9ab45b994047c1d295b3a2e598411c6cecf65c581f56af33e931276dbe50cedd68127bd2c1f0b3b96300d574b78ae9ac0f", "K2PoW": 5764607523034254532}
2023-06-14T12:55:10.148+0100    DEBUG   02682.post      Initializing labels 1329758894..1329758895...   {"node_id": "02682d8d0eefae9596c65a9d9c4fac73576a6ecaaae236e167b4e7329512c7b9", "module": "post", "module": "post::initialize", "file": "src\\initialize.rs", "line": 108}
2023-06-14T12:55:10.149+0100    DEBUG   02682.post      Initializing labels 2288144977..2288144978...   {"node_id": "02682d8d0eefae9596c65a9d9c4fac73576a6ecaaae236e167b4e7329512c7b9", "module": "post", "module": "post::initialize", "file": "src\\initialize.rs", "line": 108}

that's what's logged during nipost. Should be removed because it's confusing.

CI fails on windows_latest image

Testing on the Windows image fails with errors.

Allocation error

==2832==ERROR: ThreadSanitizer failed to allocate 0x000000999000 (10063872) bytes at 0x200db79ba0000 (error code: 87)

A possible solution is to add this step before running the unit tests:

- name: Install mingw 10.2.0
  if: matrix.os == 'windows-latest'
  run: choco install mingw --version 10.2.0 --allow-downgrade

actions/runner-images#5841

Another error:

c:/programdata/chocolatey/lib/mingw/tools/install/mingw64/bin/../lib/gcc/x86_64-w64-mingw32/10.2.0/../../../../x86_64-w64-mingw32/lib/../lib/libmsvcrt.a(/2203): duplicate symbol reference: _unlock_file in both libgcc(.text) and libgcc(.data)
libgcc(.text): relocation target atexit not defined

Implement prover

PoST package structure for v1.0

The package structure of PoST is quite confusing and could be improved for the next major release. I propose the following layout for clearer naming of types and generally reducing the number of individual packages a user of the library has to interact with:

post: The root package containing the module definition, the initializer, all types and functions associated with it, and everything that is currently in config and shared and doesn't fit in one of the following packages
post/internal: Internal functionality not to be used directly by users of this library. This includes the bridges to the C and Rust code and possibly other internal types and functions not intended to be available outside of the module
post/internal/oracle: The oracle should be an internal package since its use outside of generating and verifying proofs is limited, and users of the library should call those functions rather than the oracle directly.
post/proof: Contains the Generate and Verify functions, as well as the Proof and Metadata types from shared.

With this structure it is unlikely that a user of the library has to import more than one package at a time, while at the moment proving and verifying always also require importing config and shared, both of which collide with packages of the same name in other modules. post.Config is more meaningful than config.Config, and proving.Generate / verifying.Verify are less clear than proof.Generate / proof.Verify.

There is also a clear dependency chain between the packages in this structure:

post/proof -> post -> post/internal

The former packages only import the latter ones, and not vice versa.

Integration of OpenCL initialiser into `post` and `go-spacemesh`

Description

The goal is to replace the existing gpu-post integration with the newly written OpenCL initializer.

Acceptance Criteria

Existing gpu-post integration is disabled, not removed (in case it is still needed). Switching between OpenCL and gpu-post is OK to require a rebuild of post.
Test if go-spacemesh can initialize using OpenCL instead of gpu-post

Leaking goroutines in tryNonces

Multiple miners in 0.2 devnet are leaking goroutines:

goroutine profile: total 598
323 @ 0x43c745 0x407d0a 0x407ab5 0xbe1d87 0x472161
#	0xbe1d86	github.com/spacemeshos/post/proving.(*Prover).tryNonces.func1+0x66	/go/pkg/mod/github.com/spacemeshos/[email protected]/proving/proving.go:263

30 @ 0x43c745 0x44c6cf 0xe222f4 0x472161
#	0xe222f3	github.com/spacemeshos/go-spacemesh/p2p/net.(*MsgConnection).sendListener+0xf3	/go/src/github.com/spacemeshos/go-spacemesh/p2p/net/msgcon.go:157

23 @ 0x43c745 0x44c6cf 0xe2ce05 0xe23bea 0x472161
#	0xe2ce04	github.com/spacemeshos/go-spacemesh/p2p/net.(*udpConnWrapper).Read+0xc4			/go/src/github.com/spacemeshos/go-spacemesh/p2p/net/udp.go:402
#	0xe23be9	github.com/spacemeshos/go-spacemesh/p2p/net.(*MsgConnection).beginEventProcessing+0x89	/go/src/github.com/spacemeshos/go-spacemesh/p2p/net/msgcon.go:265

Stack frames are accurate for current head of develop (5cf91a7)

Create a CLI wrapper for Initialize

We wish to be able to run "initialize" from post as a CLI.
The CLI should accept 3 arguments as follows:

--space - The total space for the init
--filesize - The size of each file in the init
--homedir - The target path to save the init files

Example:
./post-init --space=104876 --filesize=262144

This will create init files of 262144 bytes each in the default home dir.

Return value: exit code 0 on success
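A minimal sketch of how the three arguments could be parsed with Go's standard flag package is shown below. initOpts and parseInitFlags are hypothetical names, and treating --space as a byte count is an assumption (the issue only states the unit for --filesize).

```go
package main

import (
	"errors"
	"flag"
)

// initOpts holds the three requested options (hypothetical type).
type initOpts struct {
	Space    uint64 // total space for the init, assumed to be in bytes
	FileSize uint64 // size of each init file in bytes
	HomeDir  string // target path; empty means "use the default home dir"
}

// parseInitFlags parses --space, --filesize and --homedir from args.
func parseInitFlags(args []string) (initOpts, error) {
	var o initOpts
	fs := flag.NewFlagSet("post-init", flag.ContinueOnError)
	fs.Uint64Var(&o.Space, "space", 0, "total space for the init")
	fs.Uint64Var(&o.FileSize, "filesize", 0, "size of each file in the init")
	fs.StringVar(&o.HomeDir, "homedir", "", "target path to save the init files")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.Space == 0 || o.FileSize == 0 {
		return o, errors.New("--space and --filesize must be greater than zero")
	}
	return o, nil
}
```

A main function would call parseInitFlags(os.Args[1:]), run the initialization, and exit with a non-zero code on any error so that exit code 0 signals success.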

Verify post init-data feature

Add a feature to verify existing post data. The user specifies the post params (including the smesher id) and the method runs tests to verify that the post is valid for the smesher. The motivation is to enable users to verify that a post they created is valid without having to wait for the node to use it and log errors in case it is invalid.

Panic: runtime error: invalid memory address or nil pointer dereference (initialization.(*Initializer).NumLabelsWritten(...))

For more information please check logs here:
Sentry issue on Windows: SMAPP-14S
Sentry issue on Linux: SMAPP-15D

2023-01-06T10:51:05.417-0800	WARN	d898e.hare         	missed hare window, skipping layer	{"node_id": "d898e00cca6e1549a760df69947bc63c544313f30f8800e1ed0f48e5f9c7d160", "module": "hare", "layer_id": 5530, "name": "hare"}
	2023-01-06T10:51:05.417-0800	INFO	00000.defaultLogger	starting new grpc server with 7 registered service(s)
	2023-01-06T10:51:05.417-0800	INFO	app started
	2023-01-06T10:51:05.417-0800	INFO	00000.defaultLogger	starting new grpc server on :9092
	2023-01-06T10:51:06.256-0800	INFO	00000.defaultLogger	GRPC MeshService.CurrentLayer
	2023-01-06T10:51:06.261-0800	INFO	00000.defaultLogger	GRPC GlobalStateService.AccountDataQuery
	2023-01-06T10:51:06.268-0800	INFO	00000.defaultLogger	GRPC NodeService.Status
	2023-01-06T10:51:06.270-0800	INFO	00000.defaultLogger	GRPC SmesherService.SmesherID
	2023-01-06T10:51:06.297-0800	INFO	00000.defaultLogger	GRPC NodeService.Echo
	2023-01-06T10:51:06.299-0800	INFO	00000.defaultLogger	GRPC NodeService.Status
	2023-01-06T10:51:06.301-0800	INFO	00000.defaultLogger	GRPC NodeService.StatusStream
	2023-01-06T10:51:06.301-0800	INFO	00000.defaultLogger	GRPC NodeService.ErrorStream
	2023-01-06T10:51:06.304-0800	INFO	00000.defaultLogger	GRPC SmesherService.PostSetupStatus
	panic: runtime error: invalid memory address or nil pointer dereference
	[signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0x107b105]
	
	goroutine 272 [running]:
	github.com/spacemeshos/post/initialization.(*Initializer).NumLabelsWritten(...)
		/home/runner/go/pkg/mod/github.com/spacemeshos/[email protected]/initialization/initialization.go:302
	github.com/spacemeshos/go-spacemesh/activation.(*PostSetupManager).Status(0xc0003e2480)
		/home/runner/work/go-spacemesh/go-spacemesh/activation/post.go:114 +0xa5
	github.com/spacemeshos/go-spacemesh/api/grpcserver.SmesherService.PostSetupStatus({{0x1b7b210, 0xc0003e2480}, {0x1b7f400, 0xc001258d20}, 0x3b9aca00}, {0xc0013f2510, 0xf}, 0x1c)
		/home/runner/work/go-spacemesh/go-spacemesh/api/grpcserver/smesher_service.go:173 +0x59
	github.com/spacemeshos/api/release/go/spacemesh/v1._SmesherService_PostSetupStatus_Handler.func1({0x1b7a138, 0xc000472c90}, {0x157a960?, 0xc000472c00})
		/home/runner/go/pkg/mod/github.com/spacemeshos/api/release/[email protected]/spacemesh/v1/smesher.pb.go:646 +0x78
	github.com/grpc-ecosystem/go-grpc-middleware/logging/zap.UnaryServerInterceptor.func1({0x1b7a138, 0xc000472c60}, {0x157a960, 0xc000472c00}, 0xc000946680, 0xc000111758)
		/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/logging/zap/server_interceptors.go:31 +0x115
	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1b7a138?, 0xc000472c60?}, {0x157a960?, 0xc000472c00?})
		/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25 +0x3a
	github.com/grpc-ecosystem/go-grpc-middleware/tags.UnaryServerInterceptor.func1({0x1b7a138?, 0xc000472bd0?}, {0x157a960, 0xc000472c00}, 0xc000946680, 0xc0009466a0)
		/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/tags/interceptors.go:23 +0xa6
	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x1b7a138?, 0xc000472bd0?}, {0x157a960?, 0xc000472c00?})
		/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:25 +0x3a
	github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x1b7a138, 0xc000472bd0}, {0x157a960, 0xc000472c00}, 0xc001569a20?, 0x14e7420?)
		/home/runner/go/pkg/mod/github.com/grpc-ecosystem/[email protected]/chain.go:34 +0xbe
	github.com/spacemeshos/api/release/go/spacemesh/v1._SmesherService_PostSetupStatus_Handler({0x167b2a0?, 0xc001431530}, {0x1b7a138, 0xc000472bd0}, 0xc00027b030, 0xc001431050)
		/home/runner/go/pkg/mod/github.com/spacemeshos/api/release/[email protected]/spacemesh/v1/smesher.pb.go:648 +0x138
	google.golang.org/grpc.(*Server).processUnaryRPC(0xc0017a63c0, {0x1b846a0, 0xc000007d40}, 0xc000f17200, 0xc0014315c0, 0x252c3b8, 0x0)
		/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1340 +0xd23
	google.golang.org/grpc.(*Server).handleStream(0xc0017a63c0, {0x1b846a0, 0xc000007d40}, 0xc000f17200, 0x0)
		/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:1713 +0xa2f
	google.golang.org/grpc.(*Server).serveStreams.func1.2()
		/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:965 +0x98
	created by google.golang.org/grpc.(*Server).serveStreams.func1
		/home/runner/go/pkg/mod/google.golang.org/[email protected]/server.go:963 +0x28a
	2023-01-06T10:59:55.009-0800	INFO	00000.defaultLogger	App version: v0.2.20-beta.0. Git: 8c5a399 - 8c5a3991491b3663973766942d1b7e06fc194af8 . Go Version: go1.19.4. OS: linux-amd64 
	2023-01-06T10:59:55.009-0800	INFO	00000.defaultLogger	Welcome to Spacemesh. Spacemesh full node is starting...

Disable looking for VRF nonce past initialized data when using `-from/toFile` flag

The postcli can be used to initialize only a subset of data using -fromFile and -toFile flags (for example when re-initializing corrupted/missing data on a different machine or splitting initialization between machines).

The current behavior is to continue looking for a VRF nonce past the initialized range (which is very likely to happen when initializing only a subset). It should be possible to disable looking for a VRF nonce past the requested size, either automatically when the -fromFile or -toFile flag is used or via a flag such as -disableEagerVrfSearch.

Post initialization stops at wrong places causing the post to be invalid.

Environment: MacBook Pro M1

Consider the following logs:

23476:2023-03-29T15:01:06.595+0200	INFO	7b957.app.7b957.post	initialization: file #28 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23477:2023-03-29T15:01:06.595+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #29; target number of labels: 625000, start position: 18125000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23478:2023-03-29T15:01:06.595+0200	INFO	7b957.app.7b957.post	initialization: file #29 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23479:2023-03-29T15:01:06.595+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #30; target number of labels: 625000, start position: 18750000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23480:2023-03-29T15:01:06.595+0200	INFO	7b957.app.7b957.post	initialization: file #30 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23481:2023-03-29T15:01:06.595+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #31; target number of labels: 625000, start position: 19375000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
[cut cut cut all was the same]
23600:2023-03-29T15:01:06.605+0200	INFO	7b957.app.7b957.post	initialization: file #90 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23601:2023-03-29T15:01:06.605+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #91; target number of labels: 625000, start position: 56875000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23602:2023-03-29T15:01:06.605+0200	INFO	7b957.app.7b957.post	initialization: file #91 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23603:2023-03-29T15:01:06.605+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #92; target number of labels: 625000, start position: 57500000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23604:2023-03-29T15:01:06.605+0200	INFO	7b957.app.7b957.post	initialization: file #92 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23605:2023-03-29T15:01:06.605+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #93; target number of labels: 625000, start position: 58125000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23606:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #93 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23607:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #94; target number of labels: 625000, start position: 58750000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23608:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #94 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23609:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #95; target number of labels: 625000, start position: 59375000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23610:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #95 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23611:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #96; target number of labels: 625000, start position: 60000000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23612:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #96 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23613:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #97; target number of labels: 625000, start position: 60625000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23614:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #97 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23615:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #98; target number of labels: 625000, start position: 61250000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23616:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #98 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23617:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #99; target number of labels: 625000, start position: 61875000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23618:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: file #99 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23619:2023-03-29T15:01:06.606+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #100; target number of labels: 414560, start position: 62500000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23620:2023-03-29T15:01:06.607+0200	INFO	7b957.app.7b957.post	initialization: file #100 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
23621:2023-03-29T15:01:06.607+0200	INFO	7b957.app.7b957.post	post setup completed	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "commitment_atx": "0657ad4e89", "data_dir": "../pos_data/0", "num_units": "3", "labels_per_unit": "20971520", "provider": "1", "name": "post"}
23622:2023-03-29T15:01:06.607+0200	ERROR	7b957.app.7b957.atxBuilder	Failed to generate proof: %!w(*fmt.wrapError=&{post execution: generate proof: not completed 0x14000c556c0})	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "atxBuilder"}

When I restart the node, it starts to initialize again with the following log:

46428:2023-03-29T15:11:29.205+0200	INFO	7b957.app.7b957.post	initialization: file #17 already initialized; number of labels: 625000, start position: 10625000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46429:2023-03-29T15:11:29.205+0200	INFO	7b957.app.7b957.post	initialization: file #18 already initialized; number of labels: 625000, start position: 11250000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46430:2023-03-29T15:11:29.205+0200	INFO	7b957.app.7b957.post	initialization: file #19 already initialized; number of labels: 625000, start position: 11875000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46431:2023-03-29T15:11:29.205+0200	INFO	7b957.app.7b957.post	initialization: file #20 already initialized; number of labels: 625000, start position: 12500000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
46432:2023-03-29T15:11:29.205+0200	INFO	7b957.app.7b957.post	initialization: continuing to write file #21; current number of labels: 573440, target number of labels: 625000, start position: 13125000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}

The network config is:

            "post-labels-per-unit": 20971520,
            "post-max-numunits": 5000
        }

The node config for smeshing is:

            "smeshing-opts-maxfilesize": 10000000,
            "smeshing-opts-numunits": 3,
            "smeshing-opts-provider": 1,
            "smeshing-opts-throttle": false

Before the first log above, this is what happens:

2023-03-29T15:01:06.593+0200	INFO	7b957.app.7b957.post	initialization: file #21 completed; number of labels written: 573440	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.593+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #22; target number of labels: 625000, start position: 13750000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.593+0200	INFO	7b957.app.7b957.post	initialization: file #22 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.593+0200	INFO	7b957.app.7b957.post	initialization: starting to write file #23; target number of labels: 625000, start position: 14375000	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
2023-03-29T15:01:06.594+0200	INFO	7b957.app.7b957.post	initialization: file #23 completed; number of labels written: 0	{"node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "app", "node_id": "7b957421cc71909520121ea38720e56c2ee45d4faa336e0f9502f65caf62d790", "module": "post"}
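The expected file layout for this config can be reproduced with a little arithmetic, which makes it easier to see where initialization should have stopped. The sketch below assumes 16 bytes per label; that value is inferred from the logs (10000000 / 625000) rather than stated in the issue, and fileLayout is a hypothetical name.

```go
package main

// fileLayout returns the target number of labels for each init file.
// labelSize is the size of one label in bytes (assumed, see the text above).
func fileLayout(numUnits, labelsPerUnit, maxFileSize, labelSize uint64) []uint64 {
	if labelSize == 0 || maxFileSize < labelSize {
		return nil // no label fits in a file
	}
	total := numUnits * labelsPerUnit   // labels in the whole post
	perFile := maxFileSize / labelSize  // labels that fit in one file
	var layout []uint64
	for start := uint64(0); start < total; start += perFile {
		n := perFile
		if total-start < perFile {
			n = total - start // the last file is only partially filled
		}
		layout = append(layout, n)
	}
	return layout
}
```

With numUnits = 3, labelsPerUnit = 20971520 and maxFileSize = 10000000, this yields 101 files (#0 to #100): 100 files of 625000 labels and a final file of 414560 labels, matching the targets and start positions in the logs. A correct run would therefore write 62914560 labels in total, whereas the logs show files #22 to #100 completing with 0 labels written.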

Support multiple providers init tasks

Add support for using more than 1 provider for creating a post init job.

The motivation for this is users with multiple GPUs on their systems, for example a laptop with 2 GPUs, or a desktop with 2 discrete GPU cards and a supported GPU on the motherboard.
We'd like to enable these users to create a post init in less time and fully utilize their system's GPU compute resources.

Add an API method where the user can provide an array of distinct system providers for creating a post job. The implementation should utilize all providers in parallel to create the post init data. Note that the gpu-post lib already supports execution on multiple providers in parallel.

@moshababo
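One possible shape for the multi-provider job, as a rough sketch only: split the label range into contiguous chunks and run one goroutine per provider. The names `labelRange`, `splitRange`, `computeFunc` and `runOnProviders` are made up for illustration and are not part of post or gpu-post.

```go
package main

import (
	"fmt"
	"sync"
)

// labelRange is a half-open interval [Start, End) of label indices.
type labelRange struct{ Start, End uint64 }

// splitRange divides [start, end) into at most n contiguous, near-equal ranges.
func splitRange(start, end uint64, n int) []labelRange {
	if n <= 0 || end <= start {
		return nil
	}
	total := end - start
	chunk, rem := total/uint64(n), total%uint64(n)
	ranges := make([]labelRange, 0, n)
	pos := start
	for i := 0; i < n; i++ {
		size := chunk
		if uint64(i) < rem {
			size++ // spread the remainder over the first ranges
		}
		if size == 0 {
			break // more providers than labels
		}
		ranges = append(ranges, labelRange{pos, pos + size})
		pos += size
	}
	return ranges
}

// computeFunc is a hypothetical stand-in for generating labels on one provider.
type computeFunc func(providerID uint, r labelRange) error

// runOnProviders runs compute for each provider on its own range, in parallel,
// and returns the first error encountered (if any).
func runOnProviders(providers []uint, start, end uint64, compute computeFunc) error {
	ranges := splitRange(start, end, len(providers))
	errs := make([]error, len(ranges))
	var wg sync.WaitGroup
	for i, r := range ranges {
		wg.Add(1)
		go func(i int, r labelRange) {
			defer wg.Done()
			errs[i] = compute(providers[i], r)
		}(i, r)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := runOnProviders([]uint{0, 1, 2}, 0, 10, func(id uint, r labelRange) error {
		fmt.Printf("provider %d: labels [%d, %d)\n", id, r.Start, r.End)
		return nil
	})
	fmt.Println("error:", err)
}
```

An equal split assumes providers of similar speed. With mixed hardware, handing out smaller chunks from a shared queue would probably balance the load better.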

Why 100 labels proven?

Hello, I read in the Chia paper that they only need one challenge. Why do you use ~100 challenges on the labels for the proof?

add gRPC layer

@noamnelke let me know if you want me to work on it.

Verify POS initialization inflight

We could catch if initialized labels are invalid very early on when they are created and before writing to a file. This would slow down initialization (how much depends on the % of labels checked), but give a higher confidence that the generated POS is valid and allow catching problems early.

It's essential that the problem of invalid labels is communicated to users in a way that is understandable, so that they can take action themselves. We should not just panic with the message "labels are invalid", because users will be confused.

Proof generation and verification is very slow

First of all, measuring proof generation performance is not easy as the time to find the proof is random. Changing a single bit of input data might make it do much more work.

I ran some experiments and I was able to come up with input data (8MB POS file) for which it takes 95s to generate a proof on my machine (i7-12700H with 14 cores). Below is the top from pprof:

      flat  flat%   sum%        cum   cum%
    40.17s 33.09% 33.09%     40.17s 33.09%  runtime.futex
     8.96s  7.38% 40.48%     36.37s 29.96%  runtime.selectgo
     6.13s  5.05% 45.53%      9.83s  8.10%  runtime.lock2
     6.02s  4.96% 50.49%      6.18s  5.09%  runtime.unlock2
     5.68s  4.68% 55.17%      5.68s  4.68%  github.com/minio/sha256-simd.blockSha
     3.66s  3.02% 58.18%      3.66s  3.02%  runtime.procyield
     2.79s  2.30% 60.48%      7.80s  6.43%  runtime.mallocgc
     2.43s  2.00% 62.48%      2.43s  2.00%  runtime.memmove
     2.25s  1.85% 64.34%      2.76s  2.27%  runtime.casgstatus
     2.15s  1.77% 66.11%     11.38s  9.38%  runtime.sellock

The proof generation procedure spends most of the time synchronizing the channel that is used to pass input data to workers label by label (8B at a time).
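To illustrate the point about per-label channel sends, here is a minimal sketch that passes batches of labels to the workers instead, so the synchronization cost is paid once per batch rather than once per 8-byte label. The threshold comparison is only a placeholder for the real per-label proving work, and `countBelow` and `makeLabels` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"sync"
)

const labelSize = 8 // bytes per label, as stated in the issue

// countBelow counts labels whose little-endian value is below threshold.
// Workers receive whole batches over the channel, not single labels.
func countBelow(data []byte, threshold uint64, workers, batchLabels int) uint64 {
	if workers < 1 {
		workers = 1
	}
	if batchLabels < 1 {
		batchLabels = 1
	}
	batches := make(chan []byte, workers)
	results := make(chan uint64, workers) // one result per worker
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var n uint64
			for b := range batches {
				for off := 0; off+labelSize <= len(b); off += labelSize {
					if binary.LittleEndian.Uint64(b[off:]) < threshold {
						n++
					}
				}
			}
			results <- n
		}()
	}
	step := batchLabels * labelSize
	for off := 0; off < len(data); off += step {
		end := off + step
		if end > len(data) {
			end = len(data)
		}
		batches <- data[off:end] // sub-slice, no copy
	}
	close(batches)
	wg.Wait()
	close(results)
	var total uint64
	for n := range results {
		total += n
	}
	return total
}

// makeLabels builds n labels holding the values 0..n-1 (test data only).
func makeLabels(n int) []byte {
	data := make([]byte, n*labelSize)
	for i := 0; i < n; i++ {
		binary.LittleEndian.PutUint64(data[i*labelSize:], uint64(i))
	}
	return data
}

func main() {
	fmt.Println(countBelow(makeLabels(100), 10, 4, 16))
}
```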

Check the error handling next to postdata_metadata.json

Rationale

There are claims in the community that postdata_metadata.json was corrupted or empty after initialization.
Some say it happened because they ran out of disk space, but other users had no disk space problems. So it seems that if PoS initialization fails for any reason, it breaks everything.

That means:

The user loses the nonce found during initialization.
They need to find a new nonce, which is not an easy task.
The user cannot start the node (it crashes).

We need to

Handle failures during PoS initialization properly: do not leave files in an inconsistent state if possible.
Since there might still be I/O problems, the node should handle that case correctly.
So if postdata_metadata.json is corrupted or empty and there is no postdata_N.bin, it can just recreate everything.
If some post data has already been generated, the case is more complicated, and most likely we need the user's attention to decide what to do with it.
For example, say we cannot write valid JSON or remove an inconsistent file because the user unplugged their external hard drive. The node should then not crash if no PoS data has been generated yet; it should just recreate everything.
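A common way to avoid leaving a half-written metadata file behind is to write to a temporary file, fsync it, and rename it over the target. The sketch below shows that pattern together with a loader that reports an empty or unparsable file as a distinct error, so the caller can decide whether to recreate it. The `metadata` struct and its fields are simplified assumptions, not the real file layout.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// metadata is a hypothetical, trimmed-down stand-in for postdata_metadata.json.
type metadata struct {
	NumUnits uint32  `json:"NumUnits"`
	Nonce    *uint64 `json:"Nonce,omitempty"`
}

var errCorrupt = errors.New("metadata file is empty or corrupted")

// saveMetadata writes to a temp file, fsyncs it, then renames it over path, so
// a crash leaves either the old file or the new one, never a partial one.
func saveMetadata(path string, m metadata) error {
	data, err := json.Marshal(m)
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// loadMetadata returns errCorrupt for an empty or unparsable file so that the
// caller can choose between recreating it and asking the user.
func loadMetadata(path string) (metadata, error) {
	var m metadata
	data, err := os.ReadFile(path)
	if err != nil {
		return m, err
	}
	if len(data) == 0 || json.Unmarshal(data, &m) != nil {
		return metadata{}, errCorrupt
	}
	return m, nil
}

// demo round-trips a metadata value through a temporary directory.
func demo() (metadata, error) {
	dir, err := os.MkdirTemp("", "post-meta")
	if err != nil {
		return metadata{}, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "postdata_metadata.json")
	nonce := uint64(42)
	if err := saveMetadata(path, metadata{NumUnits: 4, Nonce: &nonce}); err != nil {
		return metadata{}, err
	}
	return loadMetadata(path)
}

func main() {
	m, err := demo()
	fmt.Println(m.NumUnits, err)
}
```

The sketch does not fsync the parent directory after the rename; on some file systems that extra step is needed for the rename itself to survive a crash.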

gpu smesher setup - master tracking issue

Manage the gpu contractors, build, review and benchmark their code.
Currently setting up the dev env on one of our Windows boxes and trying to build & benchmark.

Tests fail on Windows because of file system specifics

TestInitialize, TestInitialize_NumUnits_Decrease and TestInitialize_MultipleFiles fail.

The failures are related to Windows file system specifics. The fix requires slightly rewriting the file operations.

Refactor status updates by PoST Initializer

Based on the discussion in these two comments:

At the moment the Initializer type is overly complex in its usage. It is designed to be reused when it doesn't need to be, and it overuses channels and goroutines. The type can be simplified in the following manner:

Simplify Initializer by making it non-reusable; besides tests the type isn't used in this manner anyway
Remove Reset(): it is only used in tests and the directory used by Initializer is known to the caller. In tests the caller can delete the directory itself. In Production this shouldn't be done anyway; switching identities should instead use a different datadir for PoST.
Remove SessionNumLabelsWrittenChan(); there is no need for this method when SessionNumLabelsWritten() returns the same information in a synchronous way. Callers interested in updates should rely on the values returned by the second method and the status from Started(), Completed(), isInitializing() or even a new single method that combines the purpose of all 3.
Initialize() should take a context as parameter and instead of Stop() callers should just cancel the context they passed in.

Refactorings of types using Initializer in go-spacemesh:

PostSetupManager:
- StatusChan() should just call SessionNumLabelsWritten() and return the status directly instead of via a channel
SmesherService:
- PostSetupStatusStream() should, instead of relying on receiving values at regular intervals via the channel, call the (new) synchronous method in PostSetupManager at regular intervals.
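A minimal sketch of what the simplified, single-use type could look like, assuming cancellation through the context and a synchronous progress getter. This `Initializer` is a toy with invented fields and only simulates the label writes. It is not the existing type.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// Initializer is a hypothetical single-use sketch, not the real type in post.
type Initializer struct {
	total   uint64
	batch   uint64
	written atomic.Uint64
}

// NewInitializer returns an initializer that writes total labels in batches.
func NewInitializer(total, batch uint64) *Initializer {
	if batch == 0 {
		batch = 1
	}
	return &Initializer{total: total, batch: batch}
}

// Initialize writes labels batch by batch until done or until ctx is cancelled.
// There is no Stop(): callers cancel the context they passed in.
func (i *Initializer) Initialize(ctx context.Context) error {
	for i.written.Load() < i.total {
		if err := ctx.Err(); err != nil {
			return err
		}
		n := i.batch
		if rest := i.total - i.written.Load(); rest < n {
			n = rest
		}
		// A real implementation would compute and write n labels here.
		i.written.Add(n)
	}
	return nil
}

// NumLabelsWritten reports progress synchronously; no channel is needed.
func (i *Initializer) NumLabelsWritten() uint64 { return i.written.Load() }

// run drives one initialization, optionally cancelling before it starts.
func run(total, batch uint64, cancelFirst bool) (uint64, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	in := NewInitializer(total, batch)
	err := in.Initialize(ctx)
	return in.NumLabelsWritten(), err
}

func main() {
	fmt.Println(run(1000, 64, false))
	fmt.Println(run(1000, 64, true))
}
```

A status stream in go-spacemesh could then poll `NumLabelsWritten()` on a ticker instead of consuming a channel.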

Add POS verification to `postcli`

Add a new command to postcli to verify POS data at a given location. The idea is to cross-check a randomly picked subset of labels in the POS data against labels generated with the CPU implementation.

Implement verifier in a separate package
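The cross-check could look roughly like the sketch below: pick random label indices, recompute each label, and compare it with the stored bytes. The `recompute` callback stands in for the CPU label implementation, and `verifySample` with its signature is an illustrative assumption rather than an existing API.

```go
package main

import (
	"bytes"
	"fmt"
	"math/rand"
)

// verifySample checks `samples` randomly chosen labels in stored against
// recompute and returns the indices that do not match. An index can be
// sampled, and therefore reported, more than once.
func verifySample(stored []byte, labelSize, samples int, seed int64,
	recompute func(index uint64) []byte) []uint64 {
	if labelSize <= 0 || len(stored) < labelSize {
		return nil
	}
	numLabels := int64(len(stored) / labelSize)
	rng := rand.New(rand.NewSource(seed))
	var bad []uint64
	for s := 0; s < samples; s++ {
		idx := rng.Int63n(numLabels)
		off := idx * int64(labelSize)
		if !bytes.Equal(stored[off:off+int64(labelSize)], recompute(uint64(idx))) {
			bad = append(bad, uint64(idx))
		}
	}
	return bad
}

func main() {
	// Toy data: one-byte labels where label i has the value byte(i).
	stored := make([]byte, 256)
	for i := range stored {
		stored[i] = byte(i)
	}
	recompute := func(i uint64) []byte { return []byte{byte(i)} }

	fmt.Println("bad before corruption:", len(verifySample(stored, 1, 32, 1, recompute)))
	stored[7] ^= 0xff // corrupt a single label
	fmt.Println("bad after corruption: ", verifySample(stored, 1, 4096, 1, recompute))
}
```

Sampling only a subset trades certainty for speed: a small number of corrupted labels can go unnoticed unless the sample is large.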

Validate scrypt params when performing POS initialization

post/config/config.go

Line 81 in eee6c3c

func (p *ScryptParams) Validate() error {

Is not called as part of the initialization flow.
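For illustration, a sketch of calling the validation at the start of the init flow. The `scryptParams` struct, its checks and `initialize` are assumptions made for this example. The real `ScryptParams.Validate()` in config.go may check different things.

```go
package main

import (
	"errors"
	"fmt"
)

// scryptParams is a hypothetical mirror of the config type.
type scryptParams struct{ N, R, P uint }

// Validate applies the usual scrypt constraints: N must be a power of two
// greater than 1, and r and p must be non-zero.
func (p scryptParams) Validate() error {
	if p.N < 2 || p.N&(p.N-1) != 0 {
		return errors.New("N must be a power of two greater than 1")
	}
	if p.R == 0 || p.P == 0 {
		return errors.New("R and P must be non-zero")
	}
	return nil
}

// initialize rejects bad parameters before any label is computed.
func initialize(p scryptParams) error {
	if err := p.Validate(); err != nil {
		return fmt.Errorf("invalid scrypt params: %w", err)
	}
	// Label generation would start here.
	return nil
}

func main() {
	fmt.Println(initialize(scryptParams{N: 8192, R: 1, P: 1}))
	fmt.Println(initialize(scryptParams{N: 3, R: 1, P: 1}))
}
```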

Dynamically calculate BatchSize

Not sure if it's possible, but it would be good if we could calculate it dynamically based on:

GPU characteristics
a hardcoded table(?)

Provide a set of benchmarks / validations that check the user's platform and guesstimate how much storage can be committed

We need to create a benchmark so that no one commits too much storage. It should:

benchmark the proof generation time (per space unit)
- measure disk speed
- measure CPU speed
- measure GPU speed for data generation (nice to have)
benchmark the proof validation (per proof)

Post-rs integration creates a new initializer for every batch

2023/05/12 09:33:41     DEBUG   initialization: file #10 current position: 9437184, remaining: 57671680
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:33:46     DEBUG   initialization: file #10 current position: 10485760, remaining: 56623104
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:33:50     DEBUG   initialization: file #10 current position: 11534336, remaining: 55574528
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:33:54     DEBUG   initialization: file #10 current position: 12582912, remaining: 54525952
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes
2023/05/12 09:33:59     DEBUG   initialization: file #10 current position: 13631488, remaining: 53477376
Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce RTX 3090
device memory: 24259 MB, max_mem_alloc_size: 6064 MB, max_compute_units: 82, max_wg_size: 1024
preferred_wg_size_multiple: 32, kernel_wg_size: 256
Using: global_work_size: 12128, local_work_size: 32
Allocating buffer for input: 32 bytes
Allocating buffer for output: 388096 bytes
Allocating buffer for lookup: 6358564864 bytes

It should not work like that. Recreating the initializer (and reallocating its buffers) for every batch causes a significant performance drop (more than 50%).

Execute unit tests on arm64 runners

Acceptance criteria

The unit test job of post, as well as the release jobs, also runs on linux/arm64 and darwin/arm64

Implementation hints

Use the self-hosted VMs already available to the organization.

Instead of returning the first Nonce below difficulty, return the absolute lowest

Summary:

The work oracle is implemented to return the index of the first hash that is below the given difficulty threshold for Pow when calculating labels.

Instead it should return the index of the hash with the lowest value between StartPosition and EndPosition. Additionally when calculating leaves in batches the Initializer should continue to look for indices where the resulting hash is lower than the one already found.

The reason for this is that instead of finding the first Nonce that satisfies the given difficulty, this finds the "best" (lowest) nonce. If a smesher decides to increase their PoST storage in the future this gives them a higher chance of being able to re-use the nonce instead of being required to search for a new one. Additionally if the lowest found nonce doesn't satisfy the difficulty for the larger PoST they can be sure no index in the PoST already calculated does.

Acceptance criteria:

add a benchmark to compare the speed of looking for a nonce vs. not using pow during leaf calculation to estimate the impact of the change
- if looking for a nonce is costly (i.e. speed of initialization drops by more than 10%) we should not implement this change
when looking for a nonce don't return the first one below the difficulty threshold, but rather the lowest one
if a nonce was found in a batch during the initialization procedure use the value of the found nonce for the next batch to look for "better" nonces
- alternatively if it simplifies the implementation gpu-post can also just return all nonces that satisfy the difficulty threshold and post can use the CPU to find the lowest one among those found, since we are only expecting ~ 8 nonces during initialization the impact of this should be negligible.

Implementation hints:

in gpu-post the comparison of the calculated hash to D (the difficulty threshold) is here. This needs to be changed such that D is updated to the value found and the loop isn't aborted.
in gpu-post, in addition to the index of the found nonce, its value should be returned as well so it can be used as the new difficulty threshold
in post a found nonce should not stop the oracle from looking for one; instead, here the difficulty should be updated with the found value and the Initializer should look for better nonces in successive batches
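The batch-wise search described above can be sketched as follows: every batch is scanned against the current threshold, and whenever a lower hash is found the threshold is tightened to that hash for the rest of the scan. SHA-256 of the index is used here purely as a stand-in for the real work oracle, and all function names are hypothetical.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// hashAt is a stand-in for the per-index proof-of-work hash.
func hashAt(i uint64) [32]byte {
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], i)
	return sha256.Sum256(b[:])
}

// scanBatch returns the index in [start, end) with the lowest hash that is
// strictly below threshold, together with that hash. It never stops early.
func scanBatch(start, end uint64, threshold [32]byte) (uint64, [32]byte, bool) {
	var best uint64
	found := false
	for i := start; i < end; i++ {
		h := hashAt(i)
		if bytes.Compare(h[:], threshold[:]) < 0 {
			best, threshold, found = i, h, true
		}
	}
	return best, threshold, found
}

// findLowest walks [start, end) in batches and carries the best hash found so
// far into the next batch as the new threshold.
func findLowest(start, end, batch uint64, difficulty [32]byte) (uint64, bool) {
	if batch == 0 {
		batch = 1
	}
	threshold := difficulty
	var best uint64
	found := false
	for s := start; s < end; s += batch {
		e := s + batch
		if e > end {
			e = end
		}
		if i, h, ok := scanBatch(s, e, threshold); ok {
			best, threshold, found = i, h, true
		}
	}
	return best, found
}

func main() {
	var difficulty [32]byte
	difficulty[0] = 0x01 // accept only hashes below 0x01000000...
	idx, ok := findLowest(0, 100000, 4096, difficulty)
	fmt.Println(idx, ok)
}
```

Because the threshold only ever decreases, the result does not depend on the batch size, which is the property the acceptance criteria ask for.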

code coverage isn't measured and reported

The PoST repo is missing a coverage job that uploads code coverage to codecov.io. As a reference the job from poet can be used:

https://github.com/spacemeshos/poet/blob/6aa1e97ee23911bae6292c737345e2b4c9cd468f/.github/workflows/ci.yml#L94-L111

and

https://github.com/spacemeshos/poet/blob/develop/.github/codecov.yml

postcli limits log messages to 100/s

postcli is configured with zap's very opinionated Production preset, which includes a Sampler that limits logging to 100 logs/s. This causes a lot of confusion among users: the postcli output looks broken.

postcli writes key.bin as binary, while node expects hex

When you initialize with postcli and then try to use the same key in go-spacemesh, you get this error:

2023-07-10T22:21:18.166Z	INFO	00000.defaultLogger	Looking for identity file at `/spacemesh/data/post_data/key.bin`
2023-07-10T22:21:18.166Z	FATAL	00000.defaultLogger	could not retrieve identity: decoding private key: encoding/hex: invalid byte: U+00E5 'å'	{"name": ""}
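One way a reader could tolerate both encodings is to decide by length, as in the sketch below. It assumes the key is a 64-byte ed25519 private key, so a hex-encoded file has 128 characters. That assumption, and the helper names, are not taken from the codebase.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"encoding/hex"
	"fmt"
)

// decodeKey accepts either a raw 64-byte key or its 128-character hex form
// (optionally surrounded by whitespace) and returns the raw key bytes.
func decodeKey(raw []byte) ([]byte, error) {
	if len(raw) == ed25519.PrivateKeySize {
		return raw, nil // already binary
	}
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == hex.EncodedLen(ed25519.PrivateKeySize) {
		key := make([]byte, ed25519.PrivateKeySize)
		if _, err := hex.Decode(key, trimmed); err != nil {
			return nil, fmt.Errorf("decoding hex key: %w", err)
		}
		return key, nil
	}
	return nil, fmt.Errorf("unexpected key length %d", len(raw))
}

// encodeKey returns the hex form that the node expects.
func encodeKey(key []byte) []byte {
	out := make([]byte, hex.EncodedLen(len(key)))
	hex.Encode(out, key)
	return out
}

func main() {
	key := make([]byte, ed25519.PrivateKeySize)
	key[0] = 0xe5 // the byte from the error message above

	fromBinary, err := decodeKey(key)
	fmt.Println(len(fromBinary), err)

	fromHex, err := decodeKey(encodeKey(key))
	fmt.Println(bytes.Equal(fromHex, key), err)
}
```

The simpler fix is of course for postcli to write the same hex format the node reads; the tolerant reader only helps with files that already exist.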

Validate POST data app

It would be really great if we could provide a cross-platform CLI app that verifies post.
Not sure what's the best way to implement this - maybe we add a smesher API method to verify post and provide it via smrepl (formerly CLIWallet)?
The motivation is for users to be able to quickly verify existing post data.
@moshababo - what do you think?

GPU-Post Integration

Copied from the Epic which was open in go-spacemesh about this task:

We want to implement a stand-alone desktop GPU post generator process that can be started from other local processes such as a full node and provide a simple post init and progress api to local clients.

Must run on os x, linux and windows 10.
Use the gpu-post c lib internally for post generation.
Implement back-off when user interactivity is detected (@noamnelke idea)

GPU seems underutilized during POST generation process

command:

~/smesh/post/build$ ./postcli -commitmentAtxId=c230c51669d1fcd35860131e438e234726b2bd5f9adbbd91bd88a718e7e98ecb -provider=1 -id=804ce71657a91d935936700c03c0fa8ffb15a6dc0d1dd56b76c90299dd591334

output:

2023/01/24 14:18:59 	INFO	initialization: datadir: /home/alchemist/post/data, number of units: 99000001, max file size: 4294967296000, number of labels per unit: 4096, number of bits per label: 4096
2023/01/24 14:18:59 	INFO	initialization: files layout: number of files: 1, number of labels per file: 4294967296000, last file number of labels: 405504004096
2023/01/24 14:18:59 	INFO	initialization: continuing to write file #0; current number of labels: 4194304, target number of labels: 405504004096, start position: 0
2023/01/24 14:18:59 	DEBUG	initialization: file #0 current position: 4194304, remaining: 405499809792
WARNING: lavapipe is not a conformant vulkan implementation, testing use only.
2023/01/24 14:19:01 	DEBUG	initialization: file #0 current position: 4210688, remaining: 405499793408
2023/01/24 14:19:02 	DEBUG	initialization: file #0 current position: 4227072, remaining: 405499777024
2023/01/24 14:19:03 	DEBUG	initialization: file #0 current position: 4243456, remaining: 405499760640
2023/01/24 14:19:04 	DEBUG	initialization: file #0 current position: 4259840, remaining: 405499744256
2023/01/24 14:19:05 	DEBUG	initialization: file #0 current position: 4276224, remaining: 405499727872
2023/01/24 14:19:06 	DEBUG	initialization: file #0 current position: 4292608, remaining: 405499711488

providers:

([]gpu.ComputeProvider) (len=2 cap=2) {
 (gpu.ComputeProvider) {
  ID: (uint) 1,
  Model: (string) (len=32) "llvmpipe (LLVM 12.0.0, 256 bits)",
  ComputeAPI: (gpu.ComputeAPIClass) Vulkan
 },
 (gpu.ComputeProvider) {
  ID: (uint) 2,
  Model: (string) (len=3) "CPU",
  ComputeAPI: (gpu.ComputeAPIClass) CPU
 }
}

nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:03:00.0 Off |                  Off |
| 41%   46C    P8    10W / 140W |      7MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4500    Off  | 00000000:82:00.0 Off |                  Off |
| 30%   35C    P8    23W / 200W |      3MiB / 20186MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     56135      G   ./postcli                           3MiB |
+-----------------------------------------------------------------------------+

To summarize: from htop I can tell that provider 1 seems to be using a multicore approach, but I still see hardly any utilization on either GPU.

libpost is missing from postcli artifact

We must fix the pipeline so that the needed libraries are included in the artifacts. Currently the Linux build does not include libpost.so.

CL_MEM_OBJECT_ALLOCATION_FAILURE on GTX 980

2023-06-16T06:13:53.748+1000    DEBUG    5c0d8.post    Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce GTX 980    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 314}
2023-06-16T06:13:53.748+1000    DEBUG    5c0d8.post    device memory: 4036 MB, max_mem_alloc_size: 1009 MB, max_compute_units: 16, max_wg_size: 1024    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 134}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    preferred_wg_size_multiple: 32, kernel_wg_size: 256    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 168}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Using: global_work_size: 2016, local_work_size: 32    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 181}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Allocating buffer for input: 32 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 185}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Allocating buffer for output: 64512 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 193}
2023-06-16T06:13:53.824+1000    DEBUG    5c0d8.post    Allocating buffer for lookup: 1056964608 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 201}
Also this
2023-06-16T17:40:17.305+1000    DEBUG    5c0d8.post    Using provider: [GPU] NVIDIA CUDA/NVIDIA GeForce GTX 960    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 314}
2023-06-16T17:40:17.305+1000    DEBUG    5c0d8.post    device memory: 1996 MB, max_mem_alloc_size: 499 MB, max_compute_units: 8, max_wg_size: 1024    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 134}

2023-06-19T06:03:59.717+1000    DEBUG    5c0d8.post    Allocating buffer for input: 32 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 185}
2023-06-19T06:03:59.717+1000    DEBUG    5c0d8.post    Allocating buffer for output: 31744 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 193}
2023-06-19T06:03:59.717+1000    DEBUG    5c0d8.post    Allocating buffer for lookup: 520093696 bytes    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 201}
2023-06-19T06:03:59.735+1000    DEBUG    5c0d8.post    initializing 1 -> 993 (992 labels, GWS: 992)    {"node_id": "5c0d8fa190b0ec32a285fd2f663670939ac7a1367b5ab747550cb9403e19fa07", "module": "post", "module": "scrypt_ocl", "file": "scrypt-ocl/src/lib.rs", "line": 253}
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OclError(OclCore(Api(

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueNDRangeKernel("scrypt")  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors  

############################################################################# 
)))', ffi/src/initialization.rs:135:10

Check how PoST Proof generation handles missing / corrupted data

Description

When generating a PoST proof a node can fail to read data from disk, when some or all of the PoST data was deleted, or it could generate a proof based on corrupted / altered PoST data.

Acceptance criteria

A node must verify a generated proof after creating it.
- If the proof is detected as invalid mark the used label indices as corrupted and try again (only once)
- Failing to generate a proof must be logged such that an operator of a node is informed
If data is lost through disk failure or data corruption, the operator should be informed and given the option to continue smeshing with a smaller NumUnits value.
- Define the limit of lost/corrupted data that triggers the notification to the operator.
- Define an API via which smapp is signalled to display the information and give the operator the choice to stop smeshing or decrease the NumUnits value.
- If the operator chooses to decrease the NumUnits value, update the post metadata such that the next time an ATX is published it uses the lower value.
- If the user fails to respond before the deadline to create and publish an ATX, automatically disable smeshing.

After genesis

The operator is given the option to recreate the missing data and increase their PoST size again.
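The "verify after creating, retry only once" criterion could be sketched like this. The `generate` and `verify` callbacks stand in for the real prover and verifier, and the `proof` type is reduced to the label indices it uses. All of these are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// proof is a hypothetical, reduced proof: just the label indices it relies on.
type proof struct{ Indices []uint64 }

var errInvalidProof = errors.New("generated proof is invalid after retry")

// proveWithRetry generates a proof and verifies it. If verification fails, the
// indices used are marked as corrupted and generation is retried exactly once.
func proveWithRetry(
	generate func(excluded map[uint64]struct{}) (proof, error),
	verify func(proof) bool,
) (proof, error) {
	excluded := make(map[uint64]struct{})
	for attempt := 0; attempt < 2; attempt++ {
		p, err := generate(excluded)
		if err != nil {
			return proof{}, fmt.Errorf("generating proof (attempt %d): %w", attempt+1, err)
		}
		if verify(p) {
			return p, nil
		}
		for _, i := range p.Indices {
			excluded[i] = struct{}{} // treat these labels as corrupted
		}
	}
	return proof{}, errInvalidProof
}

func main() {
	generate := func(excluded map[uint64]struct{}) (proof, error) {
		if len(excluded) == 0 {
			return proof{Indices: []uint64{1, 2}}, nil // relies on corrupted labels
		}
		return proof{Indices: []uint64{3, 4}}, nil
	}
	verify := func(p proof) bool { return p.Indices[0] == 3 }

	p, err := proveWithRetry(generate, verify)
	fmt.Println(p.Indices, err)
}
```

The returned error is what would need to be logged, and surfaced to the operator, when the second attempt also fails.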

use functional options for functions in `internal` packages

As discussed here:

#173 (comment)
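For reference, the functional options pattern in its usual Go form is sketched below. The option names and defaults are invented for the example and do not correspond to any existing function in the `internal` packages.

```go
package main

import (
	"errors"
	"fmt"
)

// options holds the configurable values with their defaults applied first.
type options struct {
	batchSize int
	threads   int
}

// Option mutates options and may reject an invalid value.
type Option func(*options) error

// WithBatchSize overrides the default batch size.
func WithBatchSize(n int) Option {
	return func(o *options) error {
		if n <= 0 {
			return errors.New("batch size must be positive")
		}
		o.batchSize = n
		return nil
	}
}

// WithThreads overrides the default number of threads.
func WithThreads(n int) Option {
	return func(o *options) error {
		if n <= 0 {
			return errors.New("threads must be positive")
		}
		o.threads = n
		return nil
	}
}

// newOptions applies opts on top of the defaults, stopping at the first error.
func newOptions(opts ...Option) (options, error) {
	o := options{batchSize: 1024, threads: 1}
	for _, opt := range opts {
		if err := opt(&o); err != nil {
			return options{}, err
		}
	}
	return o, nil
}

func main() {
	o, err := newOptions(WithBatchSize(64))
	fmt.Println(o.batchSize, o.threads, err)
}
```

Callers only name the values they want to change, and new options can be added later without touching existing call sites.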

initial Post indices included in challenge does not equal to the initial Post indices included in the atx

We have multiple occurrences of "initial Post indices included in challenge does not equal to the initial Post indices included in the atx" in the logs from different users.

It needs to be debugged and categorized correctly. The user claims that they did NOT reinitialize or change the initialization.
Third_PC_with_4070_withoutRewards(1).zip
spacemesh-log-7f8f332c.txt.2(1).zip

Use structured logging in post

The Initializer can be passed a logger via the WithLogger option. At the moment that logger is assumed to implement the interface shared.Logger. Unfortunately this interface doesn't allow for a structured logger like zap, and it has the additional drawback that we need custom implementations during testing, like here:

post/initialization/initialization_test.go

Lines 28 to 35 in f3ec621

	type testLogger struct {
		shared.Logger

		t *testing.T
	}

	func (l testLogger) Info(msg string, args ...any) { l.t.Logf("\tINFO\t"+msg, args...) }
	func (l testLogger) Debug(msg string, args ...any) { l.t.Logf("\tDEBUG\t"+msg, args...) }

We should remove the interface and instead have a direct dependency on zap for logging in post. This would allow us to test logging more easily with zaptest and would enable structured logging in post.
