parallel101 / course

High-Performance Parallel Programming and Optimization - Course Slides

Home Page: https://space.bilibili.com/263032155

License: Other

C++ 72.80% Shell 0.30% Makefile 0.01% CMake 2.77% C 16.21% HTML 0.08% Python 2.66% Less 1.97% Starlark 0.10% Cuda 3.03% Batchfile 0.01% Java 0.01% Vim Script 0.07% Vue 0.01% CSS 0.01%

cpp17 course slides parallel-computing high-performance-computing cpp

course's Issues

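Lesson 3, slide 54: why does a lambda that captures by [=] occupy 4 bytes?

My understanding is:
1. A function pointer takes 8 bytes.
2. The captured int variable fac takes 4 bytes. So the total should be 12 bytes.

I don't understand this part clearly, so I'd appreciate it if Peng could explain.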

A code detail in mtqueue.hpp

In slides/thread/mtqueue.hpp, the function std::optional try_pop_until calls m_cv_empty.wait_for instead of wait_until.

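A question about static and static inline

Hello Mr. Peng, I'd like to ask about static versus static inline. What I found online says that when applied to functions the two are almost the same. Below is my rough understanding of both:

First, inline now has little to do with optimization. Today, marking a function defined in a header as inline makes its name a weak symbol, so when multiple source files include that header the linker no longer reports a multiple-definition error.
The other point concerns static. inline suits the case where a function's declaration and definition both live in one header. With inline, every source file that includes the header refers to the same function, whereas with static each source file gets its own copy of the function.
- One more note: a static variable inside an inline function is still a single shared variable after multiple files include the header, but inside a static function each source file has its own copy of that variable.

A lot of the open-source code I read on GitHub uses static inline when defining functions in headers. If my understanding above is right, why add static at all? Doesn't that give every source file that includes the header its own copy of the function? Wouldn't plain inline, shared by all source files, be better? Or is there some deeper consideration behind static inline?

Thank you, Mr. Peng!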

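The Thrust bundled with CUDA has no <thrust/universal_vector.h>

08/06_thrust/01/main.cu includes <thrust/universal_vector.h>, but the compiler reports that <thrust/universal_vector.h> cannot be found.
The official Thrust headers "universal_allocator.h", "universal_ptr.h", and "universal_vector.h" are all missing from the CUDA include folder on my machine. Is something wrong with my CUDA installation (installed in 2022), or has the library been updated since? #include <thrust/device_vector.h> and #include <thrust/host_vector.h> work fine.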

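GPU task power consumption

Hello, I have a periodic GPU task. Using nvidia-smi, I found that in the interval after one run finishes and before the next begins, GPU power does not fall back to idle power. What causes this, and how can I fix it? My GPU idles at 29 W; power rises to 120 W while the task runs, stays at about 87 W after it finishes, and rises to 120 W again on the next run.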

try_lock() anomaly

I tried running example 07 on Ubuntu 20.04 and the result looks odd. What is the reason?

#include <cstdio>
#include <mutex>
std::mutex mtx1;
int main() {
    if (mtx1.try_lock())
        printf("succeed\n");
    else
        printf("failed\n");
    if (mtx1.try_lock())
        printf("succeed\n");
    else
        printf("failed\n");
    mtx1.unlock();
    return 0;
}

Both lines of output are succeed; I expected succeed first and then failed.

succeed
succeed

Lesson 10: bug in the multi-level nested sparse data structure code

Test environment:

# ubuntu20.04
# Linux kernel: 5.13.0-48-generic
# gcc 11.0

Q1: 10/03/01.cpp fails to compile

/sparse_data_struct/00.cpp: In instantiation of ‘static void RootGrid<T, Layout>::_write(Node&, int, int, T) [with Node = const HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >; T = char; Layout = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >]’:
/sparse_data_struct/00.cpp:158:22:   required from ‘void RootGrid<T, Layout>::write(int, int, T) const [with T = char; Layout = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >]’
/sparse_data_struct/00.cpp:197:17:   required from here
/sparse_data_struct/00.cpp:152:37: error: passing ‘const HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >’ as ‘this’ argument discards qualifiers [-fpermissive]
  152 |             auto *child = node.touch(x >> node.bitShift, y >> node.bitShift);
      |                           ~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/sparse_data_struct/00.cpp:87:11: note:   in call to ‘Node* HashBlock<Node>::touch(int, int) [with Node = PointerBlock<11, DenseBlock<8, PlaceData<char> > >]’
   87 |     Node *touch(int x, int y) {
      |           ^~~~~
/sparse_data_struct/00.cpp: In instantiation of ‘void DenseBlock<Bshift, Node>::foreach(const Func&) [with Func = RootGrid<char, HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > > >::_foreach<DenseBlock<8, PlaceData<char> >, main()::<lambda(int, int, char&)> >(DenseBlock<8, PlaceData<char> >&, int, int, const main()::<lambda(int, int, char&)>&)::<lambda(int, int, auto:1*)>; int Bshift = 8; Node = PlaceData<char>]’:
/sparse_data_struct/00.cpp:170:32:   required from ‘static void RootGrid<T, Layout>::_foreach(Node&, int, int, const Func&) [with Node = DenseBlock<8, PlaceData<char> >; Func = main()::<lambda(int, int, char&)>; T = char; Layout = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >]’
/sparse_data_struct/00.cpp:171:25:   required from ‘RootGrid<char, HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > > >::_foreach<PointerBlock<11, DenseBlock<8, PlaceData<char> > >, main()::<lambda(int, int, char&)> >(PointerBlock<11, DenseBlock<8, PlaceData<char> > >&, int, int, const main()::<lambda(int, int, char&)>&)::<lambda(int, int, auto:1*)> [with auto:1 = DenseBlock<8, PlaceData<char> >]’
/sparse_data_struct/00.cpp:60:25:   required from ‘void PointerBlock<Bshift, Node>::foreach(const Func&) [with Func = RootGrid<char, HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > > >::_foreach<PointerBlock<11, DenseBlock<8, PlaceData<char> > >, main()::<lambda(int, int, char&)> >(PointerBlock<11, DenseBlock<8, PlaceData<char> > >&, int, int, const main()::<lambda(int, int, char&)>&)::<lambda(int, int, auto:1*)>; int Bshift = 11; Node = DenseBlock<8, PlaceData<char> >]’
/sparse_data_struct/00.cpp:170:32:   required from ‘static void RootGrid<T, Layout>::_foreach(Node&, int, int, const Func&) [with Node = PointerBlock<11, DenseBlock<8, PlaceData<char> > >; Func = main()::<lambda(int, int, char&)>; T = char; Layout = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >]’
/sparse_data_struct/00.cpp:171:25:   required from ‘RootGrid<char, HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > > >::_foreach<HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >, main()::<lambda(int, int, char&)> >(HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >&, int, int, const main()::<lambda(int, int, char&)>&)::<lambda(int, int, auto:1*)> [with auto:1 = PointerBlock<11, DenseBlock<8, PlaceData<char> > >]’
/sparse_data_struct/00.cpp:102:17:   required from ‘void HashBlock<Node>::foreach(const Func&) [with Func = RootGrid<char, HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > > >::_foreach<HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >, main()::<lambda(int, int, char&)> >(HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >&, int, int, const main()::<lambda(int, int, char&)>&)::<lambda(int, int, auto:1*)>; Node = PointerBlock<11, DenseBlock<8, PlaceData<char> > >]’
/sparse_data_struct/00.cpp:170:32:   required from ‘static void RootGrid<T, Layout>::_foreach(Node&, int, int, const Func&) [with Node = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >; Func = main()::<lambda(int, int, char&)>; T = char; Layout = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >]’
/sparse_data_struct/00.cpp:178:17:   required from ‘void RootGrid<T, Layout>::foreach(const Func&) [with Func = main()::<lambda(int, int, char&)>; T = char; Layout = HashBlock<PointerBlock<11, DenseBlock<8, PlaceData<char> > > >]’
/sparse_data_struct/00.cpp:201:15:   required from here
/sparse_data_struct/00.cpp:27:28: error: no match for ‘operator*’ (operand type is ‘PlaceData<char>’)
   27 |                 func(x, y, *m_data[x][y]);
      |

After modifying DenseBlock and HashBlock as follows, it compiles and runs:

template <int Bshift, class Node>
struct DenseBlock {
    static constexpr bool isPlace = false;
    static constexpr bool bitShift = Bshift;

    static constexpr int B = 1 << Bshift;
    static constexpr int Bmask = B - 1;

    Node m_data[B][B];

    Node *fetch(int x, int y) const {
        return &m_data[x & Bmask][y & Bmask];
    }

    Node *touch(int x, int y) {
        return &m_data[x & Bmask][y & Bmask];
    }
    
    // change_00  : * -> &
    template <class Func>
    void foreach(Func const &func) {
        for (int x = 0; x < B; x++) {
            for (int y = 0; y < B; y++) {
                func(x, y, &m_data[x][y]);
            }
        }
    }
};

template <class Node>
struct HashBlock {
    static constexpr bool isPlace = false;
    static constexpr bool bitShift = 0;

    struct MyHash {
        std::size_t operator()(std::tuple<int, int> const &key) const {
            auto const &[x, y] = key;
            return (x * 2718281828) ^ (y * 3141592653);
        }
    };


    //change_01 <Node> -> std::unique_ptr<Node>
    std::unordered_map<std::tuple<int, int>, std::unique_ptr<Node>, MyHash> m_data;

    // change_02  : it->second.get();
    Node *fetch(int x, int y) const {
        auto it = m_data.find(std::make_tuple(x, y));
        if (it == m_data.end())
            return nullptr;
        return it->second.get();
    }

    Node *touch(int x, int y) {
        auto it = m_data.find(std::make_tuple(x, y));
        if (it == m_data.end()) {
            std::unique_ptr<Node> ptr = std::make_unique<Node>();
            auto rawptr = ptr.get();
            m_data.emplace(std::make_tuple(x, y), std::move(ptr));
            return rawptr;
        }
        return it->second.get();
    }
    
    //change_03 &block -> unique_node.get()  (unique_Node = block)
    template <class Func>
    void foreach(Func const &func) {
        for (auto &[key, unique_Node]: m_data) {
            auto &[x, y] = key;
            func(x, y, unique_Node.get());
        }
    }
};

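At run time, memory usage keeps growing. With N = (2 * 2), valgrind and massif-visualizer show that the allocated space is as large as 96 MB. I may have made a mistake in my changes, so I'd appreciate it if Mr. Peng could take a look.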

Q2: 10/04/06.cpp memory explodes at run time

To make it easy to run on my own platform, I changed the foreach in main to match the earlier code.

    int count = 0;
    a->foreach([&] (int x, int y, char &value) {
        if (value != 0) {
            count++;
        }
    });
    printf("count: %d\n", count);

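Case 1: N (512 * 512)

Result: all 64 GB of RAM was used up, ending in a core dump.

The visualization for this case was captured by pressing Ctrl+C once the program had consumed about half of the memory.

Case 2: N (32 * 32)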

count: 109
main: 14.5131s

Memory visualization: [image]

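Finally, I'd like to ask Mr. Peng: is this code correct as given? Why does it keep allocating memory until it runs out completely?

10/06/04.cpp uses too much memory to run

I first ran it on WSL; Vmmem consumed all 32 GB of memory and the machine froze.
Then I switched to a small server, where CPU and memory were also nearly maxed out, and the program was killed outright.

Sob, I'm too much of a beginner and have no idea why it won't run.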

Using iterators on std::string to remove \r does not remove everything

	std::string src = "\r\nabc\r\r\r\r\r\r\r\r123456\nABCDEF\r\n\r\n\r\r\r\r";
	std::remove_if(src.begin(), src.end(), [](char c) {
		return c == '\r';
	});

	auto size = src.size();


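	// I expected: \nabc123456\nABCDEF\n\n
	// Actual: src = "\nabc123456\nABCDEF\n\n\nABCDEF\r\n\r\n\r\r\r\r"

I've been studying Mr. Peng's course recently and learned that string also has iterators, so I picked the remove_if function to try it out.

Could Mr. Peng please explain why this happens?

My machine runs 64-bit Windows 10 with Visual Studio 2022 Preview.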

Question about a bug in Lesson 8 (CUDA parallelism), section 10_stencil

Environment: ubuntu22.04
gcc: 9
Running it produces this error:
/usr/include/c++/9/bits/stl_vector.h(130): error: no instance of constructor "CudaAllocator::CudaAllocator [with T=float]" matches the argument list
detected during instantiation of "std::_Vector_base<_Tp, _Alloc>::_Vector_impl::_Vector_impl() [with _Tp=float, _Alloc=CudaAllocator]"
(337): here

Template specialization problem when printing with the print function

#include "print.h"
template<typename T>
class TypeToID{
public:
    static int const  ID=-1;
};
template<> class TypeToID<void*>{
public:
   static  int const  ID=1;
};
int main(){


    print(TypeToID<void *>::ID);

    return 0;

}

Linking fails with an undefined symbol:
.rdata$.refptr._ZN8TypeToIDIPvE2IDE[.refptr._ZN8TypeToIDIPvE2IDE]+0x0): undefined reference to `TypeToID<void*>::ID'

bug : course/06/01_for/07/main.cpp

The vector backing the 3D array should be initialized with size n * n * n

std::vector a(n * n);  ->  std::vector a(n * n * n);

error: missing template arguments before ‘grd’

course/05/03_mutex/04/main.cpp

Line 12 in fe22cd6

std::unique_lock grd(mtx);

course 09 compile error: nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified

OS: Win11
cmake version 3.26.4
CUDA version 12.1

Build steps:

Install openvdb with vcpkg
Configure: PS D:\Projects\parallel101\course\09\01_texture\08> cmake -B .\build -DCMAKE_TOOLCHAIN_FILE=D:\Projects\parallel101\course\09\01_texture\08\vcpkg\scripts\buildsystems\vcpkg.cmake
Output:
`-- Selecting Windows SDK version 10.0.22000.0 to target Windows 10.0.22621.
COMPONENT = openvdb
-- Found OpenVDB: D:/Projects/parallel101/course/09/01_texture/08/vcpkg/installed/x64-windows/include (found version "10.0.0") found components: openvdb
-- OpenVDB ABI Version: 10
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - not found
-- Found Threads: TRUE
CMake Warning at C:/Program Files/CMake/share/cmake-3.26/Modules/FindBoost.cmake:1384 (message):
New Boost version may have incorrect or missing dependencies and imported
targets
Call Stack (most recent call first):
C:/Program Files/CMake/share/cmake-3.26/Modules/FindBoost.cmake:1508 (_Boost_COMPONENT_DEPENDENCIES)
C:/Program Files/CMake/share/cmake-3.26/Modules/FindBoost.cmake:2119 (_Boost_MISSING_DEPENDENCIES)
vcpkg/installed/x64-windows/share/boost/vcpkg-cmake-wrapper.cmake:11 (_find_package)
vcpkg/scripts/buildsystems/vcpkg.cmake:813 (include)
vcpkg/installed/x64-windows/share/openvdb/FindOpenVDB.cmake:504 (find_package)
vcpkg/installed/x64-windows/share/openvdb/vcpkg-cmake-wrapper.cmake:10 (_find_package)
vcpkg/scripts/buildsystems/vcpkg.cmake:813 (include)
CMakeLists.txt:16 (find_package)

CMake Warning at C:/Program Files/CMake/share/cmake-3.26/Modules/FindBoost.cmake:1384 (message):
New Boost version may have incorrect or missing dependencies and imported
targets
Call Stack (most recent call first):
C:/Program Files/CMake/share/cmake-3.26/Modules/FindBoost.cmake:1508 (_Boost_COMPONENT_DEPENDENCIES)
C:/Program Files/CMake/share/cmake-3.26/Modules/FindBoost.cmake:2119 (_Boost_MISSING_DEPENDENCIES)
vcpkg/installed/x64-windows/share/boost/vcpkg-cmake-wrapper.cmake:11 (_find_package)
vcpkg/scripts/buildsystems/vcpkg.cmake:813 (include)
vcpkg/installed/x64-windows/share/openvdb/FindOpenVDB.cmake:504 (find_package)
vcpkg/installed/x64-windows/share/openvdb/vcpkg-cmake-wrapper.cmake:10 (_find_package)
vcpkg/scripts/buildsystems/vcpkg.cmake:813 (include)
CMakeLists.txt:16 (find_package)

-- Found Boost: D:/Projects/parallel101/course/09/01_texture/08/vcpkg/installed/x64-windows/include (found version "1.83.0") found components: iostreams regex
-- Found ZLIB: optimized;D:/Projects/parallel101/course/09/01_texture/08/vcpkg/installed/x64-windows/lib/zlib.lib;debug;D:/Projects/parallel101/course/09/01_texture/08/vcpkg/installed/x64-windows/debug/lib/zlibd.lib (found version "1.3.0")
-- Found OpenVDB 10.0.0 at D:/Projects/parallel101/course/09/01_texture/08/vcpkg/installed/x64-windows/lib/openvdb.lib
-- Configuring done (7.6s)
-- Generating done (0.1s)
CMake Warning:
Manually-specified variables were not used by the project:

CMAKE_TOOLCHAIN_FILE

-- Build files have been written to: D:/Projects/parallel101/course/09/01_texture/08/build`

Build: PS D:\Projects\parallel101\course\09\01_texture\08> cmake --build .\build
Output:
`MSBuild version 17.4.3+7e646be43 for .NET Framework
Checking Build System
Building Custom Rule D:/Projects/parallel101/course/09/01_texture/08/CMakeLists.txt
Compiling CUDA source file ..\main.cu...

D:\Projects\parallel101\course\09\01_texture\08\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsof
t Visual Studio\2022\Professional\VC\Tools\MSVC\14.34.31933\bin\HostX64\x64" -x cu -ID:\Projects\parallel101\course\09\01_texture\08. -ID:\Projects\parallel101\course\09\01_
texture\08....\include -I"D:\Projects\parallel101\course\09\01_texture\08\vcpkg\installed\x64-windows\include" -I"D:\Projects\parallel101\course\09\01_texture\08\vcpkg\instal
led\x64-windows\include\Imath" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\include" --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart s
tatic -std=c++17 --generate-code=arch=compute_52,code=[compute_52,sm_52] --extended-lambda --expt-relaxed-constexpr /EHsc -Xcompiler="/EHsc -Ob0 -Zi" -g -D_WINDOWS -DOPENVDB_D
LL -D_WIN32 -DNOMINMAX -DOPENVDB_ABI_VERSION_NUMBER=10 -DOPENVDB_USE_DELAYED_LOADING -DIMATH_DLL -DTBB_USE_DEBUG -D"CMAKE_INTDIR="Debug"" -D_MBCS -DWIN32 -D_WINDOWS -DOPENVDB
_DLL -D_WIN32 -DNOMINMAX -DOPENVDB_ABI_VERSION_NUMBER=10 -DOPENVDB_USE_DELAYED_LOADING -DIMATH_DLL -DTBB_USE_DEBUG -D"CMAKE_INTDIR="Debug"" -Xcompiler "/EHsc /W1 /nologo /Od
/FS /Zi /RTC1 /MDd " -Xcompiler "/Fdmain.dir\Debug\vc143.pdb" -o main.dir\Debug\main.obj "D:\Projects\parallel101\course\09\01_texture\08\main.cu"
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
C:\Program Files\Microsoft Visual Studio\2022\Professional\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.1.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU

Computing Toolkit\CUDA\v12.1\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.34.31933\bin\HostX64\x64" -x cu
-ID:\Projects\parallel101\course\09\01_texture\08. -ID:\Projects\parallel101\course\09\01_texture\08....\include -I"D:\Projects\parallel101\course\09\01_texture\08\vcpkg\insta
lled\x64-windows\include" -I"D:\Projects\parallel101\course\09\01_texture\08\vcpkg\installed\x64-windows\include\Imath" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.
1\include" --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -std=c++17 --generate-code=arch=compute_52,code=[compute_52,sm_52] --extended-lambda
--expt-relaxed-constexpr /EHsc -Xcompiler="/EHsc -Ob0 -Zi" -g -D_WINDOWS -DOPENVDB_DLL -D_WIN32 -DNOMINMAX -DOPENVDB_ABI_VERSION_NUMBER=10 -DOPENVDB_USE_DELAYED_LOADING -DIMATH_
DLL -DTBB_USE_DEBUG -D"CMAKE_INTDIR="Debug"" -D_MBCS -DWIN32 -D_WINDOWS -DOPENVDB_DLL -D_WIN32 -DNOMINMAX -DOPENVDB_ABI_VERSION_NUMBER=10 -DOPENVDB_USE_DELAYED_LOADING -DIMATH_
DLL -DTBB_USE_DEBUG -D"CMAKE_INTDIR="Debug"" -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd " -Xcompiler "/Fdmain.dir\Debug\vc143.pdb" -o main.dir\Debug\main.obj "D:\Proj
ects\parallel101\course\09\01_texture\08\main.cu"" exited with code 1. [D:\Projects\parallel101\course\09\01_texture\08\build\main.vcxproj]`

Thanks!

print outputs an unexpected value for unsigned long long int

unsigned long long int i1 = -1;
print(i1);
Expected:
-1
Actual:
18446744073709551615

Bbbbbbbbbb

CudaAllocator error

Environment: VS2022, CUDA 12.2

The code is from Lesson 08, section 04

course/08/04_sugar/01/main.cu

Line 7 in 2d30da6

struct CudaAllocator {

Error message

[build] MSBuild version 17.4.0+18d5aef85 for .NET Framework
[build]   Compiling CUDA source file ..\src\allocator.cu...
[build]   
[build]   C:\Dev\mgxpbd\build>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\bin\HostX64\x64" -x cu   -IC:\Dev\mgxpbd\include -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\include"     --keep-dir x64\Debug  -maxrregcount=0   --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] -std=c++17 -Xcompiler="/EHsc -Ob0 -Zi" -g  -D_WINDOWS -D"CMAKE_INTDIR=\"Debug\"" -D_MBCS -D"CMAKE_INTDIR=\"Debug\"" -Xcompiler "/EHsc /W1 /nologo /Od /FS /Zi /RTC1 /MDd " -Xcompiler "/Fdallocator.dir\Debug\vc143.pdb" -o allocator.dir\Debug\allocator.obj "C:\Dev\mgxpbd\src\allocator.cu" 
[build] C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\include\vector(2125): error : no suitable user-defined conversion from "CudaAllocator<int>" to "CudaAllocator<std::_Container_proxy>" exists [C:\Dev\mgxpbd\build\allocator.vcxproj]
[build]             auto&& _Alproxy = static_cast<_Rebind_alloc_t<_Alty, _Container_proxy>>(_Al);
[build]                                                                                     ^
[build]             detected during:
[build]               instantiation of "void std::vector<_Ty, _Alloc>::_Construct_n(std::vector<_Ty, _Alloc>::size_type, _Valty &&...) [with _Ty=int, _Alloc=CudaAllocator<int>, _Valty=<>]" at line 683
[build]               instantiation of "std::vector<_Ty, _Alloc>::vector(std::vector<_Ty, _Alloc>::size_type, const _Alloc &) [with _Ty=int, _Alloc=CudaAllocator<int>]" at line 30 of C:\Dev\mgxpbd\src\allocator.cu
[build]   
[build] C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\include\vector(832): error : no suitable user-defined conversion from "CudaAllocator<int>" to "CudaAllocator<std::_Container_proxy>" exists [C:\Dev\mgxpbd\build\allocator.vcxproj]
[build]             auto&& _Alproxy = static_cast<_Rebind_alloc_t<_Alty, _Container_proxy>>(_Getal());
[build]                                                                                     ^
[build]             detected during instantiation of "std::vector<_Ty, _Alloc>::~vector() noexcept [with _Ty=int, _Alloc=CudaAllocator<int>]" at line 30 of C:\Dev\mgxpbd\src\allocator.cu
[build]   
[build] C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\include\vector(833): error : no instance of function template "std::_Delete_plain_internal" matches the argument list [C:\Dev\mgxpbd\build\allocator.vcxproj]
[build]               argument types are: (<error-type>, std::_Container_proxy *)
[build]             _Delete_plain_internal(_Alproxy, ::std:: exchange(_Mypair._Myval2._Myproxy, nullptr));
[build]             ^
[build]             detected during instantiation of "std::vector<_Ty, _Alloc>::~vector() noexcept [with _Ty=int, _Alloc=CudaAllocator<int>]" at line 30 of C:\Dev\mgxpbd\src\allocator.cu
[build]   
[build]   3 errors detected in the compilation of "C:/Dev/mgxpbd/src/allocator.cu".
[build]   allocator.cu
[build] C:\Program Files\Microsoft Visual Studio\2022\Community\MSBuild\Microsoft\VC\v170\BuildCustomizations\CUDA 12.2.targets(799,9): error MSB3721: The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.2\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\bin\HostX64\x64" -x cu   -IC:\Dev\mgxpbd\include

A question: how does CMake generate the XXXTargets.cmake file?

How does CMake generate the XXXTargets.cmake file?

Error: no suitable user-defined conversion from "CudaAllocator<int>" to "CudaAllocator<std::_Container_proxy>" exists

After generating a Visual Studio 2022 project under 04_sugar\01 of Lesson 08, I get the following error:
"no suitable user-defined conversion from "CudaAllocator<int>" to "CudaAllocator<std::_Container_proxy>" exists". How should I fix it?

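Will the reference solutions for the homework be published later?

Does TBB support Macs with the M1 chip?

Does the multithreading framework from Intel run on the ARM-based M1 Mac?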

std::unique_ptr can be deep-copied as a workaround by dereferencing it

course/02/19/main.cpp

Lines 32 to 35 in c8787cf

 C *raw_p = p.get(); 

 func(std::move(p)); 

 raw_p->do_something(); // works: raw_p still holds the pointer from before the transfer

The approach in the slides is complicated and error-prone. We can use the safer approach below instead:

//C* raw_p = p.get();	// no need
func(std::make_unique<C>(*p));	// deep copy
p->do_something();	// OK, run normally

Although std::unique_ptr deletes its copy constructor and copy assignment operator, we can still copy the pointee as a workaround by dereferencing the pointer.

A deep-copy example:

std::unique_ptr<std::string> up1(std::make_unique<std::string>("Good morning"));

// copy construct!
std::unique_ptr<std::string> up2(std::make_unique<std::string>(*up1));
// safe copy construct!
std::unique_ptr<std::string> up3(up1 ? std::make_unique<std::string>(*up1) : nullptr);
// copy assignment!
up2 = std::make_unique<std::string>(*up1);
// safe copy assignment!
up3 = up1 ? std::make_unique<std::string>(*up1) : nullptr;

Other supporting examples:

Google tensorflow : unique_ptr - in copy assignment operator
Microsoft terminal : unique_ptr - in copy constructor
Effective Modern C++, Item 22: the implementations of the copy constructor and copy assignment operator of class Widget; see EffectiveModernCppChinese/item22.md · GitHub

The enum-name code does not work with clang + MSVC

The following code does not work with clang 15 + MSVC (VS2022):

#if defined(_MSC_VER)
			size_t pos = s.find(',');
			pos += 1;
			size_t pos2 = s.find('>', pos);
#else

Before this block runs, the value of s is: "const char *__cdecl function_name [T = Enum, N = Enum::Enumerator]"
I changed it to the following, which works:

#if defined(_MSC_VER) && !defined(__clang__)
			size_t pos = s.find(',');
			pos += 1;
			size_t pos2 = s.find('>', pos);
#elif defined(__clang__)
            size_t pos = s.find("N = ");
            pos += 4;
            size_t pos2 = s.find(']', pos);
#else
			size_t pos = s.find("N = ");
			pos += 4;
			size_t pos2 = s.find_first_of(";]", pos);
#endif

Chinese subtitle proofreading

Division of work

(to be subdivided further once more people join)

p01 = 80min = 5min * 16

00-05
05-10
10-15

p02 = 135min = 5min * 27
p03 = 110min = 5min * 22
p04 = 112min ≈ 5min * 22

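My current estimate is that 5 minutes of video corresponds to roughly 30 to 40 sentences. Since I'm well practiced, I need about half an hour to proofread that much. The time is for reference only.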

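Question: in the example that uses weak_ptr to break a circular reference, are two lines of code in the wrong order?

@archibate

Are the following two lines in the wrong order?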

course/02/24/3.cpp

Lines 13 to 14 in 3940bba

 parent->m_child = std::move(child); // transfer ownership of child to parent

 child->m_parent = parent.get();

Shouldn't the raw pointer be set first, and ownership transferred with std::move afterwards, like this:

 child->m_parent = parent.get();
 parent->m_child = std::move(child);  // transfer ownership of child to parent

CudaAllocator template error

In my code, under VS2022, every use of CudaAllocator fails to compile with: "no instance of constructor "CudaAllocator::CudaAllocator [with T=int]" matches the argument list main". I don't know why.

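Question about the results of 07/03_prefetch/06

Hi Mr. Peng. I have some questions about the results of the 07/03_prefetch/06 example and would welcome corrections.
My platform is Intel i5-13500, Ubuntu 24.04, gcc version 13.2.0.
When I run the 07/03_prefetch/06 example,
I get results similar to those in the course only after removing #pragma omp parallel for from it. I'm not sure whether #pragma omp parallel for applies other optimizations besides parallelization.

Results of the original version

The results show that BM_write_stream_then_read and BM_write_streamed take about the same time, so the read seems to have no effect on the streaming store instructions.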

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_read                        25228152 ns     18180668 ns           38
BM_write                       32696238 ns     25309548 ns           33
BM_write_streamed              19530899 ns     17132181 ns           36
BM_write_stream_then_read      19586335 ns     17525509 ns           43
BM_write_streamed_ps           19550735 ns     14485110 ns           39
BM_write_streamed_ps_skipped   37094026 ns     26238143 ns           26
BM_read_and_write              36829027 ns     33520956 ns           22

Results with #pragma omp parallel for removed

The results show that BM_write_stream_then_read takes noticeably longer than BM_write_streamed.

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_read                        38213301 ns     38207623 ns           19
BM_write                       52209723 ns     52203705 ns           13
BM_write_streamed              34738316 ns     34735390 ns           20
BM_write_stream_then_read      40930259 ns     40927256 ns           17
BM_write_streamed_ps           17725541 ns     17724305 ns           36
BM_write_streamed_ps_skipped   36891533 ns     36889477 ns           19
BM_read_and_write              44972351 ns     44969916 ns           12
