# 950 Basic Matmul TLA Example
catlass: this project is the CANN operator template library, providing high-performance matrix multiplication and related fused-operator template samples on NPU. Project address: https://gitcode.com/cann/catlass

Note: The community package does not currently support 950 capabilities. Stay tuned for a future version that supports them.

## Code Organization

```
├── 43_ascend950_basic_matmul
│   ├── CMakeLists.txt         # CMake build file
│   ├── README.md
│   └── basic_matmul_tla.cpp   # Main file
```

## Usage Example

- After obtaining the code, build the corresponding operator executable. See Template Library Quick Start. This case is a 950 operator, so `-DCATLASS_ARCH=3510` must be added during the build.
- Run the operator.

```bash
# Build the specified case
bash scripts/build.sh 43_ascend950_basic_matmul -DCATLASS_ARCH=3510
cd output/bin
# Executable file name | matrix m axis | n axis | k axis | Device ID
# Device ID is optional and defaults to 0
./43_ascend950_basic_matmul 256 512 1024 0
```

The execution result is as follows, indicating that the precision comparison succeeds.

```
Compare success.
```

## Usage Notes

The DispatchPolicy `MmadPingpong` used by BasicMatmul by default supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| `ArchTag` | None | Specifies the architecture model |
| `enableUnitFlag` | false | Specifies whether to enable UnitFlag. It must be set to false when L0C multi-buffering is enabled |
| `useHF32` | false | Specifies whether to enable HF32. Only the float type is supported |
| `l0CStages` | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| `enableL1Resident` | false | Specifies whether to enable L1 residency |
| `l1AStages` | 2 | Number of buffers for loading matrix A on L1 |
| `l1BStages` | 2 | Number of buffers for loading matrix B on L1 |
| `l0AStages` | 2 | Number of buffers for loading matrix A on L0 |
| `l0BStages` | 2 | Number of buffers for loading matrix B on L0 |

Assume the matrix shape is `M N K` and the tile size on L1 is `m1 n1 k1`. The number of tiles in the M direction is `mTiles = CeilDiv(M, m1)`, the number of tiles in the N direction is `nTiles = CeilDiv(N, n1)`, and the total number of tasks is `taskBlocks = mTiles * nTiles`.
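
The tile arithmetic above can be sketched in a few lines of C++. This is only an illustration, not code from the CATLASS sources: `TileCounts` and `CountTiles` are hypothetical names, `CeilDiv` is written here as plain round-up integer division, and the L1 tile size used in the usage note (`m1 = 128`, `n1 = 256`) is an assumed value chosen for the example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for illustration; not part of CATLASS.
struct TileCounts {
    uint32_t mTiles;      // number of L1 tiles along M
    uint32_t nTiles;      // number of L1 tiles along N
    uint32_t taskBlocks;  // total number of tasks
};

// Round-up integer division: the smallest q such that q * divisor >= value.
inline uint32_t CeilDiv(uint32_t value, uint32_t divisor)
{
    assert(divisor > 0);
    return (value + divisor - 1) / divisor;
}

// Count the tiles of an M x N output for an assumed L1 tile of m1 x n1.
inline TileCounts CountTiles(uint32_t M, uint32_t N, uint32_t m1, uint32_t n1)
{
    TileCounts counts{};
    counts.mTiles = CeilDiv(M, m1);
    counts.nTiles = CeilDiv(N, n1);
    counts.taskBlocks = counts.mTiles * counts.nTiles;
    return counts;
}
```

For the shape used in the run command (`M = 256`, `N = 512`) and the assumed tile size, `CountTiles(256, 512, 128, 256)` yields 2 tiles in each direction and 4 task blocks.
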
`enableL1Resident` can be enabled in the following two cases:

- `mTiles` 1, `nTiles` `CoreNum`, and `K` 2 * `k1`. In this case, `l0CStages=2` can also be set (`enableUnitFlag` must be disabled). If there is not enough space and `l0CStages=2` cannot be set, set `n1` to half of its original value.
- `nTiles` 1, `mTiles` `CoreNum`, and `K` 2 * `k1`. In this case, `l0CStages=2` can also be set (`enableUnitFlag` must be disabled). If there is not enough space and `l0CStages=2` cannot be set, set `m1` to half of its original value.

BasicMatmul also supports the DispatchPolicy `MmadPreloadAsyncWithCallback`, which supports the following template parameters:

| Template Parameter | Default Value | Parameter Description |
| --- | --- | --- |
| `ArchTag` | None | Specifies the architecture model |
| `preloadStages` | None | Specifies the number of preloads |
| `l1AStages` | 2 | Number of buffers for loading matrix A on L1 |
| `l1BStages` | 2 | Number of buffers for loading matrix B on L1 |
| `l0AStages` | 2 | Number of buffers for loading matrix A on L0 |
| `l0BStages` | 2 | Number of buffers for loading matrix B on L0 |
| `l0CStages` | 1 | Specifies the number of L0C buffers. Set it to 2 to enable L0C double buffering |
| `enableUnitFlag` | false | Specifies whether to enable UnitFlag. It must be set to false when L0C multi-buffering is enabled |
| `enableShuffleK` | false | Specifies whether to enable K-direction staggered reading |
| `useHF32` | false | Specifies whether to enable HF32. Only the float type is supported |
| `enableL1Resident` | false | Specifies whether to enable L1 residency |

Compared with `MmadPingpong`, `MmadPreloadAsyncWithCallback` has two more template parameters. The first is `preloadStages`, which specifies the number of preloads and is usually set to 1. With a value of 1, the first loop only loads data and performs no Matmul computation. The second loop first loads its own data and then completes the Matmul computation of the previous loop, and so on. After the final loop ends, one additional Matmul computation is performed. The benefit is that the data required for the current Matmul computation has already been moved in during the previous loop.
Therefore, instruction issue is brought forward, which reduces the performance loss caused by instruction issue latency.

The second parameter is `enableShuffleK`. It is mainly used to avoid the bandwidth loss caused by same-address access conflicts. The main principle is to stagger the data read addresses of the cores. This parameter does not need to be enabled on 950.

Compared with `MmadPingpong`, `MmadPreloadAsyncWithCallback` has more optimization points, but its logic is also more complex and it has higher Scalar overhead. Decide whether to use it based on the scenario, especially for small shapes.

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
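
The loop order described for `preloadStages` can be made concrete with a small host-side model. This is a minimal sketch under stated assumptions, not the CATLASS implementation: `PreloadSchedule` is a hypothetical function, and the strings `"load i"` and `"mmad i"` merely stand in for moving the data of loop `i` and running its Matmul computation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of a preloaded loop; names are illustrative only.
// Returns the order in which loads and Matmul computations are issued.
inline std::vector<std::string> PreloadSchedule(uint32_t loops, uint32_t preloadStages)
{
    assert(preloadStages >= 1);
    std::vector<std::string> events;

    for (uint32_t i = 0; i < loops; ++i) {
        // Each loop first loads its own data.
        events.push_back("load " + std::to_string(i));
        // Then it computes the loop whose data was loaded preloadStages loops ago.
        if (i >= preloadStages) {
            events.push_back("mmad " + std::to_string(i - preloadStages));
        }
    }

    // After the final loop, compute whatever has been loaded but not yet computed.
    uint32_t firstPending = loops > preloadStages ? loops - preloadStages : 0;
    for (uint32_t i = firstPending; i < loops; ++i) {
        events.push_back("mmad " + std::to_string(i));
    }
    return events;
}
```

With three loops and `preloadStages = 1`, `PreloadSchedule(3, 1)` produces `load 0`, `load 1`, `mmad 0`, `load 2`, `mmad 1`, `mmad 2`: the first loop only loads, each later loop loads before computing the previous one, and one extra computation runs after the last loop.
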
