furiosa-ai / device-api
APIs that offer NPU device information and allow controlling the devices
License: Apache License 2.0
I suspect that this crate is entangled with the async keyword, which is arguably rarely needed in its use cases (but we have to manage async and blocking at the same time).
See also #31. (If some fields of the Device struct are to be fetched lazily, should they be served in an async way?)
For example, the heartbeat value in device.rs comes from the device_info structure. The heartbeat value is updated every second, but the device_info structure is not updated once created, so the correct heartbeat value of the device cannot be obtained. I think it should read the file and return the value when it is requested, as Fetcher in hwmon.rs does.
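The difference between a value captured once and a value read on request can be sketched in Python as follows. This is only an illustration of the idea: SnapshotDevice, LiveDevice, and the temporary heartbeat file are hypothetical stand-ins, not types or paths from device-api.

```python
import tempfile
from pathlib import Path
from typing import Tuple


class SnapshotDevice:
    """Reads the heartbeat once, at construction time (the behaviour described above)."""

    def __init__(self, heartbeat_path: Path) -> None:
        self.heartbeat = int(heartbeat_path.read_text().strip())


class LiveDevice:
    """Re-reads the heartbeat file every time the value is requested."""

    def __init__(self, heartbeat_path: Path) -> None:
        self._heartbeat_path = heartbeat_path

    @property
    def heartbeat(self) -> int:
        return int(self._heartbeat_path.read_text().strip())


def demo() -> Tuple[int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "heartbeat"  # hypothetical stand-in for the real file
        path.write_text("100\n")
        snapshot, live = SnapshotDevice(path), LiveDevice(path)
        path.write_text("101\n")  # simulate the periodic update
        return snapshot.heartbeat, live.heartbeat
```

Here demo() returns the stale value from the snapshot object and the updated value from the object that reads on request.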
Many ML engineers are more familiar with Python than Rust, so a device-api Python binding is required. However, because of Rust's generic types in DeviceConfigBuilder and EnvBuilder, it is not trivial to provide exactly the same interface as Rust.
My points are these.
DeviceConfigBuilder and EnvBuilder from Python user
DeviceConfig
As defined here, I suggest a constructor for DeviceConfig with three arguments (arch, mode, count) instead of using DeviceConfigBuilder to build a DeviceConfig object in Python. I also suggest making the from_env function return a DeviceConfig object directly.
Users may think of the DeviceConfig class as being defined as below. They do not need to know anything about DeviceConfigBuilder and EnvBuilder in Rust.
from furiosa_device_python import Arch, DeviceMode
from dataclasses import dataclass

@dataclass
class DeviceConfig:
    # default DeviceConfig is <Warboy, Fusion, count=1>
    # https://github.com/furiosa-ai/device-api/blob/main/src/config/mod.rs#L88
    arch: Arch = Arch.Warboy
    mode: DeviceMode = DeviceMode.Fusion
    count: int = 1

    @classmethod
    def from_env(cls, env: str):
        return cls(...)

    @classmethod
    def from_str(cls, str: str):
        return cls(...)
Users can create DeviceConfig objects like this.
from furiosa_device_python import Arch, DeviceMode, DeviceConfig
# DeviceConfig::warboy().build()
config_default = DeviceConfig()
# DeviceConfig::warboy().multicore().build()
config_multicore = DeviceConfig(mode=DeviceMode.MultiCore)
# DeviceConfig::warboy().single().count(2)
config_single_count2 = DeviceConfig(mode=DeviceMode.Single, count=2)
# DeviceConfig::from_env("SOME_OTHER_ENV_KEY").build();
config_from_env = DeviceConfig.from_env("SOME_OTHER_ENV_KEY")
# DeviceConfig::from_str("warboy(2)*2")
config_from_str = DeviceConfig.from_str("warboy(2)*2")
In this way, every DeviceConfig that can be created in Rust can also be created in Python, and I think we can provide a more intuitive and easier interface to Python users.
Let's assume 2 warboy NPUs (namely, npu0 and npu1) with one core of npu1 occupied.
(npu0) {0: Available, 1: Available}
(npu1) {0: Occupied("test"), 1: Available}
If someone is looking for an available single-core devfile, it would be preferable for find_devices(..) to return npu1pe1.
But the current implementation returns npu0pe0 since it does not consider fragmentation.
[DeviceFile { device_index: 0, core_indices: [0], path: "/root/src/device-api/test_data/test-0/dev/npu0pe0", mode: Single }]
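One possible way to take fragmentation into account is a best-fit choice: prefer a free core on the NPU that has the fewest free cores left, so that fully free NPUs stay available for fused or multi-core requests. The sketch below models the scenario above in Python. The pick_single_core function and the dictionary layout are hypothetical and do not reflect how find_devices is implemented.

```python
from typing import Dict, Optional, Tuple

# Core state per NPU: True means available, False means occupied.
CoreMap = Dict[int, Dict[int, bool]]


def pick_single_core(devices: CoreMap) -> Optional[Tuple[int, int]]:
    """Return (npu_index, core_index) of a free core on the NPU with the fewest free cores."""
    candidates = []
    for npu, cores in devices.items():
        free = sorted(core for core, available in cores.items() if available)
        if free:
            # Sort key: fewer free cores first, then lower NPU index for determinism.
            candidates.append((len(free), npu, free[0]))
    if not candidates:
        return None
    _, npu, core = min(candidates)
    return npu, core


# The scenario described above: npu0 is fully free, npu1 has core 0 occupied.
state = {
    0: {0: True, 1: True},
    1: {0: False, 1: True},
}
print(pick_single_core(state))  # (1, 1), i.e. npu1pe1
```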
warboy-b0 is the production chip, so warboy is the most common way for users to specify the NPU device in the production SDK. Therefore, we need to change warboy so that it actually refers to warboy-b0, and we need to add warboy-a0 to refer to the old revision.
Some components, such as Toggle, are exposed but not public. I think they should either be made visible to users or made private before release.
>>> config = DeviceConfig.from_str("warboy(2)*1,npu:0:0")
>>> dev_files = find_device_files(config)
>>> print(dev_files)
['npu0pe0-1', 'npu0pe0']
Even though we open only a single PE, the remaining single PE also enters the OCCUPIED state.
>>> config = DeviceConfig(arch=Arch.Warboy, mode=DeviceMode.Single, count=2)
>>> find_device_files(config)
[npu0pe0, npu0pe1]
>>> config = DeviceConfig(arch=Arch.Warboy, mode=DeviceMode.Single, count=1)
>>> find_device_files(config)
[npu0pe0]
>>> pe0 = open("/dev/npu0pe0")
>>> find_device_files(config)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Device warboy(1)*1 not found
Consider the following Python fragment (but the issue itself is unrelated to Python bindings):
import os, asyncio
from furiosa_device import *

async def main():
    fd = os.open("/dev/npu1pe0-1", os.O_RDWR)
    try:
        files = await find_device_files(DeviceConfig.from_str("npu:1:0-1"))
        print(files)
    finally:
        os.close(fd)

asyncio.run(main())
This will fail, as expected when there is only a single NPU with two PEs. However, the following error message:
RuntimeError: Device npu:1:0-1 not found
...is misleading, as we do have npu:1:0-1; we just cannot open it. It should read something like this instead (the actual exception type is subject to change):
RuntimeError: Device npu:1:0-1 found but still in use
The original message should remain, though, when the device is outright non-existent:
RuntimeError: Device npu:1:0-2 not found
A corresponding enum variant should also be added to DeviceError.
Device busy can happen in some furiosa-runtime test cases when the runtime fails to initialize a new session for whatever reason, because the session needs to open the device file and the error might be reported before the device file has been actually closed and available for use again.
While this particular issue can also be "fixed" by delaying the error reporting until the device file is known to be closed, it feels wrong because the runtime may be unable to close the device file, and that case would be indistinguishable from the closed case. In other words, the caller has no actual guarantee anyway (unless the runtime doesn't return, which would be absurd). So the session initialization should retry for a while instead, and it needs to distinguish device busy from other errors in order to avoid useless retries (e.g. on an environment variable error).
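The retry behaviour argued for above might look like the following Python sketch. It assumes that "busy" and "not found" are reported as distinct exception types. DeviceBusy, DeviceNotFound, and open_with_retry are hypothetical names for illustration, not part of furiosa_device.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DeviceNotFound(RuntimeError):
    """Hypothetical error: the requested device does not exist."""


class DeviceBusy(RuntimeError):
    """Hypothetical error: the device exists but is still in use."""


def open_with_retry(open_device: Callable[[], T], attempts: int = 5, delay: float = 0.1) -> T:
    """Retry only while the device is busy; any other error propagates at once."""
    for attempt in range(attempts):
        try:
            return open_device()
        except DeviceBusy:
            if attempt == attempts - 1:
                raise  # still busy after the last attempt
            time.sleep(delay)
    raise ValueError("attempts must be positive")
```

Because only DeviceBusy is caught, a DeviceNotFound or configuration error is reported immediately instead of being retried to no effect.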
It'll be nice if we can support a new interface like: DeviceConfig::from_env().or_default("..")
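The intent of such an interface, sketched in Python rather than Rust: read the configuration text from an environment variable and fall back to a default when it is missing. The function name and the choice to treat an empty value as unset are assumptions made for illustration only.

```python
import os
from typing import Mapping


def config_text_from_env(key: str, default: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the config text stored under `key`, or `default` if it is unset or empty."""
    value = environ.get(key)
    # Assumption: an empty string is treated the same as a missing variable.
    return value if value else default
```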
When using furiosa_device::blocking::list_devices, the sensor_container of Device.Fetcher is empty. Printing the fetcher from the non-blocking function (furiosa_device::list_devices) shows:
Fetcher { device_index: 0, sensor_container: SensorContainer({
Power: [Sensor { name: "PCI Total RMS PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power1_average"} }, Sensor { name: "NE Core RMS PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power2_average"} }, Sensor { name: "NE PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power3_average"} }, Sensor { name: "PCI 12V PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power4_average"} }, Sensor { name: "PCI 3.3V PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power5_average"} }, Sensor { name: "NE 12V PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power6_average"} }],
Current: [Sensor { name: "NE Current", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr1_input"} }, Sensor { name: "PCI 12V Curr", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr2_input"} }, Sensor { name: "PCI 3.3V Curr", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr3_input"} }, Sensor { name: "NE 12V Curr", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr4_input"} }],
Temperature: [Sensor { name: "Peak", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp1_input"} }, Sensor { name: "Average", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp2_input"} }, Sensor { name: "U74M", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp3_input"} }, Sensor { name: "LPDDR4", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp4_input"} }, Sensor { name: "PCIE", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp5_input"} }, Sensor { name: "NE", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp6_input"} }, Sensor { name: "NE_PE0", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp7_input"} }, Sensor { name: "NE_PE1", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp8_input"} }, Sensor { name: "NE_TOP", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp9_input"} }, Sensor { name: "AMBIENT", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp10_input"} }],
Voltage: [Sensor { name: "NE Core Volt", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/in0_input"} }, Sensor { name: "NE Core 48V Volt", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/in1_input"} }]}) }
However, this is the fetcher from the blocking feature (furiosa_device::blocking::list_devices).
Fetcher { device_index: 0, sensor_container: SensorContainer({}) }
There's no documentation about textual device config. We need to document it.
https://docs.rs/furiosa-device/0.1.1/furiosa_device/struct.DeviceConfig.html#
Add a device management API for accessing metadata and running functions
(for https://github.com/furiosa-ai/device-api/blob/b0647e0b2ac27b987df75e07dae1ecf1490d4d55/src/sysfs.rs, like line 81 in b0647e0):
readable
writable
The furiosa_device.sync module is not working properly. Running examples/monitoring sync.py shows nothing.
NPU: npu3
======= CURRENTS =======
======= VOLTAGES =======
======= POWERS =======
======= TEMPERATURES =======
To perform some checks before publishing, I ran the command below (on branch-0.1.0), and it fails:
sy@sukyoungjeongui-MacBookPro device-api % cargo publish --dry-run
Updating crates.io index
Packaging furiosa-device v0.1.0 (/Users/sy/src/data/device-api)
error: failed to prepare local package for uploading
Caused by:
failed to open for archiving: `/Users/sy/src/data/device-api/test_data/test-0/sys/class/npu_mgmt/npu0pe0`
Caused by:
No such file or directory (os error 2)
find_devices returns Result<Vec<DeviceFile>, _>, but DeviceFile has only path, index and mode. It would be great if we could get richer information, such as the DeviceInfo and firmware_version included in Device.
If the error type DeviceResult implemented the StdError trait (https://doc.rust-lang.org/stable/std/error/trait.Error.html), it would bring various benefits, especially compatibility with error crates like eyre.
This is a tracking issue for patch release 0.1.1.
Some features and cases are easier to test in Python than in Rust (e.g. the DeviceBusy error). So we need Python tests in the device-api CI.
Specific device IDs are commonly used in many cases. It would be great if we can get DeviceFiles with various information from specific device IDs.
A few test cases in npu-tools utilize the send/recv commands of a Multicore device.
Following up on #35, DeviceConfig needs to be able to express a named Multicore device.
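When (N) is omitted, as in warboy*2, the device operates in multi-core mode. Since multi-core is a rather special case, I think it needs an explicit, dedicated notation.
When only the device name is given without (N), change it so that it corresponds to one PE.
To keep * from carrying multiple meanings and to make the expression more intuitive, change the existing * to x.
Before:
After:
@libc-furiosa @sukyoungjeong-furiosa This issue is meant to start a discussion, so please feel free to share your opinions.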
Currently, device-api provides only async/await APIs. However, calling the APIs outside an async context is likely to be more common in many cases. We need to support blocking APIs too.
The following tasks are to be done before publishing on crates.io:
cf. https://doc.rust-lang.org/cargo/reference/publishing.html
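This is the proposal I mentioned earlier for expressing various device settings, including the device config, in text form. For now it is purely a proposal. It is not urgent, but I think a feature like this should be added roughly within Q2.
The use cases are as follows.
The text representation I propose for now is as follows. FURIOSA_DEVICES is a new environment variable name that replaces NPU_DEVNAME.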
# Using specific device names
FURIOSA_DEVICES="0:0" # npu0pe0
FURIOSA_DEVICES="0:0-1" # npu0pe0-1
# Using device configs
FURIOSA_DEVICES="warboy*2" # warboy multi core mode x 2
FURIOSA_DEVICES="warboy(1)*2" # single pe x 2
FURIOSA_DEVICES="warboy(2)*2" # 2-PE fused x 2
# Using device configs with a random device
# This may be commonly used because most systems will have a single kind of NPU.
FURIOSA_DEVICES="npu(2)*2" # any 2-PE fused device x 2
# When we use multiple models in a single application
FURIOSA_DEVICES="APP1=warboy*2, APP2=warboy(2)*2" # Allow to specify two different device configurations for two applications 'APP1' and 'APP2'
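As a rough illustration of how the single-entry forms proposed above could be recognized, here is a small Python parser. It is a sketch under assumptions: the class names, the regular expressions, and the use of None for multi-core mode are all hypothetical, and the named multi-application form (APP1=..., APP2=...) is deliberately not handled.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class NamedDevice:
    """A specific device such as "0:0-1" (npu0pe0-1)."""
    device_index: int
    cores: Tuple[int, ...]


@dataclass(frozen=True)
class DeviceRequest:
    """A request such as "warboy(2)*2"; cores_per_device is None for multi-core mode."""
    arch: str
    cores_per_device: Optional[int]
    count: int


_NAMED = re.compile(r"^(\d+):(\d+)(?:-(\d+))?$")
_REQUEST = re.compile(r"^([a-z]+)(?:\((\d+)\))?\*(\d+)$")


def parse_entry(text: str) -> Union[NamedDevice, DeviceRequest]:
    text = text.strip()
    named = _NAMED.match(text)
    if named:
        first = int(named.group(2))
        last = int(named.group(3)) if named.group(3) is not None else first
        if last < first:
            raise ValueError(f"invalid core range in {text!r}")
        return NamedDevice(int(named.group(1)), tuple(range(first, last + 1)))
    request = _REQUEST.match(text)
    if request:
        cores = int(request.group(2)) if request.group(2) is not None else None
        return DeviceRequest(request.group(1), cores, int(request.group(3)))
    raise ValueError(f"unrecognized device entry: {text!r}")
```

Keeping the two forms as separate result types makes it explicit whether the caller asked for a particular device or for any device matching a shape.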
It would be useful if we could retrieve the NUMA node ID associated with the NPU device's PCI lane.
Currently, one can retrieve the NUMA node ID as follows:
root@sukyoungjeong-npu-0:~# cat /sys/class/npu_mgmt/npu5_mgmt/busname
0000:ce:00.0
root@sukyoungjeong-npu-0:~# cat /sys/bus/pci/devices/0000\:ce\:00.0/numa_node
1
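The two shell steps above can be combined into one lookup. The Python sketch below follows the same sysfs paths shown in the commands. The function name and the sysfs_root parameter (added so the function can be exercised against a fake directory tree) are assumptions, not part of device-api.

```python
from pathlib import Path


def numa_node(npu_index: int, sysfs_root: Path = Path("/sys")) -> int:
    """Look up the NUMA node of an NPU via its PCI bus name, as in the shell steps above."""
    mgmt_dir = sysfs_root / "class" / "npu_mgmt" / f"npu{npu_index}_mgmt"
    busname = (mgmt_dir / "busname").read_text().strip()
    node_file = sysfs_root / "bus" / "pci" / "devices" / busname / "numa_node"
    return int(node_file.read_text().strip())
```

On the machine shown above, numa_node(5) would follow the same path as the two cat commands.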
>>> config = DeviceConfig(arch=Arch.Warboy, mode=DeviceMode.Single, count=1)
>>> find_device_files(config)
[npu0pe0, npu0pe1]
device-api/device-api/src/config/mod.rs
Lines 58 to 67 in 3bae030
The types here are Result, so it is not guaranteed that they are examples of valid syntax.
// All available devices
let config = DeviceConfig::warboy().all();
The current implementation has a potential bug when processing comma-separated configurations including the "implicit" form.
"warboy(1)*1,0:0"
device-api should support the simplified text form, like npu:0:0 and npu:0:0-1, and we should introduce this form to users in 0.10.0 rather than the old form npu0pe0-1.
Currently we only have WarboyConfigBuilder for building DeviceConfig; we need a similar struct for Renegade.
There are also some minor problems with its usage, such as:
let config = DeviceConfig::warboy().fused().multicore().fused(); // code like this is permitted
With the current implementation, all mgmt_files are read at once.
Unnecessary reads occur when only a few fields of DeviceInfo are needed.
Moreover, the existence of unstable channels can unintentionally affect the system.
How about not reading anything during initialization, and reading values only when they are needed?
I'm wondering if the results of DeviceConfig's FromStr and Display are symmetric. If symmetry is guaranteed, transforming a string to a DeviceConfig and vice versa would be much more convenient.
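The property in question can be stated as a round-trip check. The Python sketch below uses a small stand-in Config class with a single textual form. It is not the real DeviceConfig and its grammar is an assumption, but it shows the two directions one would want to test.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for a device config with one canonical text form."""
    arch: str
    cores: int
    count: int

    def __str__(self) -> str:
        return f"{self.arch}({self.cores})*{self.count}"

    @classmethod
    def parse(cls, text: str) -> "Config":
        match = re.fullmatch(r"([a-z]+)\((\d+)\)\*(\d+)", text.strip())
        if match is None:
            raise ValueError(f"invalid config: {text!r}")
        return cls(match.group(1), int(match.group(2)), int(match.group(3)))


def value_round_trips(config: Config) -> bool:
    """parse(str(c)) == c: every value survives being printed and re-parsed."""
    return Config.parse(str(config)) == config


def text_round_trips(text: str) -> bool:
    """str(parse(s)) == s: holds only for strings already in canonical form."""
    return str(Config.parse(text)) == text
```

The two directions are not equivalent: a parser that tolerates surrounding whitespace still satisfies the first property, but the second fails for input such as " warboy(2)*2".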
furiosa_device_api is too long to use, and the suffix api seems unnecessary for a library implementation. So, I'd like to suggest renaming the crate from 'furiosa-device-api' to 'furiosa-device'.
// current
let config = DeviceConfig::default();
furiosa_device_api::find_devices(&config);
// new
let config = DeviceConfig::default();
furiosa_device::find_devices(&config).await;