device-api's People

Contributors

bg-furiosa, dependabot[bot], hyunsik, libc-furiosa, n0gu-furiosa, sukyoungjeong-furiosa, yw-furiosa

device-api's Issues

Are Async APIs necessary?

I suspect that this crate is unnecessarily entangled with async, even though async use cases are arguably rare. As a result, we have to maintain async and blocking APIs side by side.

See also #31. (If some fields of the Device struct are to be loaded lazily, should they be served asynchronously?)

Real-time values not updating

For example, the heartbeat value in device.rs comes from the device_info structure. The heartbeat is updated every second, but device_info is never refreshed after it is created, so the current heartbeat of the device cannot be obtained. I think the value should be read from the file each time it is requested, as Fetcher in hwmon.rs does.
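The read-on-request idea can be sketched in Python as follows. This is only an illustration: `LiveDeviceInfo` and the `heartbeat` file layout are hypothetical stand-ins, not the crate's actual types or sysfs paths.

```python
from pathlib import Path


class LiveDeviceInfo:
    """Hypothetical accessor that re-reads an attribute file on every request."""

    def __init__(self, mgmt_dir):
        # Only the directory is remembered; no attribute value is cached.
        self.mgmt_dir = Path(mgmt_dir)

    def heartbeat(self) -> int:
        # Each call goes back to the file, so a value that changes every
        # second is always reported as of the time of the call.
        return int((self.mgmt_dir / "heartbeat").read_text().strip())
```

The trade-off is one file read per call, which seems acceptable for values that are expected to change between calls.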

[PROPOSAL] Python interface for `DeviceConfig`

Problem

Many ML engineers are more familiar with Python than Rust, so a Python binding for device-api is required. However, because DeviceConfigBuilder and EnvBuilder use Rust generics, it is not trivial to provide exactly the same interface as in Rust.

Proposed solution

My points are:

  • Hide DeviceConfigBuilder and EnvBuilder from Python users
  • Provide a more Pythonic interface for DeviceConfig

As defined here, I suggest giving DeviceConfig a constructor with three arguments (arch, mode, count) instead of using DeviceConfigBuilder to build a DeviceConfig object in Python. I also suggest making the from_env function return a DeviceConfig object directly.

Users can think of the DeviceConfig class as being defined as below. They do not need to know anything about DeviceConfigBuilder and EnvBuilder in Rust.

from furiosa_device_python import Arch, DeviceMode
from dataclasses import dataclass

@dataclass
class DeviceConfig:
    # default DeviceConfig is <Warboy, Fusion, count=1>
    # https://github.com/furiosa-ai/device-api/blob/main/src/config/mod.rs#L88
    arch: Arch = Arch.Warboy
    mode: DeviceMode = DeviceMode.Fusion
    count: int = 1

    @classmethod
    def from_env(cls, env: str):
        return cls(...)

    @classmethod
    def from_str(cls, str: str):
        return cls(...)

Users can create DeviceConfig objects like this.

from furiosa_device_python import Arch, DeviceMode, DeviceConfig

# DeviceConfig::warboy().build()
config_default = DeviceConfig()

# DeviceConfig::warboy().multicore().build()
config_multicore = DeviceConfig(mode=DeviceMode.MultiCore)

# DeviceConfig::warboy().single().count(2)
config_single_count2 = DeviceConfig(mode=DeviceMode.Single, count=2)

# DeviceConfig::from_env("SOME_OTHER_ENV_KEY").build();
config_from_env = DeviceConfig.from_env("SOME_OTHER_ENV_KEY")

# DeviceConfig::from_str("warboy(2)*2")
config_from_str = DeviceConfig.from_str("warboy(2)*2")

In this way, all combinations of DeviceConfig that can be created in Rust can also be created in Python, and I think that we can provide a more intuitive and easy interface to Python users.

`find_devices` should consider fragmentation

Assume two Warboy NPUs (npu0 and npu1), with one core of npu1 occupied.

(npu0) {0: Available, 1: Available}
(npu1) {0: Occupied("test"), 1: Available}

If someone is looking for an available single-core device file, it would be preferable for find_devices(..) to return npu1pe1, which leaves npu0 fully available.

But the current implementation returns npu0pe0, since it does not consider fragmentation.

[DeviceFile { device_index: 0, core_indices: [0], path: "/root/src/device-api/test_data/test-0/dev/npu0pe0", mode: Single }]
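One possible selection policy, sketched in Python under stated assumptions: the occupancy map and the function name `pick_single_core` are hypothetical, and the policy (prefer the NPU with the fewest free cores) is just one way to reduce fragmentation, not the crate's implementation.

```python
def pick_single_core(occupancy):
    """Return (npu_index, core_index) of a free core on the most-occupied NPU.

    `occupancy` maps npu index -> {core index -> owner name, or None if free}.
    Returns None when no core is free.
    """
    candidates = []
    for npu, cores in occupancy.items():
        free = sorted(core for core, owner in cores.items() if owner is None)
        if free:
            # Fewer free cores means the NPU is already partly used; taking a
            # core from it keeps fully free NPUs intact for fused requests.
            candidates.append((len(free), npu, free[0]))
    if not candidates:
        return None
    _, npu, core = min(candidates)
    return npu, core


# The state described above: npu1 core 0 is occupied by "test".
state = {
    0: {0: None, 1: None},
    1: {0: "test", 1: None},
}
```

With this state, `pick_single_core(state)` selects core 1 of npu1 rather than core 0 of npu0.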

Duplicated device file returned bug

>>> config = DeviceConfig.from_str("warboy(2)*1,npu:0:0")
>>> dev_files = find_device_files(config)
>>> print(dev_files)
['npu0pe0-1', 'npu0pe0']

another single pe becomes occupied even though only a single pe is opened.

Problem

Even though only a single PE is opened, the other PE also enters the OCCUPIED state.

How to reproduce

>>> config = DeviceConfig(arch=Arch.Warboy, mode=DeviceMode.Single, count=2)
>>> find_device_files(config)
[npu0pe0, npu0pe1]
>>> config = DeviceConfig(arch=Arch.Warboy, mode=DeviceMode.Single, count=1)
>>> find_device_files(config)
[npu0pe0]
>>> pe0 = open("/dev/npu0pe0")
>>> find_device_files(config)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Device warboy(1)*1 not found

Distinguish device busy from other errors

Consider the following Python fragment (but the issue itself is unrelated to Python bindings):

import os, asyncio
from furiosa_device import *

async def main():
    fd = os.open("/dev/npu1pe0-1", os.O_RDWR)
    try:
        files = await find_device_files(DeviceConfig.from_str("npu:1:0-1"))
        print(files)
    finally:
        os.close(fd)

asyncio.run(main())

This will fail, as expected, when there is only a single NPU with two PEs. However, the following error message:

RuntimeError: Device npu:1:0-1 not found

...is misleading, as npu:1:0-1 does exist; we just cannot open it. It should read something like this instead (the actual exception type is subject to change):

RuntimeError: Device npu:1:0-1 found but still in use

Though the original message should remain when device is outright non-existent:

RuntimeError: Device npu:1:0-2 not found

A corresponding enum variant should also be added to DeviceError.

Motivation

Device busy can happen in some furiosa-runtime test cases when the runtime fails to initialize a new session for whatever reason: the session needs to open the device file, and the error might be reported before the device file has actually been closed and become available again.

While this particular issue could also be "fixed" by delaying the error report until the device file is known to be closed, that feels wrong, because the runtime may be unable to close the device file and that case would be indistinguishable from the closed case. In other words, the caller has no actual guarantee anyway (unless the runtime never returns, which would be absurd). So session initialization should retry for a while instead, and it needs to distinguish device busy from other errors in order to avoid useless retries (e.g. on an environment variable error).
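The retry behaviour argued for above might look like the following Python sketch. The exception classes and `open_with_retry` are hypothetical names introduced for illustration; the real exception types are, as noted, subject to change.

```python
import time


class DeviceError(RuntimeError):
    """Hypothetical base class standing in for the crate's DeviceError."""


class DeviceNotFound(DeviceError):
    """The requested device does not exist; retrying cannot help."""


class DeviceBusy(DeviceError):
    """The device exists but is still in use; retrying may help."""


def open_with_retry(open_device, attempts=3, delay=0.01):
    """Call `open_device`, retrying only when it reports DeviceBusy."""
    for attempt in range(attempts):
        try:
            return open_device()
        except DeviceBusy:
            if attempt == attempts - 1:
                raise  # still busy after the last attempt
            time.sleep(delay)
        # Any other error, including DeviceNotFound, propagates immediately.
```

Because only `DeviceBusy` is caught, errors that no amount of waiting can fix surface on the first attempt.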

sensor_container is empty when devices are obtained through the blocking feature

When using furiosa_device::blocking::list_devices, sensor_container of Device.Fetcher is empty. Printing the fetcher from the non-blocking function (furiosa_device::list_devices) shows:

Fetcher { device_index: 0, sensor_container: SensorContainer({
Power: [Sensor { name: "PCI Total RMS PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power1_average"} }, Sensor { name: "NE Core RMS PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power2_average"} }, Sensor { name: "NE PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power3_average"} }, Sensor { name: "PCI 12V PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power4_average"} }, Sensor { name: "PCI 3.3V PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power5_average"} }, Sensor { name: "NE 12V PWR", items: {"average": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/power6_average"} }], 
Current: [Sensor { name: "NE Current", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr1_input"} }, Sensor { name: "PCI 12V Curr", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr2_input"} }, Sensor { name: "PCI 3.3V Curr", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr3_input"} }, Sensor { name: "NE 12V Curr", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/curr4_input"} }], 
Temperature: [Sensor { name: "Peak", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp1_input"} }, Sensor { name: "Average", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp2_input"} }, Sensor { name: "U74M", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp3_input"} }, Sensor { name: "LPDDR4", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp4_input"} }, Sensor { name: "PCIE", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp5_input"} }, Sensor { name: "NE", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp6_input"} }, Sensor { name: "NE_PE0", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp7_input"} }, Sensor { name: "NE_PE1", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp8_input"} }, Sensor { name: "NE_TOP", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp9_input"} }, Sensor { name: "AMBIENT", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/temp10_input"} }], 
Voltage: [Sensor { name: "NE Core Volt", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/in0_input"} }, Sensor { name: "NE Core 48V Volt", items: {"input": "/sys/bus/pci/devices/0000:4f:00.0/hwmon/hwmon4/in1_input"} }]}) }

However, this is the fetcher returned by the blocking function (furiosa_device::blocking::list_devices):

Fetcher { device_index: 0, sensor_container: SensorContainer({}) }

Implement metadata access and function execution

Add a device management API for accessing metadata and running functions, along the lines of the accessors in https://github.com/furiosa-ai/device-api/blob/b0647e0b2ac27b987df75e07dae1ecf1490d4d55/src/sysfs.rs, such as:

pub fn busname(&mut self) -> Option<&str> {

  • readable

    • alive (-> Bool)
    • atr_error (-> Map)
    • busname (-> String)
    • dev (-> String)
    • device_state (-> Bool)
    • device_type (-> String)
    • device_uuid (-> String)
    • evb_rev (-> String)
    • fw_version (-> String)
    • heartbeat (-> Number)
    • ne_clk_freq_info (-> Map)
    • ne_dtm_policy (-> enum)
    • performance_level (-> enum)
    • performance_mode (-> enum)
    • platform_type (-> String)
    • soc_rev (-> String)
    • soc_uid (-> String)
    • version (-> String)
  • writable

    • device_led ([Bool;3] ->)
    • ne_clock (Bool ->)
    • ne_dtm_policy (enum ->)
    • performance_level (enum ->)
    • performance_mode (enum ->)

fix python `furiosa_device.sync` module

The furiosa_device.sync module is not working properly. Running examples/monitoring sync.py prints only the section headers, with no sensor values:

NPU: npu3
======= CURRENTS =======

======= VOLTAGES =======

======= POWERS =======

======= TEMPERATURES =======

`cargo publish` fails due to broken symlinks

To perform some checks before publishing, I ran the command below (on branch-0.1.0), and it fails:

sy@sukyoungjeongui-MacBookPro device-api % cargo publish --dry-run
    Updating crates.io index
   Packaging furiosa-device v0.1.0 (/Users/sy/src/data/device-api)
error: failed to prepare local package for uploading

Caused by:
  failed to open for archiving: `/Users/sy/src/data/device-api/test_data/test-0/sys/class/npu_mgmt/npu0pe0`

Caused by:
  No such file or directory (os error 2)

Add python test to CI

Some features are easier to test in Python than in Rust (e.g. the DeviceBusy error), so we need Python tests in the device-api CI.

DeviceConfig for named multicore device

A few test cases in npu-tools use the send/recv commands of a multicore device.

Following up on #35, DeviceConfig needs to be able to express a named multicore device.

Should warboy*2 be equivalent to warboy(1)*2?

Background

When (N) is omitted, as in warboy*2, the config currently runs in multi-core mode. Since multi-core is a somewhat special case, I think it should require an explicit, dedicated notation.

Proposal

  • When only the device name is given, without the PE count (N), make it correspond to a single PE.
  • Use warboy(*) as the expression that denotes multi-core.
  • To keep * from carrying multiple meanings, and to make the notation more intuitive, change the existing * to x.
    • e.g. warboy(1)x2

Before:

  • warboy*2 => warboy multicore mode x 2
  • warboy(1)*2 => warboy 1 pe x 2
  • warboy(2)*2 => warboy fusioned 2 pe x 2

After:

  • warboy x 2 => warboy 1 pe x 2
  • warboy(1)x2 => warboy 1 pe x 2
  • warboy(2)x2 => warboy fusioned 2 pe x 2
  • warboy(*)x2 => warboy multicore mode x 2 (suggestions for a better notation are welcome)

@libc-furiosa @sukyoungjeong-furiosa This issue is meant to open the discussion, so please feel free to share your opinions.
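To make the proposed notation concrete, here is a small Python sketch of a parser for it. The function name `parse_proposed`, the regular expression, and the tuple it returns are assumptions made for illustration; they are not part of the crate.

```python
import re

# arch, an optional "(N)" or "(*)", then "x" and the device count.
_PATTERN = re.compile(
    r"^(?P<arch>[a-z]+)\s*(?:\((?P<cores>\d+|\*)\))?\s*x\s*(?P<count>\d+)$"
)


def parse_proposed(text):
    """Parse the proposed notation into (arch, mode, pe_count, device_count)."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"invalid device config: {text!r}")
    arch = match.group("arch")
    cores = match.group("cores")
    count = int(match.group("count"))
    if cores == "*":
        # "(*)" is the explicit multi-core marker in the proposal.
        return (arch, "multicore", None, count)
    # A missing "(N)" now means a single PE instead of multi-core.
    pe = 1 if cores is None else int(cores)
    mode = "single" if pe == 1 else "fusion"
    return (arch, mode, pe, count)
```

Under this sketch, `warboy x 2` and `warboy(1)x2` parse to the same result, while the old `warboy*2` form is rejected.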

Need blocking APIs

Currently, device-api provides only async/await APIs. However, calling the APIs from outside an async context is likely to be the more common case, so we need to provide blocking APIs too.

[PROPOSAL] A textual representation of device configurations

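This is the proposal I mentioned earlier for expressing device configs, along with various other device settings, as text. For now it is purely a proposal. It is not urgent, but I think this feature should be added roughly within Q2.

The use cases are as follows.

  • Configuration files of a serving framework
  • Configuring, from outside and without code changes, the devices that an already-written application will use

The textual representation I propose to start with is shown below. FURIOSA_DEVICES is the new environment variable name that replaces NPU_DEVNAME.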

# Using specific device names
FURIOSA_DEVICES="0:0" # npu0pe0
FURIOSA_DEVICES="0:0-1" # npu0pe0-1

# Using device configs
FURIOSA_DEVICES="warboy*2" # warboy multi core mode x 2
FURIOSA_DEVICES="warboy(1)*2" # single pe x 2
FURIOSA_DEVICES="warboy(2)*2" # 2-pe fusioned x 2

# Using device configs with any kind of device
# This form can be commonly used because most systems will have a single kind of NPU.
FURIOSA_DEVICES="npu(2)*2" # any 2-pe fusioned device x 2

# When we use multiple models in a single application
FURIOSA_DEVICES="APP1=warboy*2, APP2=warboy(2)*2" # Specify two different device configurations for two applications, 'APP1' and 'APP2'
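As an illustration of the "specific device names" form, the following Python sketch maps a spec such as `0:0-1` to the device file name given in the comments above. The function name and the regular expression are hypothetical; the optional `npu:` prefix is accepted because other issues on this page write specs as `npu:1:0-1`.

```python
import re

# "<npu>:<pe>" or "<npu>:<first pe>-<last pe>", optionally prefixed by "npu:".
_NAMED = re.compile(r"^(?:npu:)?(\d+):(\d+)(?:-(\d+))?$")


def named_device_to_file(spec):
    """Translate a named-device spec into a device file name."""
    match = _NAMED.match(spec.strip())
    if match is None:
        raise ValueError(f"invalid device name: {spec!r}")
    npu, first, last = match.groups()
    if last is None:
        return f"npu{npu}pe{first}"
    if int(last) <= int(first):
        raise ValueError(f"invalid PE range in {spec!r}")
    return f"npu{npu}pe{first}-{last}"
```

The sketch covers only the named form; device configs such as `warboy(2)*2` and the `APP1=...` form would need their own handling.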

Retrieving NUMA node ID

It would be useful to be able to retrieve the NUMA node ID associated with the NPU device's PCI lane.

Currently, one can retrieve the NUMA node ID as follows:

root@sukyoungjeong-npu-0:~# cat /sys/class/npu_mgmt/npu5_mgmt/busname
0000:ce:00.0
root@sukyoungjeong-npu-0:~# cat /sys/bus/pci/devices/0000\:ce\:00.0/numa_node
1
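The two manual steps above can be combined into one lookup, sketched here in Python. The function name `numa_node` and the `sysfs_root` parameter (added so the lookup can be pointed at a test directory) are assumptions; the paths follow the shell session above.

```python
from pathlib import Path


def numa_node(npu_name, sysfs_root="/sys"):
    """Return the NUMA node ID of the PCI device backing `npu_name`."""
    root = Path(sysfs_root)
    # Step 1: read the PCI bus name from the management directory.
    busname_file = root / "class" / "npu_mgmt" / f"{npu_name}_mgmt" / "busname"
    busname = busname_file.read_text().strip()
    # Step 2: read numa_node from the matching PCI device directory.
    node_file = root / "bus" / "pci" / "devices" / busname / "numa_node"
    # The kernel reports -1 here when the device has no NUMA affinity.
    return int(node_file.read_text().strip())
```

On the machine shown above, `numa_node("npu5")` would follow the same two reads as the shell commands.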

docstring on DeviceConfig is not tested properly

/// DeviceConfig::from_str("0:0"); // npu0pe0
/// DeviceConfig::from_str("0:0-1"); // npu0pe0-1
///
/// // Using device configs
/// DeviceConfig::from_str("warboy*2"); // single pe x 2 (equivalent to "warboy(1)*2")
/// DeviceConfig::from_str("warboy(1)*2"); // single pe x 2
/// DeviceConfig::from_str("warboy(2)*2"); // 2-pe fusioned x 2
///
/// // Combine multiple representations separated by commas
/// DeviceConfig::from_str("0:0-1, 1:0-1"); // npu0pe0-1, npu1pe0-1

Each of these calls returns a Result that the doctest never checks, so nothing guarantees that the strings are examples of valid syntax.

Builder struct for `DeviceConfig` needs enhancement

Currently we only have WarboyConfigBuilder for building a DeviceConfig; we need a similar struct for Renegade.

There are also some minor problems with its usage, such as:

let config = DeviceConfig::warboy().fused().multicore().fused(); // code like this is permitted

suggestion for lazy loading of mgmt_files

With the current implementation, all mgmt_files are read at once.
Unnecessary reads occur when only a few fields of DeviceInfo are needed.
Moreover, the existence of unstable channels can unintentionally affect the system.

How about reading nothing during initialization and reading each value only when it is needed?
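A minimal Python sketch of this suggestion follows. `LazyDeviceInfo`, its `get` method, and the `reads` counter are hypothetical names used only to show the behaviour: nothing is read at construction, and a field is read the first time it is asked for.

```python
from pathlib import Path


class LazyDeviceInfo:
    """Hypothetical DeviceInfo that reads management files only on demand."""

    def __init__(self, mgmt_dir):
        self.mgmt_dir = Path(mgmt_dir)  # no file is read at initialization
        self._cache = {}
        self.reads = 0  # counts file reads, for demonstration only

    def get(self, name):
        # Read the file on first access and reuse the value afterwards.
        # Caching only suits static fields; a value that changes over time
        # would have to bypass the cache.
        if name not in self._cache:
            self.reads += 1
            self._cache[name] = (self.mgmt_dir / name).read_text().strip()
        return self._cache[name]
```

A caller that needs only one field then touches only that one file, and fields that are never requested are never read.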

Rename the crate name from 'furiosa-device-api' to 'furiosa-device'.

furiosa_device_api is too long to use comfortably, and the api suffix seems unnecessary for a library implementation. So I'd like to suggest renaming the crate from 'furiosa-device-api' to 'furiosa-device'.

// current
let config = DeviceConfig::default();
furiosa_device_api::find_devices(&config);

// new
let config = DeviceConfig::default();
furiosa_device::find_devices(&config).await;
