hitsz-ids / duetector
duetector🔍: Data Usage Extensible Detector for data usage observability.
Home Page: https://dataucon.idslab.io/
License: Apache License 2.0
Currently it is difficult to access the class documentation; it would be best to deploy Read the Docs so developers can browse it.
I'm currently focused on feature development and may not have time to prioritize this. Feel free to make a PR.
Cookiecutter is a Python package, easily installable with pip or other package managers, that lets you create and use templates for microservices and software projects.
I have searched for issues similar to this one.
Use examples/extension as a reference to build cookiecutter templates.
Add a README to examples/extension telling people that cookiecutter templates are available.
(Cookiecutter template) Please fork and open a draft PR on this project: https://github.com/hitsz-ids/duetector-cookiecutter.
I have searched for issues similar to this one.
Prior to the official release, we should run some benchmarks (for performance and more) to clarify the non-functional requirements for the next phase.
These are some of the things that may need to be done:
We have @WYXsb helping us with this. We also look forward to any relevant proposals and help.
Thanks!
I have searched for issues similar to this one.
Support this draft: #84
Report: https://github.com/hitsz-ids/duetector/blob/main/docs/draft/case-mnist/report.md
You need to understand Chinese, as the draft (report) is written in Chinese.
I have searched for issues similar to this one.
OpenTelemetry is an observability framework. The tentative plan is to introduce it around v0.2.0.
I think we can use it to replace the current tracking implementation, as well as to implement a generic collector.
We plan to demonstrate the capabilities of this project by having a production-level case run through by the time of the 0.1.0 release. There is currently a draft; see tracking-mljob-in-kata-containers.
For now we have the simplest case, open count. Based on the experience of writing this case, I think we still have the following to accomplish:
Any refinements to this draft or other use cases are encouraged!
I have searched for issues similar to this one.
We currently provide users with documentation in English and are continuously updating it.
If people can help us with translations, or even contribute translations on an ongoing basis (I know it's a bit early to say), it would help advance the promotion of this project!
Docs here: https://github.com/hitsz-ids/duetector/tree/main/docs
Translated documents use the `_zh` suffix. Please note `source/` is not included in the translation.
This is an example of a translated README.md document:
<p align="center">
<a href="./README.md">English</a> | <a href="./README_zh.md">中文</a>
</p>
Policy Information Point (PIP): serves as the retrieval source of attributes, or of the data required for policy evaluation, providing the information the PDP needs to make decisions (NIST SP 800-162).
In version 0.1.0 we will provide the base PIP service, which should not be limited to the existing `tracer` and `collector` implementations. I think we should introduce an indirection layer to isolate the PIP service from the existing trace/collection mechanisms. Maybe this indirection layer is what's called an `analyzer`.
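To make the idea concrete, here is a minimal sketch of what such an indirection layer might look like. All names (`TrackingRecord`, `InMemoryAnalyzer`, `ingest`, `query`) are hypothetical illustrations, not duetector's actual API:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class TrackingRecord:
    """A normalized record, decoupled from any concrete tracer or collector."""
    tracer: str
    fields: Dict[str, Any]


class InMemoryAnalyzer:
    """Minimal sketch of the indirection layer: it ingests records from the
    trace/collection side and serves attribute queries on the PIP side."""

    def __init__(self) -> None:
        self._records: List[TrackingRecord] = []

    def ingest(self, record: TrackingRecord) -> None:
        # Called by the collection side; the PIP never touches tracers directly.
        self._records.append(record)

    def query(self, tracer: Optional[str] = None) -> List[TrackingRecord]:
        # Called by the PIP service to answer PDP attribute requests.
        if tracer is None:
            return list(self._records)
        return [r for r in self._records if r.tracer == tracer]
```

The point of the layer is that the PIP only sees `TrackingRecord`s, so tracers and collectors can change underneath without breaking policy evaluation.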
I have searched for issues similar to this one.
As `bcc` is not a CO-RE eBPF framework, we would need to rely on other frameworks like libbpf, aya-rs, or cilium/ebpf, which are not written in Python. On the other hand, there may be other, non-eBPF, tracing programs that run as separate processes.
I think we can introduce a set of mechanisms for sub-processes as a way to integrate with other detectors.
Note that we have currently implemented a monitor for shell commands called `ShMonitor`, and a process daemon, `Daemon`.
We still have the following to move forward:
- A communication protocol for sub-process output (`stdout`)
- A `SubprocessMonitor`
- A `SubprocessTracer` class

Not sure this is beneficial or could benefit from #25.
See draft: #44
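As a sketch of the sub-process idea, the monitor could read newline-delimited JSON events from the external tracer's `stdout`. The function below is an assumption about the protocol, not the implementation in #44:

```python
import json
import subprocess
from typing import Dict, Iterator, List


def iter_subprocess_events(cmd: List[str]) -> Iterator[Dict]:
    """Run an external tracer and yield one event per JSON line on stdout.

    Hypothetical protocol: the sub-process prints one JSON object per line;
    any non-JSON framing would need to be agreed on separately.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    assert proc.stdout is not None
    try:
        for line in proc.stdout:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
    finally:
        proc.stdout.close()
        proc.wait()
```

A `SubprocessMonitor` could then feed these dicts into the existing collector pipeline, regardless of whether the child is eBPF-based or not.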
I have searched for issues similar to this one.
Now that we have `SubprocessMonitor` in #44, we can migrate the existing `BccTracer` to CO-RE, using the Subprocess proto. We're going to have multiple PRs for this:
- `OpenTracer` to CO-RE
- `TcpconnectTracer` to CO-RE

See design: https://github.com/hitsz-ids/duetector/blob/main/docs/design/CO-RE.md
Main Features:
Unittest:
PR(Will merge into main):
We need many tracers using eBPF to track operations on data. There are currently no tracers for read and write operations on files, so we need `read` and `write` tracers.
Add new `read` and `write` bcc tracers, just like the other tracers here.
1. See our Developer Manual.
2. Look at the implementation of other tracers.
3. Fork our project, use BCC to write bcc tracers under our framework, and test them locally.
4. Create a PR.
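For orientation, a `read` tracer in BCC style might look roughly like the skeleton below. This is an illustration only: a real tracer should subclass duetector's `BccTracer` and mirror the attribute names used by the existing tracers, and the per-pid counting program is a simplification (real tracers emit per-event data):

```python
class ReadTracer:
    """Illustrative skeleton; check existing duetector tracers for the real API."""

    # BPF C program: count vfs_read calls per pid via a kprobe.
    prog = r"""
    #include <uapi/linux/ptrace.h>

    BPF_HASH(counts, u32, u64);

    int do_count(struct pt_regs *ctx) {
        u32 pid = bpf_get_current_pid_tgid() >> 32;
        counts.increment(pid);
        return 0;
    }
    """
    attach_event = "vfs_read"  # kernel function to kprobe (hypothetical attribute name)

    def attach(self):
        # Deferred import: bcc is only available on Linux with BCC installed.
        from bcc import BPF

        bpf = BPF(text=self.prog)
        bpf.attach_kprobe(event=self.attach_event, fn_name="do_count")
        return bpf
```

A `write` tracer would follow the same shape with `vfs_write` as the probed function.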
We currently use bcc as our BPF framework, which has some shortcomings: https://github.com/hitsz-ids/duetector/blob/main/docs/design/CO-RE.md#12-status-quo
This issue will migrate the current `CloneTracer` to CO-RE form, to validate our draft and provide a worked example for CO-RE!
Hello. My understanding is that this project mainly addresses the problem that, after a data provider supplies data to a data consumer, the consumer may misuse or leak that data. The approach looks feasible, but how intrusive is this technology to the data consumer's environment, and how do you address the consumer's security concerns about deploying this system?
On one hand, if the system is fully open source and deployed by the data consumer themselves, how do you ensure the consumer does not modify the code and change its logic?
On the other hand, if the system is a black box, how can the data consumer trust deploying it? Since the system can monitor all traffic, it effectively holds a very high level of privilege. How can the consumer trust that it will not steal their data and become the data provider's Trojan horse? In data trading and circulation scenarios, the need is mostly for mutual data sharing, not one-way data provision.
I looked through the documentation links provided here but could not find a more detailed explanation, so I am asking this way. Thanks.
In order to get more people into the project faster, we created this issue. Thank you for clicking on it and thinking about how you can contribute to our project. 🎉
Good First Issues empowers first-time contributors of open-source software.
We have three levels of difficulty for you:
Participating in an issue that has already been assigned is also encouraged, but be aware that it will require more effort to get up to speed.
When you have some problems and want to ask for help:
If you have any new ideas, including Feature Requests, feel free to raise an issue as a good first issue, and the maintainer will analyze the problem and evaluate the difficulty with you.
There are currently two maintainers responsible for maintaining this project:
We currently provide a very early startup script, which I hope will have the following functionality:
- `duectl-daemon start`
- `duectl-server-daemon start`
- Run in a container (e.g. `JupyterLab`), mostly `exec`-ing one script, which can be defined by an environment variable

First, I installed it with the `pip install duetector` command from the README.md; this step completed normally.
When using the `sudo duectl start` command to start, it prompts:
ModuleNotFoundError: No module named 'bcc'
pip install duetector
sudo duectl start
It shows:
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/duetector/monitors/bcc_monitor.py", line 66, in init
from bcc import BPF # noqa
ModuleNotFoundError: No module named 'bcc'
(base) ➜ git_DIR
(base) ➜ git_DIR sudo duectl start
Password:
2023-11-03 10:38:37.271 | INFO | duetector.config:generate_config:93 - Creating default config file /Users/user_name/.config/duetector/config.toml
2023-11-03 10:38:37.272 | INFO | duetector.config:load_config:114 - Loading config from /Users/user_name/.config/duetector/config.toml
2023-11-03 10:38:37.273 | INFO | duetector.config:load_env_config:145 - Loading config from environment variables, prefix: `DUETECTOR_`, sep: `__`
2023-11-03 10:38:37.274 | INFO | duetector.config:dump_config:170 - Current config has been dumped to /private/tmp/duetector_config.toml.31588
2023-11-03 10:38:37.413 | INFO | duetector.managers.collector:init:63 - Collector DequeCollector is disabled
Traceback (most recent call last):
File "/Users/user_name/opt/anaconda3/bin/duectl", line 8, in <module>
sys.exit(cli())
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/duetector/cli/main.py", line 159, in start
monitors.append(BccMonitor(c))
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/duetector/monitors/bcc_monitor.py", line 58, in __init__
self.init()
File "/Users/user_name/opt/anaconda3/lib/python3.9/site-packages/duetector/monitors/bcc_monitor.py", line 66, in init
from bcc import BPF # noqa
ModuleNotFoundError: No module named 'bcc'
(base) ➜ git_DIR
I have searched for issues similar to this one.
Currently we don't have a good way to test `BccTracer` (and its subclasses); the main difficulty is that there are no test suites or test methods with predictable results.
I think we can simulate booting the kernel with a tool such as `qemu`, and then perform a series of predictable operations to get predictable results.
We already have related experience from compiling bcc images.
TBD
I have searched for issues similar to this one.
Currently, since `runc` containers share the kernel with each other and with the host, information about the host will also be collected inside a `runc` container; we need a way to filter it to support cloud-native environments.
A duetector running inside a container should only report on in-container processes.
A duetector running on the host should be able to distinguish whether a process comes from a container or not.
Reference:
https://github.com/cilium/cilium
https://github.com/deepflowio/deepflow
Currently bcc does not provide cgroup-related helpers; a possible idea is to get the cgroup kernfs node id through `task_struct`:
task_struct->cgroups->subsys[CGROUP_SUBSYS_COUNT]->cgroup->kn->id.id
But at the same time we want more information for tracking and analysis. For example, a docker container's cgroup:
0::/system.slice/docker-56f9992608a558ef5dbe28317de44f3459dd5968035e30508a3f1c160bb5744b.scope
where 56f9992608a558ef5dbe28317de44f3459dd5968035e30508a3f1c160bb5744b is the container's id.
Once we have the information about the process's cgroup, we can clearly conclude whether the process is running in a runc container or not.
So we need to get more information from the user-space program. Or maybe we can get all the information we need from userspace alone, without relying on eBPF.
One possible way is `cat /proc/{pid}/cgroup`. However, given the current poller trigger mechanism, we may not be able to read this information for short-lived processes. Once we've implemented #44, front-end programs will be able to run with lower latency.
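A minimal sketch of the userspace approach, matching the docker cgroup path shown above. The regex assumes systemd-style `docker-<id>.scope` naming; other runtimes (and cgroup v1 layouts) would need extra patterns:

```python
import re
from typing import Optional

# systemd-style docker cgroup scope, e.g.
# 0::/system.slice/docker-<64-hex-id>.scope
_DOCKER_SCOPE_RE = re.compile(r"docker-([0-9a-f]{64})\.scope")


def container_id_from_cgroup(cgroup_line: str) -> Optional[str]:
    """Return the docker container id if the cgroup path is a docker scope."""
    match = _DOCKER_SCOPE_RE.search(cgroup_line)
    return match.group(1) if match else None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup and return the first docker container id found."""
    try:
        with open(f"/proc/{pid}/cgroup") as f:
            for line in f:
                cid = container_id_from_cgroup(line)
                if cid:
                    return cid
    except FileNotFoundError:
        # The process already exited; this is exactly the short-lived-process
        # race described above.
        return None
    return None
```

A non-`None` result tells the host-side duetector the process is running in a runc (docker) container; `None` means host process, non-docker runtime, or a process that exited before we could read `/proc`.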
@all-contributors
please add @wunder957 for code.
please add @WYXsb for code.
In 0.0.2, we plan to bring the following functionality, to achieve the goal of monitoring a particular machine learning task.
We'll explore how the design ties the surveillance information together; if you have any ideas, please feel free to share and discuss them with us!
In #28 we found that the current `tracer` API does not support attaching multiple C functions to the BPF program, which makes it hard to track a connection's or thread's life cycle.
I think we could add `attach: List[Tuple[attach_type, attach_args]]` to `BccTracer` as an advanced way to make it more flexible.
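The proposed `attach` list could be dispatched with a simple loop over bcc's attach methods (`attach_kprobe`, `attach_kretprobe`, etc., which exist on `bcc.BPF`). The class and loop below are a sketch of the proposal, not code from `BccTracer`:

```python
from typing import Any, Dict, List, Tuple

# Proposed shape: each entry names a bcc attach method and its kwargs, so a
# tracer can hook several C functions into one BPF program.
AttachSpec = List[Tuple[str, Dict[str, Any]]]


class MultiAttachTracer:
    """Illustrative only; the real change would live on BccTracer."""

    attach: AttachSpec = [
        ("attach_kprobe", {"event": "tcp_v4_connect", "fn_name": "trace_entry"}),
        ("attach_kretprobe", {"event": "tcp_v4_connect", "fn_name": "trace_return"}),
    ]

    def attach_all(self, bpf) -> None:
        # bpf is a bcc.BPF instance; getattr resolves e.g. bpf.attach_kprobe.
        for attach_type, attach_args in self.attach:
            getattr(bpf, attach_type)(**attach_args)
```

Pairing a kprobe with a kretprobe like this is what makes life-cycle tracking (entry plus return of the same kernel function) possible in one tracer.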
In #103, we introduced `Injector` to gather more information about a process.
Now we need to support querying this information in the `analyzer`.
I have searched for issues similar to this one.
We currently support OpenTelemetry's Collector (#82), but there's no documentation on how to configure it yet.
I would like someone to help us by testing it on different telemetry systems and giving us the configuration documentation and the test results.
- `otlp-grpc`
- `otlp-http`
- `jaeger-thrift` (a compatibility option, as jaeger already natively supports otlp)
- `jaeger-grpc` (a compatibility option, as jaeger already natively supports otlp)
- `zipkin-http`
- `zipkin-json`
- `prometheus`
Please comment and let me know which telemetry system you would like to participate in so we can discuss this in more depth!
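For anyone testing, a configuration might look roughly like the sketch below. The section path and key names (`exporter`, `exporter_kwargs`) are assumptions for illustration; check the `config.toml` generated by `duectl` for the actual schema before use:

```toml
# Hypothetical sketch; verify section and key names against your generated
# ~/.config/duetector/config.toml before use.
[collector.otelcollector]
disabled = false
exporter = "otlp-grpc"

[collector.otelcollector.exporter_kwargs]
# Endpoint of an OpenTelemetry Collector, or jaeger with OTLP ingestion enabled.
endpoint = "http://localhost:4317"
insecure = true
```

When reporting results, please include the exporter name, the backend and its version, and the working `exporter_kwargs`.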
This project was initially directed towards non-intrusive detection of data usage behaviour through eBPF technology, and I'm glad we've implemented an initial framework for it. But to make the probe results available to other applications (e.g. the Data Usage Controller of the DataUcon project), we need to expose the results of our recording in a machine-readable format.
On the other hand, we need to finish standardising the storage side of things; for large numbers of events, a traditional SQL database is not a good choice.
We don't yet have a good production example to demonstrate our capabilities.
OpenTelemetry is sought after by related projects as an open source standard for observability. Although our project is far from observability in terms of observables, goals, and functions, it is similar to OpenTelemetry-related projects in terms of technical implementation, and we should be able to benefit from the development of OpenTelemetry and its related backends.
As the project has evolved, we have completed the integration with OpenTelemetry: #82. Next, we will make OpenTelemetry our primary support, and SQL databases MAY NOT be actively maintained.
We are currently using jaeger as the first backend to integrate.
We will natively support monitoring of containers in the cloud, so let's start with docker and k8s.
We will first build a querier for the jaeger backend to restore the tracer data from the backend, and then implement an analytics engine that can analyse the tracer data to derive a picture of how processes are using the data. We will refer to this process as the measurement of data usage.
We previously accepted a machine learning case for MNIST that included analysis and the associated probing points for data usage behaviours: #84. I think we could start with this case to demonstrate our data usage measurement capabilities.
Instead of splitting the project into a querier and a detector (at least not in the near future), we'll build two different images based on the same Python package (duetector). We already have different CLI entry points, so I'm sure this won't be difficult.
In addition, we need to optimise the README and the design document a bit, assuming the backend to be OpenTelemetry.
This EPIC will be released as version 1.0.0; prior to that, the features described above will be integrated as versions 0.x.y in a gradual development process.
Regarding data use measurability, I am working on some related blogs (in Chinese).