
vmdirac's Introduction

VMDIRAC

The cloud extension for the DIRAC interware.

VMDIRAC is an extension that includes cloud sites within the DIRAC framework. When there are queued jobs matching the cloud resource definition, VMDIRAC starts VMs to run those jobs. A variety of cloud platforms and configurations are supported; full documentation is available here.

Basic tests

vmdirac's People

Contributors

acasajus, aebeda3, andresailer, arrabito, atsareg, fstagni, graciani, igorpelevanyuk, jaimeibar, marianne013, mikewallace1979, mirguest, myco, rkuchumov, sfayer, sposs, ubeda, vfalbor, vmendez, xianghuzhao, zhangxiaomei


vmdirac's Issues

Something missing in VMDIRAC release tarballs for WebApp

The VMDIRAC tarballs don't contain the VMDIRAC/WebApp/static/VMDIRAC/VMDirac/build folder, which WebAppDIRAC has, for example:

WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/ResourceSummary.js
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/ResourceSummary.js.gz
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/index.html
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/index.html.gz

Maybe something is missing in the creation of the release tarballs? The webserver looks only in the build folder for VMDirac.js, so for the moment I made a link from classes to build on the certification machine to make the VMDIRAC monitor appear, but this will be gone again soon...
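For reference, a minimal sketch of that workaround (the static path below is hypothetical; the real location depends on the WebApp installation):

import os

# Hypothetical location of the VMDIRAC WebApp static files; adjust to the
# actual web server installation directory.
staticDir = "/opt/dirac/webRoot/www/static/VMDIRAC/VMDirac"

classesDir = os.path.join(staticDir, "classes")
buildDir = os.path.join(staticDir, "build")

# Point "build" at "classes" so the webserver can find VMDirac.js
if not os.path.exists(buildDir):
    os.symlink(classesDir, buildDir)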

VMDIRAC Images CS Config

I propose to reshuffle the CS a little to make the code and configuration simpler.
Now we have

Images
  Image1
    bootImageName
    contextMethod
    random strings
  Image2
    bootImageName
    contextMethod
    random strings

I propose to order it as follows

Images
  Image1
    bootImageName
    contextMethod = ssh
    ssh
      random strings
  Image2
    bootImageName
    contextMethod = adhoc
    adhoc
      random strings

What do you think @vmendez ?
I can quickly prepare a helper to make this kind of thing completely transparent; see the sketch below.

(a similar idea would apply to the endpoints)
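A minimal sketch of such a helper, assuming the proposed layout and DIRAC's gConfig API (the Images path used below is an assumption):

from DIRAC import S_ERROR, gConfig

def getContextOptions(imageName, imagesPath="/Resources/VirtualMachines/Images"):
    """Return the options of the context method configured for an image,
    hiding the proposed <contextMethod> sub-section from the caller."""
    imagePath = "%s/%s" % (imagesPath, imageName)
    contextMethod = gConfig.getValue("%s/contextMethod" % imagePath, "")
    if not contextMethod:
        return S_ERROR("contextMethod not defined for image %s" % imageName)
    # With the proposed layout the options live in Images/<imageName>/<contextMethod>
    return gConfig.getOptionsDict("%s/%s" % (imagePath, contextMethod))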

Security properties for VMDIRAC

Properties for VirtualMachineManagerHandler
a) "VirtualMachine" those used by VirtualMachineMonitor from the VM
b) "VirtualMachineManager" those used by groups to allow different operations (dirac_admin -> stop, user -> view, anonymous ? )

Define the new properties in:
Core/Security/Properties.py
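A minimal sketch of how they could be declared in Core/Security/Properties.py, following the constant-plus-string style used there (the constant names are an assumption; the property strings are the ones proposed above):

# Operations allowed to the VM itself (e.g. VirtualMachineMonitor heartbeats)
VIRTUAL_MACHINE = "VirtualMachine"
# Operations allowed to groups managing VMs (stop, view, ...)
VIRTUAL_MACHINE_MANAGER = "VirtualMachineManager"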

X509 Authentication

Authentication would be on a per-site basis, driven by CS options:

  • siteAuth = user/proxy
    1. user: use the current user/password schema
    2. proxy: use an X509 proxy of the VO to submit to such an endpoint, which is in charge of the VOMS validation.

Implementation and testing:
Use cases:

  1. cc.in2p3.fr: X509 nova keyauths using libcloud with a proxy DN that is a member of FranceGrilles
  2. cesga.es: X509 rOCCI 1.1 with OpenNebula 3.8 with a proxy DN that is a member of Ibergrid

To consider:
The user certificate used for the proxy generation can be issued as a user cert, a service cert or a robot cert, depending on the policies of the CAs and of the corresponding VOs.
Multi VO:
A site with siteAuth = proxy should have a vmCertPath[ VO ] defined at site level (currently it is defined at ..../Images/{contextMethod}). The VMScheduler then looks at the 'group' of the queued TQ jobs at submission time, maps VO = group, and submits the VM with the particular proxy of that VO. This VO should also be set in the VM's /LocalSite/ section of dirac.cfg so that only jobs of that VO are matched.
The testing of multiple VOs can be done with cc.in2p3.fr and cesga.es.
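A possible shape for these options, in the same CS style as the other examples in this thread (the site name and paths are placeholders, and the per-VO layout of vmCertPath is an assumption):

Sites
  CLOUD.SomeSite.fr
    siteAuth = proxy
    vmCertPath
      vo1 = /opt/dirac/etc/grid-security/vo1_proxy.pem
      vo2 = /opt/dirac/etc/grid-security/vo2_proxy.pem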

Use of Cloud "Sites"

As agreed with point 2 of http://indico.cern.ch/getFile.py/access?sessionId=6&resId=0&materialId=1&confId=238601:

Every cloud resource should be defined in a way that is fully compatible with the existing DIRAC tools (e.g. WMS, TS, RSS). The CS /Resources section has to be used. We identified a direct parallelism between the "usual" site definitions and the cloud ones. An example is in the doc.

This task is just to make sure that this requirement is not forgotten and that tests will be done so that we don't run into surprises. It also means that RSS might need some adaptation.

Reorganize code

The VMDIRAC code needs a quick reorganization to follow the same directory structure that DIRAC uses. In particular, the Cloud drivers should go under VMDIRAC.Resources.

OCCI flavor/image

OcciImage.py should be modified to allow the flavor/image features

VMs fail when using DIRACOS

runit is used to start the VM monitor process in at least some cases. When the VM is running with DIRACOS, runit is not available, so the monitor fails to start and the VMs get tidied up after a while. This manifests as longer jobs starting but eventually getting killed and put into the "stalled" state.

Converging contextualization method to cloud-init

Amazon and OpenStack already support it.
OpenNebula 3.8 with the econe metadata server now provides cloud-init:
http://dev.opennebula.org/issues/1768

So we move to standard cloud-init, with user-data as a common "orchestrator" script for all the clouds, and a per-image and per-endpoint metadata context including the URLs of the particular contextualization scripts for DIRAC and CVMFS, to be downloaded and run.
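A minimal sketch of the idea with libcloud's OpenStack driver, assuming user-data is passed through the ex_userdata argument of create_node (all URLs, credentials and names below are placeholders):

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

# Common "orchestrator" script passed as cloud-init user-data: it downloads and
# runs the DIRAC and CVMFS contextualization scripts published at the given URLs.
USER_DATA = """#!/bin/bash
wget -O /root/cvmfs-context.sh https://example.org/context/cvmfs-context.sh
wget -O /root/dirac-context.sh https://example.org/context/dirac-context.sh
bash /root/cvmfs-context.sh && bash /root/dirac-context.sh
"""

Driver = get_driver(Provider.OPENSTACK)
conn = Driver("user", "password",
              ex_force_auth_url="https://keystone.example.org:5000",
              ex_force_auth_version="2.0_password",
              ex_tenant_name="tenant")

image = conn.list_images()[0]
size = conn.list_sizes()[0]
node = conn.create_node(name="vmdirac-cloudinit-test", image=image, size=size,
                        ex_userdata=USER_DATA)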

Installation instructions

I've written a quick stub with the basic operations needed to install VMDIRAC on my own wiki. It would be nice to either merge it with DIRACGrid/VMDIRAC or add it to the diracgrid.org documentation (so far there is no room for extensions documentation on that portal).

Find them here

VM Stoppage: multiple options

As agreed with point 1.2 of http://indico.cern.ch/getFile.py/access?sessionId=6&resId=0&materialId=1&confId=238601:

It has to be possible to select between 2 different stoppage algorithms: Time-driven and Never. There should be the possibility to switch from one to the other based on the targeted cloud.

  • Time-driven: the VM will shut down after a (configurable) amount of time (JobAgent cycles) during which no jobs are matched.
  • Never: the VM never shuts down.

Paramount to both is being Site-driven: if the site (cloud) signals the VM to shut down, we must be ready to oblige. A minimal sketch of the time-driven logic is given below.

Considerations made in #16 also apply here.
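A minimal sketch of the time-driven policy, assuming a counter of consecutive JobAgent cycles without a matched job (class and option names are illustrative, not an existing VMDIRAC API):

class TimeDrivenStopPolicy(object):
    """Stop the VM once no job has been matched for maxEmptyCycles consecutive
    JobAgent cycles; the Never policy would simply always return False."""

    def __init__(self, maxEmptyCycles=10):
        self.maxEmptyCycles = maxEmptyCycles
        self.emptyCycles = 0

    def cycleDone(self, jobMatched):
        # Reset the counter whenever a job was matched in this cycle
        if jobMatched:
            self.emptyCycles = 0
        else:
            self.emptyCycles += 1

    def shouldStop(self):
        return self.emptyCycles >= self.maxEmptyCycles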

Needed: Bulk inserts in DB

We need bulk insertions in the DB (through the service), in particular for the heartbeats and the rest of the monitoring information. Otherwise we are constrained by the 20 requests/s limit.
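As an illustration of the kind of call we need, a multi-row insert done in one round trip (table and column names are hypothetical, not the actual VirtualMachineDB schema):

def bulkInsertHeartbeats(cursor, heartbeats):
    """Insert many heartbeat rows with a single statement instead of one
    request per VM; `heartbeats` is a list of (vmUUID, load, jobsRunning)."""
    sql = ("INSERT INTO vm_Heartbeats (VMUUID, LoadAverage, JobsRunning) "
           "VALUES (%s, %s, %s)")
    cursor.executemany(sql, heartbeats)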

Stalled VMs

I have observed an interesting behaviour caused by a wrong contextualization.

The VMs are up and running. However, the VMonitorAgent does not work as expected (the userdata was corrupted, do not ask why). VMDIRAC never sets them as Running; they stay Stalled. This is extremely dangerous because the JobAgent is running but we have absolutely no control over the VM. We need to put in place a mechanism to spot them and wipe them out asap; one possible shape is sketched below.

Problem: we may kill the JobAgent while it is processing a job!

Related to this, I did not find the "Stalled" state documented in the DB. It does not look like a final state to me, is it?
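One possible shape of such a check, assuming the DB keeps a last-heartbeat timestamp per VM (every name and threshold here is hypothetical):

import time

STALLED_GRACE_PERIOD = 3600  # seconds without a heartbeat before declaring a VM stalled

def findStalledVMs(vmRecords):
    """Return the UUIDs of VMs whose last heartbeat is older than the grace
    period; the caller would then halt them through the cloud endpoint."""
    now = time.time()
    return [vm["UUID"] for vm in vmRecords
            if now - vm["LastHeartbeat"] > STALLED_GRACE_PERIOD]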

Consider the HEPiX per site stoppage

For me, this is not clear yet:
https://twiki.cern.ch/twiki/bin/view/LCG/WMTEGEnvironmentVariables
TODO: clarify the requirements specification, then implement it.
Additionally, these are only the CernVM requisites; it would be interesting to know what the sites think about how to implement this, e.g. the IBEX cloud. At the moment there is no implementation of the linked HEPiX proposal.
It would also be interesting to know how the work in progress on a prototype at CERN using LSF is going, to learn more details.
Would it be possible for anyone at CERN to investigate this, please?

Use libcloud:OpenNebula driver

Good news: we can easily integrate libcloud to run against OpenNebula. It is a piece of cake.

from libcloud.compute.types     import Provider
from libcloud.compute.providers import get_driver

# Endpoint credentials and VM name (placeholders); note "password" instead of
# the reserved word "pass" used before.
user, password = 'oneadmin', 'secret'
hostName, portNumber, vmName = 'one.example.org', 4567, 'vmdirac-test'

cloudManagerAPI = get_driver(Provider.OPENNEBULA)
c = cloudManagerAPI(user, password, host=hostName, port=portNumber, secure=False)

# Pick an image, a network and a size, then boot a node with a small context
i = c.list_images()[5]
n = c.ex_list_networks()[1]
s = c.list_sizes()[2]
c.create_node(name=vmName, size=s, image=i, networks=n, context={'var': 'test'})

The only "but" is the number of disks, but default libcloud assumes there is only one disk to be mounted. We would need to hack a bit libcloud as we need two disks at the moment ( one HDD plus another with the contextualization scripts )

From libcloud/compute/drivers/opennebula.py (around line 685), change

        disk = ET.SubElement(compute, 'DISK')
        ET.SubElement(disk,
                      'STORAGE',
                      {'href': '/storage/%s' % (str(kwargs['image'].id))})

to

        disk = ET.SubElement(compute, 'DISK')
        if not isinstance(kwargs['image'], list):
            kwargs['image'] = [kwargs['image']]
        for image in kwargs['image']:
            ET.SubElement(disk,
                          'STORAGE',
                          {'href': '/storage/%s' % (str(image.id))})

VMMonitor miscounts jobs

It seems to report zero on my instances; I think this may be due to the interaction with containers.

VM instantiation: multiple options

As agreed with point 1.1 of http://indico.cern.ch/getFile.py/access?sessionId=6&resId=0&materialId=1&confId=238601:

1.1) VM Instantiation:
It has to be possible to select between 2 different instantiation scheduling algorithms: Jobs-driven and Slots-driven. There should be the possibility to switch from one to the other based on the targeted cloud.

  • Jobs-driven: VMDIRAC will instantiate VMs only when there are jobs in the task queues.
  • Slots-driven: VMDIRAC will instantiate VMs independently of whether there are jobs in the task queues. All the cloud "slots" will be taken.

This will require that we know exactly how many "slots" are available. This is something to be added to the Cloud drivers (bear in mind #15); a sketch of how a driver could expose this is given below.
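A minimal sketch of what such a driver hook could look like (the class, method names and maxInstances bookkeeping are assumptions, not an existing VMDIRAC API):

class CloudDriverBase(object):
    """Hypothetical base class for VMDIRAC cloud drivers."""

    def __init__(self, maxInstances):
        self.maxInstances = maxInstances

    def runningInstances(self):
        """Return the number of VMs currently alive on this endpoint."""
        raise NotImplementedError

    def availableSlots(self):
        # Slots-driven scheduling fills every remaining slot; Jobs-driven
        # scheduling additionally caps this by the number of queued jobs.
        return max(0, self.maxInstances - self.runningInstances())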

Proposal: Modify the VMDIRAC configuration structure

The goal for this proposal is to simplify the work for the administrator.

VirtualMachineScheduler configuration

There is no need to set SubmitPools with runningpod names in the VirtualMachineScheduler configuration, so the administrator does not need to write anything under Systems -> WorkloadManagement -> Production -> Agents -> VirtualMachineScheduler.

The VirtualMachineScheduler can loop over all the allowed sites and see whether each one is a cloud site: just find the "CloudEndpoint" section and check the "cloudDriver" option to see whether this is a valid cloud endpoint. A sketch of that loop is shown below.

Sites configuration

Move some of the VirtualMachine configuration to the Sites section. The purpose is to keep consistency with the cluster and grid site configuration.

This is the proposed structure:

Sites
    CLOUD.IHEP.cn
        CloudEndpoint = openstack.ihep.ac.cn, opennebula.ihep.ac.cn
        CloudEndpoints
            OwnerGroup = cloud_group
            cvmfs_http_proxy = DIRECT
            openstack.ihep.ac.cn
                cloudDriver = nova-1.1
                URI = http://...
                auth = userpasswd
                maxInstances = 20
                Setup = BES_Production
                Queues
                    SL6-BOSS
                        bootImageName = sl65-bes
                        Flavor = m1.small
                        Context = ssh-standard
                        vmPolicy = elastic
                        vmStopPolicy = elastic
                        maxEndpointInstances = 15
                        priority = 1
                        VO = bes
                        CPUTime = 86400
                        Platform = Linux_x86_64_glibc-2.12
                        architecture = x86_64
                        OS = ScientificSL_Carbon_6.5
                    SL5-CEPC
                        bootImageName = sl5-cepc
                        Flavor = m1.medium
                        Context = cloudInit-standard
                        vmPolicy = static
                        vmStopPolicy = never
                        maxEndpointInstances = 10
                        priority = 3
                        VO = cepc
                        ...
            opennebula.ihep.ac.cn
                cloudDriver = rocci-1.1
                URI = http://...
                ...
VirtualMachines
    Contexts
        cloudinit-standard
            ContextMethod = cloudinit
            vmDiracContextURL = ...
            ...
        ssh-standard
            ContextMethod = ssh
            vmDiracContextURL = ...
            ...
  • CloudEndpoint is like the CE in cluster and grid sites. CloudEndpoint configurations are moved to the specified endpoint.
  • Queues here include image, contextualization, priority, requirements, etc.
    Requirements can be put anywhere under the site configuration; the scope of a requirement depends on where it is written.
  • maxInstances can be set under both the endpoint and the image, like VMDIRAC does now.
    The maxInstances under an endpoint controls the total number of instances on that endpoint, including all images.
    The maxInstances under an image controls the instances related to that image.
  • What is your opinion about this configuration structure?

Image management

The above configuration does not include the Images part in the VirtualMachines section; all the image properties are put under the Sites section.
What about putting only an image ID under the queue section and putting all the image information separately under the VirtualMachines section, for the future image management system? A sketch of that variant follows.
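A sketch of that variant, in the same style as the proposal above (the Images section layout is an assumption):

Sites
    CLOUD.IHEP.cn
        CloudEndpoints
            openstack.ihep.ac.cn
                Queues
                    SL6-BOSS
                        ImageID = sl65-bes
                        ...
VirtualMachines
    Images
        sl65-bes
            bootImageName = sl65-bes
            Flavor = m1.small
            Context = ssh-standard
            ...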

Nova11 terminate instance

ping @vmendez

I'm refactoring Nova11 and NovaImage modules, and I have the following question:

Why does Nova11 not make use of libcloud.compute.openstack.OpenStackNodeDriver.ex_delete_image to terminate images?

Refresh pilot version inside VMs

The VMMonitorAgent must make sure that the PilotVersion stays in sync between the CS and the local dirac.cfg.

In order to do so, it has to overwrite dirac.cfg and touch a file named "stop_agent" under the control directories of all the JobAgents (if there is more than one); a minimal sketch of that part is below. Please assign to Victor. I will take care of the part that sets the version on the SetupProject call.
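A minimal sketch of the stop_agent part, assuming a control-directory layout like the one in the glob pattern below (the pattern itself is hypothetical):

import glob
import os

def requestJobAgentsRestart(controlGlob="/opt/dirac/control/WorkloadManagement/JobAgent*"):
    """Touch a stop_agent file in every JobAgent control directory so the
    agents stop after their current cycle and pick up the new dirac.cfg."""
    for controlDir in glob.glob(controlGlob):
        stopFile = os.path.join(controlDir, "stop_agent")
        with open(stopFile, "w") as fd:
            fd.write("Stopping to refresh the pilot version\n")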
