
vmdirac's Introduction

VMDIRAC

The cloud extension for the DIRAC interware.

VMDIRAC is an extension that includes cloud sites within the DIRAC framework. When there are queued jobs matching the cloud resource definition, VMDIRAC starts VMs to run those jobs. A variety of cloud platforms and configurations are supported; full documentation is available here.

Basic tests

vmdirac's People

Contributors

acasajus, aebeda3, andresailer, arrabito, atsareg, fstagni, graciani, igorpelevanyuk, jaimeibar, marianne013, mikewallace1979, mirguest, myco, rkuchumov, sfayer, sposs, ubeda, vfalbor, vmendez, xianghuzhao, zhangxiaomei


vmdirac's Issues

Something missing in VMDIRAC release tarballs for WebApp

The VMDIRAC tarballs don't contain the VMDIRAC/WebApp/static/VMDIRAC/VMDirac/build folder, which WebAppDIRAC has, for example:

WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/ResourceSummary.js
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/ResourceSummary.js.gz
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/index.html
WebAppDIRAC/WebApp/static/DIRAC/ResourceSummary/build/index.html.gz

Maybe something is missing in the creation of the release tarballs? The webserver looks only in the build folder for VMDirac.js, so for the moment I made a link from classes to build on the certification machine to make the VMDIRAC monitor appear, but this will be gone again soon...
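For reference, a minimal sketch of that workaround (the static path below is hypothetical; the real location depends on the WebApp installation):

import os

# Hypothetical location of the VMDIRAC WebApp static files; adjust to the
# actual web server installation directory.
staticDir = "/opt/dirac/webRoot/www/static/VMDIRAC/VMDirac"

classesDir = os.path.join(staticDir, "classes")
buildDir = os.path.join(staticDir, "build")

# Point "build" at "classes" so the webserver can find VMDirac.js
if not os.path.exists(buildDir):
    os.symlink(classesDir, buildDir)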

VMDIRAC Images CS Config

I propose to reshuffle the CS a little to make the code and configuration simpler.
Now we have

Images
  Image1
    bootImageName
    contextMethod
    random strings
  Image2
    bootImageName
    contextMethod
    random strings

I propose to order it as follows

Images
  Image1
    bootImageName
    contextMethod = ssh
    ssh
      random strings
  Image2
    bootImageName
    contextMethod = adhoc
    adhoc
      random strings

What do you think @vmendez ?
I can quickly prepare a helper to make this kind of thing completely transparent; see the sketch below.

(a similar idea would apply to the endpoints)
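A minimal sketch of such a helper, assuming the proposed layout and DIRAC's gConfig API (the Images path used below is an assumption):

from DIRAC import S_ERROR, gConfig

def getContextOptions(imageName, imagesPath="/Resources/VirtualMachines/Images"):
    """Return the options of the context method configured for an image,
    hiding the proposed <contextMethod> sub-section from the caller."""
    imagePath = "%s/%s" % (imagesPath, imageName)
    contextMethod = gConfig.getValue("%s/contextMethod" % imagePath, "")
    if not contextMethod:
        return S_ERROR("contextMethod not defined for image %s" % imageName)
    # With the proposed layout the options live in Images/<imageName>/<contextMethod>
    return gConfig.getOptionsDict("%s/%s" % (imagePath, contextMethod))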

Security properties for VMDIRAC

Properties for VirtualMachineManagerHandler
a) "VirtualMachine" those used by VirtualMachineMonitor from the VM
b) "VirtualMachineManager" those used by groups to allow different operations (dirac_admin -> stop, user -> view, anonymous ? )

Define the new properties in:
Core/Security/Properties.py
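A minimal sketch of how they could be declared in Core/Security/Properties.py, following the constant-plus-string style used there (the constant names are an assumption; the property strings are the ones proposed above):

# Operations allowed to the VM itself (e.g. VirtualMachineMonitor heartbeats)
VIRTUAL_MACHINE = "VirtualMachine"
# Operations allowed to groups managing VMs (stop, view, ...)
VIRTUAL_MACHINE_MANAGER = "VirtualMachineManager"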

X509 Authentication

Authentication would be on a per-site basis, driven by CS options:

  • siteAuth = user/proxy
    1. user: use the current user/password schema
    2. proxy: use an X509 proxy of the VO to submit to such an endpoint, which is in charge of the VOMS validation.

Implementation and testing:
Use cases:

  1. cc.in2p3.fr: X509 nova keyauths using libcloud with a proxy DN that is a member of FranceGrilles
  2. cesga.es: X509 rOCCI 1.1 with OpenNebula 3.8 with a proxy DN that is a member of Ibergrid

To consider:
The user certificate used for the proxy generation can be issued as a user cert, a service cert or a robot cert, depending on the policies of the CAs and of the corresponding VOs.
Multi VO:
A site with siteAuth = proxy should have a vmCertPath[ VO ] defined at site level (currently it is defined at ..../Images/{contextMethod}). The VMScheduler then looks at the 'group' of the queued TQ jobs at submission time, maps VO = group, and submits the VM with the particular proxy of that VO. This VO should also be set in the VM's /LocalSite/ section of dirac.cfg so that only jobs of that VO are matched.
The testing of multiple VOs can be done with cc.in2p3.fr and cesga.es.
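A possible shape for these options, in the same CS style as the other examples in this thread (the site name and paths are placeholders, and the per-VO layout of vmCertPath is an assumption):

Sites
  CLOUD.SomeSite.fr
    siteAuth = proxy
    vmCertPath
      vo1 = /opt/dirac/etc/grid-security/vo1_proxy.pem
      vo2 = /opt/dirac/etc/grid-security/vo2_proxy.pem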

Use of Cloud "Sites"

As agreed with point 2 of http://indico.cern.ch/getFile.py/access?sessionId=6&resId=0&materialId=1&confId=238601:

Every cloud resource should be defined in a way that is fully compatible with the existing DIRAC tools (e.g. WMS, TS, RSS). The CS /Resources section has to be used. We identified a direct parallelism between the "usual" site definitions and the cloud ones. An example is in the doc.

This task is just to make sure that this requirement is not forgotten and that tests will be done so that we don't run into surprises. It also means that RSS might need some adaptation.

Reorganize code

The VMDIRAC code needs a quick reorganization to follow the same directory structure that DIRAC uses. In particular, the Cloud drivers should go under VMDIRAC.Resources.

OCCI flavor/image

OcciImage.py should be modified to allow the flavor/image features

VMs fail when using DIRACOS

runit is used to start the VM monitor process in at least some cases. When the VM is running with DIRACOS, runit is not available, so the monitor fails to start and the VMs get tidied up after a while. This manifests as longer jobs starting but eventually getting killed and put into the "stalled" state.

Converging contextualization method to cloud-init

Amazon and OpenStack already support it.
OpenNebula 3.8 with the econe metadata server now provides cloud-init:
http://dev.opennebula.org/issues/1768

So we move to standard cloud-init, with user-data as a common "orchestrator" script for all the clouds, and a per-image and per-endpoint metadata context including the URLs of the particular contextualization scripts for DIRAC and CVMFS, to be downloaded and run.
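A minimal sketch of the idea with libcloud's OpenStack driver, assuming user-data is passed through the ex_userdata argument of create_node (all URLs, credentials and names below are placeholders):

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

# Common "orchestrator" script passed as cloud-init user-data: it downloads and
# runs the DIRAC and CVMFS contextualization scripts published at the given URLs.
USER_DATA = """#!/bin/bash
wget -O /root/cvmfs-context.sh https://example.org/context/cvmfs-context.sh
wget -O /root/dirac-context.sh https://example.org/context/dirac-context.sh
bash /root/cvmfs-context.sh && bash /root/dirac-context.sh
"""

Driver = get_driver(Provider.OPENSTACK)
conn = Driver("user", "password",
              ex_force_auth_url="https://keystone.example.org:5000",
              ex_force_auth_version="2.0_password",
              ex_tenant_name="tenant")

image = conn.list_images()[0]
size = conn.list_sizes()[0]
node = conn.create_node(name="vmdirac-cloudinit-test", image=image, size=size,
                        ex_userdata=USER_DATA)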

Installation instructions

I've written a quick stub with the basic operations needed to install VMDIRAC on my own wiki. It would be nice to either merge it with DIRACGrid/VMDIRAC or add it to the diracgrid.org documentation (so far there is no room for extensions documentation on that portal).

Find them here

VM Stoppage: multiple options

As agreed with point 1.2 of http://indico.cern.ch/getFile.py/access?sessionId=6&resId=0&materialId=1&confId=238601:

It has to be possible to select between 2 different stoppage algorithms: Time-driven and Never. There should be the possibility to switch from one to the other based on the targeted cloud.

  • Time-driven: the VM will shut down after a (configurable) amount of time (JobAgent cycles) during which no jobs are matched.
  • Never: the VM never shuts down.

Paramount to both is being Site-driven: if the site (cloud) signals the VM to shut down, we must be ready to oblige. A minimal sketch of the time-driven logic is given below.

Considerations made in #16 also apply here.
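A minimal sketch of the time-driven policy, assuming a counter of consecutive JobAgent cycles without a matched job (class and option names are illustrative, not an existing VMDIRAC API):

class TimeDrivenStopPolicy(object):
    """Stop the VM once no job has been matched for maxEmptyCycles consecutive
    JobAgent cycles; the Never policy would simply always return False."""

    def __init__(self, maxEmptyCycles=10):
        self.maxEmptyCycles = maxEmptyCycles
        self.emptyCycles = 0

    def cycleDone(self, jobMatched):
        # Reset the counter whenever a job was matched in this cycle
        if jobMatched:
            self.emptyCycles = 0
        else:
            self.emptyCycles += 1

    def shouldStop(self):
        return self.emptyCycles >= self.maxEmptyCycles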

Needed: Bulk inserts in DB

We need bulk insertions in the DB (through the service), in particular for the heartbeats and the rest of the monitoring information. Otherwise we are constrained by the 20 requests/s limit.
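As an illustration of the kind of call we need, a multi-row insert done in one round trip (table and column names are hypothetical, not the actual VirtualMachineDB schema):

def bulkInsertHeartbeats(cursor, heartbeats):
    """Insert many heartbeat rows with a single statement instead of one
    request per VM; `heartbeats` is a list of (vmUUID, load, jobsRunning)."""
    sql = ("INSERT INTO vm_Heartbeats (VMUUID, LoadAverage, JobsRunning) "
           "VALUES (%s, %s, %s)")
    cursor.executemany(sql, heartbeats)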

Stalled VMs

I have observed an interesting behaviour caused by a wrong contextualization.

The VMs are up and running. However, the VMonitorAgent does not work as expected (the userdata was corrupted, do not ask why). VMDIRAC never sets them as Running; they stay Stalled. This is extremely dangerous because the JobAgent is running but we have absolutely no control over the VM. We need to put in place a mechanism to spot them and wipe them out asap; one possible shape is sketched below.

Problem: we may kill the JobAgent while it is processing a job!

Related to this, I did not find the "Stalled" state documented in the DB. It does not look like a final state to me, is it?
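One possible shape of such a check, assuming the DB keeps a last-heartbeat timestamp per VM (every name and threshold here is hypothetical):

import time

STALLED_GRACE_PERIOD = 3600  # seconds without a heartbeat before declaring a VM stalled

def findStalledVMs(vmRecords):
    """Return the UUIDs of VMs whose last heartbeat is older than the grace
    period; the caller would then halt them through the cloud endpoint."""
    now = time.time()
    return [vm["UUID"] for vm in vmRecords
            if now - vm["LastHeartbeat"] > STALLED_GRACE_PERIOD]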

Consider the HEPiX per site stoppage

For me, this is not clear yet:
https://twiki.cern.ch/twiki/bin/view/LCG/WMTEGEnvironmentVariables
TODO: clarify the requirements specification, then implement it.
Additionally, these are only the CernVM requisites; it would be interesting to know what the sites think about how to implement this, e.g. the IBEX cloud. At the moment there is no implementation of the linked HEPiX proposal.
It would also be interesting to know how the work in progress on a prototype at CERN using LSF is going, to learn more details.
Would it be possible for anyone at CERN to investigate this, please?

Use libcloud:OpenNebula driver

Good news: we can easily integrate libcloud to run against OpenNebula. It is a piece of cake.

from libcloud.compute.types     import Provider
from libcloud.compute.providers import get_driver

# Endpoint credentials and VM name (placeholders); note "password" instead of
# the reserved word "pass" used before.
user, password = 'oneadmin', 'secret'
hostName, portNumber, vmName = 'one.example.org', 4567, 'vmdirac-test'

cloudManagerAPI = get_driver(Provider.OPENNEBULA)
c = cloudManagerAPI(user, password, host=hostName, port=portNumber, secure=False)

# Pick an image, a network and a size, then boot a node with a small context
i = c.list_images()[5]
n = c.ex_list_networks()[1]
s = c.list_sizes()[2]
c.create_node(name=vmName, size=s, image=i, networks=n, context={'var': 'test'})

The only "but" is the number of disks, but default libcloud assumes there is only one disk to be mounted. We would need to hack a bit libcloud as we need two disks at the moment ( one HDD plus another with the contextualization scripts )

From libcloud/compute/drivers/opennebula.py (around line 685), change

        disk = ET.SubElement(compute, 'DISK')
        ET.SubElement(disk,
                      'STORAGE',
                      {'href': '/storage/%s' % (str(kwargs['image'].id))})

to

        disk = ET.SubElement(compute, 'DISK')
        if not isinstance(kwargs['image'], list):
            kwargs['image'] = [kwargs['image']]
        for image in kwargs['image']:
            ET.SubElement(disk,
                          'STORAGE',
                          {'href': '/storage/%s' % (str(image.id))})

VMMonitor miscounts jobs

It seems to report zero on my instances; I think this may be due to the interaction with containers.

VM instantiation: multiple options

As agreed with point 1.1 of http://indico.cern.ch/getFile.py/access?sessionId=6&resId=0&materialId=1&confId=238601:

1.1) VM Instantiation:
It has to be possible to select between 2 different instantiation scheduling algorithms: Jobs-driven and Slots-driven. There should be the possibility to switch from one to the other based on the targeted cloud.

  • Jobs-driven: VMDIRAC will instantiate VMs only when there are jobs in the task queues.
  • Slots-driven: VMDIRAC will instantiate VMs independently of whether there are jobs in the task queues. All the cloud "slots" will be taken.

This will require that we know exactly how many "slots" are available. This is something to be added to the Cloud drivers (bear in mind #15); a sketch of how a driver could expose this is given below.
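A minimal sketch of what such a driver hook could look like (the class, method names and maxInstances bookkeeping are assumptions, not an existing VMDIRAC API):

class CloudDriverBase(object):
    """Hypothetical base class for VMDIRAC cloud drivers."""

    def __init__(self, maxInstances):
        self.maxInstances = maxInstances

    def runningInstances(self):
        """Return the number of VMs currently alive on this endpoint."""
        raise NotImplementedError

    def availableSlots(self):
        # Slots-driven scheduling fills every remaining slot; Jobs-driven
        # scheduling additionally caps this by the number of queued jobs.
        return max(0, self.maxInstances - self.runningInstances())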

Proposal: Modify the VMDIRAC configuration structure

The goal for this proposal is to simplify the work for the administrator.

VirtualMachineScheduler configuration

There is no need to set SubmitPools with runningpod names in the VirtualMachineScheduler configuration, so the administrator does not need to write anything under Systems -> WorkloadManagement -> Production -> Agents -> VirtualMachineScheduler.

The VirtualMachineScheduler can loop over all the allowed sites and see whether each one is a cloud site: just find the "CloudEndpoint" section and check the "cloudDriver" option to see whether this is a valid cloud endpoint. A sketch of that loop is shown below.

Sites configuration

Move some of the VirtualMachine configuration to the Sites section. The purpose is to keep consistency with the cluster and grid site configuration.

This is the proposed structure:

Sites
    CLOUD.IHEP.cn
        CloudEndpoint = openstack.ihep.ac.cn, opennebula.ihep.ac.cn
        CloudEndpoints
            OwnerGroup = cloud_group
            cvmfs_http_proxy = DIRECT
            openstack.ihep.ac.cn
                cloudDriver = nova-1.1
                URI = http://...
                auth = userpasswd
                maxInstances = 20
                Setup = BES_Production
                Queues
                    SL6-BOSS
                        bootImageName = sl65-bes
                        Flavor = m1.small
                        Context = ssh-standard
                        vmPolicy = elastic
                        vmStopPolicy = elastic
                        maxEndpointInstances = 15
                        priority = 1
                        VO = bes
                        CPUTime = 86400
                        Platform = Linux_x86_64_glibc-2.12
                        architecture = x86_64
                        OS = ScientificSL_Carbon_6.5
                    SL5-CEPC
                        bootImageName = sl5-cepc
                        Flavor = m1.medium
                        Context = cloudInit-standard
                        vmPolicy = static
                        vmStopPolicy = never
                        maxEndpointInstances = 10
                        priority = 3
                        VO = cepc
                        ...
            opennebula.ihep.ac.cn
                cloudDriver = rocci-1.1
                URI = http://...
                ...
VirtualMachines
    Contexts
        cloudinit-standard
            ContextMethod = cloudinit
            vmDiracContextURL = ...
            ...
        ssh-standard
            ContextMethod = ssh
            vmDiracContextURL = ...
            ...
  • CloudEndpoint is like the CE in cluster and grid sites. CloudEndpoint configurations are moved to the specified endpoint.
  • Queues here include image, contextualization, priority, requirements, etc.
    Requirements can be put anywhere under the site configuration; the scope of a requirement depends on where it is written.
  • maxInstances can be set under both the endpoint and the image, like VMDIRAC does now.
    The maxInstances under an endpoint controls the total number of instances on that endpoint, including all images.
    The maxInstances under an image controls the instances related to that image.
  • What is your opinion about this configuration structure?

Image management

The above configuration does not include the Images part in the VirtualMachines section; all the image properties are put under the Sites section.
What about putting only an image ID under the queue section and putting all the image information separately under the VirtualMachines section, for the future image management system? A sketch of that variant follows.
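A sketch of that variant, in the same style as the proposal above (the Images section layout is an assumption):

Sites
    CLOUD.IHEP.cn
        CloudEndpoints
            openstack.ihep.ac.cn
                Queues
                    SL6-BOSS
                        ImageID = sl65-bes
                        ...
VirtualMachines
    Images
        sl65-bes
            bootImageName = sl65-bes
            Flavor = m1.small
            Context = ssh-standard
            ...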

Nova11 terminate instance

ping @vmendez

I'm refactoring Nova11 and NovaImage modules, and I have the following question:

Why does Nova11 not make use of libcloud.compute.openstack.OpenStackNodeDriver.ex_delete_image to terminate images?

Refresh pilot version inside VMs

The VMMonitorAgent must make sure that the PilotVersion stays in sync between the CS and the local dirac.cfg.

In order to do so, it has to overwrite dirac.cfg and touch a file named "stop_agent" under the control directories of all the JobAgents (if there is more than one); a minimal sketch of that part is below. Please assign to Victor. I will take care of the part that sets the version on the SetupProject call.
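A minimal sketch of the stop_agent part, assuming a control-directory layout like the one in the glob pattern below (the pattern itself is hypothetical):

import glob
import os

def requestJobAgentsRestart(controlGlob="/opt/dirac/control/WorkloadManagement/JobAgent*"):
    """Touch a stop_agent file in every JobAgent control directory so the
    agents stop after their current cycle and pick up the new dirac.cfg."""
    for controlDir in glob.glob(controlGlob):
        stopFile = os.path.join(controlDir, "stop_agent")
        with open(stopFile, "w") as fd:
            fd.write("Stopping to refresh the pilot version\n")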
