Manage your site

  • Site and node registration. To learn how to manage user accounts and roles, the site definition, the list of nodes and their related status, see the GOCDB User documentation.
  • Site certification. How to test a candidate Grid site before putting it into production? [Read more]
  • Reporting. Every Friday a site report can be submitted to document the problems that affected the site during the past week, or to notify any issue of general interest. A reminder email is sent automatically, specifying how and when the report can be submitted. If nothing specific has to be documented, the report just needs to be validated by the site manager. When a downtime is over, a post-mortem needs to be included to describe any issue that may be of general relevance.
  • Monitoring.
    • Ongoing monitoring. Grid nodes operated by a given site are subject to functional monitoring on an hourly basis. A subset of tests is considered "critical". Failures of critical tests automatically trigger an alarm notification to the central operations team (and optionally to the site managers), who then file a trouble ticket against the site to request intervention (see list of alarm tests). Among the alarm tests, only the results of the CE, SE and BDII tests are considered when computing the monthly availability/reliability of the site.
    • On-demand monitoring. On-demand monitoring sessions can be triggered as needed by the site manager through the SAM-ADMIN tool (the backup instance can be reached at this link). SAM-ADMIN allows the site manager to submit tests of choice against a selected list of Grid services. On-demand tests can help the site manager check the status of a service after an update or fix. Launching a new test session once a problem is solved helps increase the availability/reliability of the site. Results of on-demand test sessions are published on the monitoring portal.
  • Availability and reliability. In case of poor monthly availability/reliability (availability < 70% AND reliability < 75%), the site manager needs to explain the reason for the poor performance recorded. This report needs to be sent to grid-operations (at) cnaf.infn.it. All reports are then collected by the Grid operations centre and made public centrally. The site needs to be monitored daily through the above-mentioned monitoring portal to ensure high availability. To improve site reliability, downtimes need to be declared well in advance, as scheduled downtimes do not impact the site reliability (see details below).
  • Downtime management. In case of planned interventions on one or more Grid services hosted by the site, a downtime needs to be declared on the GOCDB. Planned downtimes need to be notified in advance. A downtime must also be declared in case of problems affecting the services or the site as a whole. Note: the reason for the downtime needs to be properly documented, and a post-mortem description of the intervention needs to be added for reporting purposes. A downtime can be:
    1. Scheduled: if an intervention is planned well in advance. A downtime is scheduled if declared on the GOCDB at least 24 h in advance, specifying reason and duration. Best practice: for interventions that impact the end-users, declare a scheduled downtime 5 working days in advance, specifying reason and duration.
    2. Unscheduled: in case of unplanned interventions usually triggered by an unexpected failure. Unscheduled downtimes are declared less than 24 h in advance.
  • Getting support. Site managers can contact the Italian operations centre whenever they need help with any Grid operational problem affecting the site. Various communication channels can be used:
    1. (preferred) by opening a trouble ticket through the IGI helpdesk system for the Central Management Team (CMT) Department - log in with your username/password or with a personal certificate;
    2. via e-mail: it-roc<at>infn.it;
    3. by opening a trouble ticket through the central Global Grid User Support system.
  • Site/Service suspension. A site/service can be suspended, or have its status changed to uncertified, if one of the following conditions applies:
    1. sites in downtime for more than one month become uncertified
    2. non-intervention in case of high security and vulnerability risks affecting the site
    3. sites featuring persistent poor performance: availability <= 50% during 3 consecutive months
    4. a Grid service whose middleware version is declared obsolete needs to be upgraded within one month of the notification from the national operations centre. If, after one month, the service has not been updated and is still online, it is suspended. Middleware versions of Grid services are periodically monitored. For more information on the oldest middleware versions supported see here.
  • Site status. During the lifecycle of a Grid site, different statuses are possible.
    1. Candidate: site has just been declared to GOCDB and information is still not complete.
    2. Uncertified: the Operations Centre has validated the site info. The uncertified status generally indicates that a site is ready to start the certification procedure (again). It can also be used, with no time limit, for sites that have to keep an old version of the middleware.
    3. Certified: the Operations Centre has verified that the site has all middleware installed, passes the tests and appears stable.
    4. Suspended: the site temporarily does not conform to EGEE production requirements (e.g. EGEE SLAs, security matters) and requires attention.
      Suspended always has a temporary meaning. It is used to flag a site that temporarily does not meet the EGI quality requirements. While suspended, sites can state that they want to go through certification again.
    5. Closed: the site is definitively no longer operated and is only shown for historical reasons. The closed status is the terminal one. [Read more]
  • Procedure manual. EGEE-III Operational Procedures manual for Operations Centres and sites.
  • Tools.
    1. The broadcast tool.
    2. Full list of tools for monitoring, accounting and management of a site.
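
The numeric rules stated on this page (the 24 h notice that separates scheduled from unscheduled downtimes, the monthly availability/reliability thresholds that require a justification, and the "availability <= 50% during 3 consecutive months" suspension condition) can be sketched in a few lines of Python. This is only an illustrative sketch: the function and constant names are hypothetical and are not part of GOCDB, SAM-ADMIN or any other operations tool, and the treatment of a downtime declared exactly 24 h ahead as "scheduled" is an assumption based on the wording above.

```python
from datetime import datetime, timedelta

# Thresholds taken from the text of this page; all names are hypothetical.
SCHEDULED_NOTICE = timedelta(hours=24)   # minimum notice for a scheduled downtime
AVAILABILITY_MIN = 70.0                  # percent, monthly
RELIABILITY_MIN = 75.0                   # percent, monthly
SUSPENSION_AVAILABILITY = 50.0           # percent, monthly
SUSPENSION_MONTHS = 3                    # consecutive months


def classify_downtime(declared_at: datetime, starts_at: datetime) -> str:
    """Return 'scheduled' if declared at least 24 h before the start,
    otherwise 'unscheduled' (assumed reading of the rule)."""
    notice = starts_at - declared_at
    return "scheduled" if notice >= SCHEDULED_NOTICE else "unscheduled"


def needs_justification(availability: float, reliability: float) -> bool:
    """True when both monthly figures fall below their thresholds
    (availability < 70% AND reliability < 75%)."""
    return availability < AVAILABILITY_MIN and reliability < RELIABILITY_MIN


def persistent_poor_performance(monthly_availability: list[float]) -> bool:
    """True if availability was <= 50% in 3 consecutive months."""
    run = 0
    for value in monthly_availability:
        # Extend the current run of poor months, or reset it.
        run = run + 1 if value <= SUSPENSION_AVAILABILITY else 0
        if run >= SUSPENSION_MONTHS:
            return True
    return False


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 9, 0)
    print(classify_downtime(declared, datetime(2024, 1, 2, 9, 0)))
    print(classify_downtime(declared, datetime(2024, 1, 1, 20, 0)))
    print(needs_justification(65.0, 72.0))
    print(persistent_poor_performance([60.0, 50.0, 45.0, 30.0]))
```

A site manager could run checks like these against the monthly figures shown on the monitoring portal before the report is due; the actual decisions remain with the operations centre.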