Site and node registration. Learn how to manage user accounts and roles, the site definition, the list of nodes and the related status in the GOCDB User documentation.
Site certification. How to test a candidate Grid site before putting it into production? [Read more]
Reporting. Every Friday a site report can be submitted to document the problems that affected the site during the past week, or to notify any issue of general interest. A reminder email is sent automatically, specifying how and when the report can be submitted. If nothing specific has to be documented, the report just needs to be validated by the site manager. When a downtime is over, a post-mortem needs to be included to describe any issue that may be generally relevant.
Monitoring.
Ongoing monitoring. Grid nodes operated by a given site are subject to functional monitoring on an hourly basis. A subset of the tests is considered "critical". Critical tests automatically trigger an alarm notification to the central operations team (optionally also to the site managers), who then file a trouble ticket to the site to request intervention (see list of alarm tests). Among the alarm tests, only results for the CE, SE and BDII tests are considered for the computation of the monthly availability/reliability of the site.
On-demand monitoring. On-demand monitoring sessions can be triggered as needed by the site manager through the SAM-ADMIN tool (the backup instance can be contacted at this link). SAM-ADMIN allows the site manager to send tests of choice for a selected list of Grid services. On-demand tests can assist the site manager in checking the status after an update or fix. Launching a new test session after a problem is solved helps increase the availability/reliability of the site. Results of on-demand test sessions are published on the monitoring portal.
Availability and reliability. In case of poor monthly availability/reliability (availability < 70% AND reliability < 75%) the site manager needs to justify the poor performance recorded. This report needs to be sent to grid-operations (at) cnaf.infn.it. All reports are then collected by the Grid operations centre and made public centrally. The site needs to be monitored daily through the above-mentioned monitoring portal to ensure high availability. To improve the site reliability, downtimes need to be declared well in advance, as scheduled downtimes do not impact the site reliability (see details below).
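The threshold check above can be sketched in a few lines of Python. This is only an illustration: the MonthlyHours structure and the two formulas are assumptions (the text does not give the exact computation), chosen so that scheduled downtime is excluded from the reliability denominator, in line with the statement that scheduled downtimes do not affect reliability. Only the 70% and 75% thresholds and the AND condition come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text (percent).
AVAILABILITY_THRESHOLD = 70.0
RELIABILITY_THRESHOLD = 75.0


@dataclass
class MonthlyHours:
    """Hours of one month split by site state (hypothetical structure)."""
    total: float               # hours in the month
    up: float                  # hours in which the considered tests passed
    scheduled_downtime: float  # hours covered by scheduled downtimes
    unknown: float = 0.0       # hours with no test result


def availability(m: MonthlyHours) -> float:
    # Assumed formula: uptime over all hours with a known state.
    return 100.0 * m.up / (m.total - m.unknown)


def reliability(m: MonthlyHours) -> float:
    # Assumed formula: scheduled downtime is removed from the denominator,
    # so time in a scheduled downtime does not lower the figure.
    return 100.0 * m.up / (m.total - m.unknown - m.scheduled_downtime)


def needs_justification(m: MonthlyHours) -> bool:
    # Condition as written in the text: both figures below their threshold.
    return (availability(m) < AVAILABILITY_THRESHOLD
            and reliability(m) < RELIABILITY_THRESHOLD)
```

For example, a 720-hour month with 480 hours up would require a justification if 72 of the missing hours were scheduled downtime, but not if 120 were, since under the assumed formula reliability then rises above 75%.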
Downtime management. In case of planned interventions on one or more Grid services hosted by the site, a downtime needs to be declared on the GOC-DB. Planned downtimes need to be announced in advance. A downtime also needs to be declared in case of problems affecting the services or the site as a whole. Note: the motivation of the downtime needs to be properly documented, and a post-mortem description of the intervention needs to be added for reporting reasons. A downtime can be:
Scheduled: if an intervention is planned well in advance. A downtime is scheduled if declared on the GOC-DB 24 h in advance, specifying reason and duration. Best practice: for interventions that impact the end-users, declare a scheduled downtime 5 working days in advance, specifying reason and duration.
Unscheduled: in case of unplanned interventions usually triggered by an unexpected failure. Unscheduled downtimes are declared less than 24 h in advance.
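The 24-hour rule that separates the two kinds of downtime can be expressed as a small check. This is a minimal sketch: classify_downtime and its parameters are hypothetical names, not part of the GOC-DB interface, and it assumes that a declaration made exactly 24 h ahead counts as scheduled.

```python
from datetime import datetime, timedelta

# Minimum notice for a downtime to count as scheduled, as stated in the text.
SCHEDULED_NOTICE = timedelta(hours=24)


def classify_downtime(declared_at: datetime, starts_at: datetime) -> str:
    """Return "scheduled" or "unscheduled" based on the notice given."""
    notice = starts_at - declared_at
    # Exactly 24 h of notice is treated as scheduled (assumption).
    return "scheduled" if notice >= SCHEDULED_NOTICE else "unscheduled"
```

The 5-working-day best practice for interventions that affect end-users is not covered here, since it depends on a calendar of working days.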
Getting support. Site managers can contact the Italian operations centre for help with any Grid operational problem affecting the site. Various communication channels can be used:
(preferred) by opening a trouble ticket through the IGI helpdesk system for the Central Management Team (CMT) Department - login through your username/passwd or through personal certificate;
A Grid service whose middleware version is declared obsolete needs to be upgraded within one month of the notification from the national operations centre. If the service is still online after one month without being updated, it is suspended. Middleware versions of Grid services are periodically monitored. For more information on the oldest middleware versions supported see here.
Site status. During the lifecycle of a Grid site, different statuses are possible.
Candidate: site has just been declared to GOCDB and information is still not complete.
Uncertified: the Operations Centre has validated the site info. The uncertified status generally indicates that a site is ready to start the certification procedure (again). It can also be used, without a time limit, for sites that have to keep an old version of the middleware.
Certified: the Operations Centre has verified that the site has all middleware installed, passes the tests and appears stable.
Suspended: the site temporarily does not conform to EGEE production requirements (e.g. EGEE SLAs, security matters) and requires attention. Suspended always has a temporary meaning: it flags a site that is temporarily not meeting EGI quality requirements. While suspended, a site can ask to go through certification again.
Closed: site is definitely no longer operated and is only shown for history reasons. The closed status is the terminal one. [Read more]
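The site lifecycle above can be modelled as a small state machine. The five status names come from the text. The transition table is an assumption inferred from the descriptions (for instance, a suspended site returning to Uncertified to be certified again), and the real rules applied by the operations centre may allow other moves.

```python
from enum import Enum


class SiteStatus(Enum):
    CANDIDATE = "Candidate"
    UNCERTIFIED = "Uncertified"
    CERTIFIED = "Certified"
    SUSPENDED = "Suspended"
    CLOSED = "Closed"


# Assumed transitions, inferred from the status descriptions in the text.
ALLOWED_TRANSITIONS = {
    SiteStatus.CANDIDATE: {SiteStatus.UNCERTIFIED, SiteStatus.CLOSED},
    SiteStatus.UNCERTIFIED: {SiteStatus.CERTIFIED, SiteStatus.CLOSED},
    SiteStatus.CERTIFIED: {SiteStatus.SUSPENDED, SiteStatus.CLOSED},
    SiteStatus.SUSPENDED: {SiteStatus.UNCERTIFIED, SiteStatus.CLOSED},
    SiteStatus.CLOSED: set(),  # terminal status: no way out
}


def can_transition(current: SiteStatus, target: SiteStatus) -> bool:
    """Check whether a status change is allowed under the assumed table."""
    return target in ALLOWED_TRANSITIONS[current]
```

Keeping the rules in a single table makes it easy to adjust them if the actual procedure differs from this reading.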
This event comprises a collaboration between EGI and SHIWA for the conduct of two linked workshops about workflows in e-Science. It is anticipated that there is sufficient commonality in the attendance at each of the two workshops for them to be run 'back to back' and thereby for each to benefit from enhanced attendance and greater impact.
The first workshop (on Thursday) is organised by the SHIWA project with the main goal to introduce and collect feedback about the SHIWA solutions that enable cross-workflow and inter-workflow exploitation of Distributed Computing Infrastructures.
The second workshop (on Friday) is organised by EGI.eu and the Hungarian National Grid Infrastructure. This workshop seeks to bring representatives of e-Science communities together with the providers of services, workflow technology and grid infrastructure. The objective is to discuss and further clarify the requirements of 'Scientific workflow systems' so that such systems can be incorporated into the European Grid Infrastructure.