Intended for: Jobsub users

 


Scenario/Use case:

The Jobsub server makes the output logs (stdout, stderr, etc.) from completed jobs available to view through a web browser for 30 days. Links to the logs for a specific job can be found in Fifemon. These instructions let you view logs from your own jobs or, depending on your collaboration's settings, from a colleague's jobs.


Instructions:

Acquire your FNAL CILogon Silver CA certificate and load it into your browser. See KB0013239.

Locate your job in Fifemon. If you already have a link to the Jobsub server (from POMS, email, etc.), skip ahead to the step on following the link:

  1. Select your user name in the User Batch Details or User Batch History dashboard
  2. Click on the job ID in the "Job Clusters" table. You may have to narrow down the time range to when your jobs finished using the time picker in the upper right corner
  3. This will take you to the Job Cluster Summary dashboard for that job, which includes various statistics on the jobs. There should also be orange buttons near the top, labeled "View sandbox files" and "Analyze jobs in Kibana".
    1. Clicking "View sandbox files" will take you to the complete listing of log files on the Jobsub server
    2. Clicking "Analyze jobs in Kibana" will take you to a Kibana dashboard from which you can view individual job details and filter jobs based on exit code and more. It also includes direct links to the stderr and stdout logs from each job.

When you follow a link to the Jobsub server, your browser will prompt you to present a certificate. Select the CILogon certificate you obtained in the first step. If you are not prompted to select a certificate, verify that the certificate was properly loaded into the browser.

You may see a warning that the server is presenting an untrusted or invalid certificate. This happens because the Jobsub servers use a certificate issued by the OSG CILogon certificate authority (CA), which is not included in the standard browser CA packs. You can proceed and add an exception for the Jobsub servers. It is also possible to manually add the OSG CILogon certificate as a trusted CA; for more information, see KB0010758.

If you get a "User authorization has failed" error from the server, it can mean one of several things:

  1. Your proxy was not found in the MyProxy server. Storing the proxy is usually handled automatically by jobsub_submit, but it may fail in some circumstances, notably for production users where a managed proxy is used to submit jobs. To manually add your proxy to MyProxy, log in to an interactive node (e.g. dunegpvm01.fnal.gov) and run the following:
    source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups.sh
    setup cigetcert
    cigetcert -s fifebatch.fnal.gov
    You may need to repeat this procedure as often as once a week, because proxies stored in MyProxy expire after seven days. Jobsub developers are working on supporting Single-Sign-On (SSO) to alleviate this issue for users in the future.
  2. You are trying to view another user's jobs, and:
    1. You are not a member of the user's collaboration, or you do not have your account set up for the experiment. You can request that computer accounts and access be set up for you (this requires collaboration approval).
    2. Your collaboration has not elected to make job logs viewable to all users. If your collaboration wants this enabled, the liaison or computing coordinator should open a request through the Service Desk.
    3. You are not a Jobsub "superuser" for your experiment. To add a user as a superuser, the liaison or computing coordinator should open a request through the Service Desk.
    4. Note that production jobs are usually run as a special production user, e.g. dunepro, so to view production logs online your collaboration needs to have logs viewable to all users, or you need to be a "superuser". If neither is possible, you cannot view the logs online and must instead use jobsub_fetchlog --role=Production to download a tarball of the complete logs for a submission.
  3. Another error has occurred. Please contact the Service Desk and open a ticket to Distributed Computing Support.
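
The command-line steps mentioned in the list above can be collected into a small shell helper. This is only a sketch under stated assumptions, not an official tool: the function names (register_proxy, fetch_production_logs), the run wrapper, and the DRY_RUN variable are hypothetical, and the -G and --jobid options passed to jobsub_fetchlog are assumed from typical jobsub client usage and should be checked against jobsub_fetchlog --help. The setup commands, the cigetcert call, and --role=Production are taken from the instructions above.

```shell
#!/bin/bash
# Hypothetical helper functions; set DRY_RUN=1 to print commands
# instead of running them.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Store a fresh proxy in MyProxy. Stored proxies expire after seven
# days, so this may need to be repeated weekly.
register_proxy() {
    if [ "${DRY_RUN:-0}" != "1" ]; then
        # Environment setup from the instructions above (needs CVMFS).
        source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups.sh
        setup cigetcert
    fi
    run cigetcert -s fifebatch.fnal.gov
}

# Download the log tarball for a production submission.
#   $1 = experiment group (e.g. dune), $2 = job ID
# ASSUMPTION: the -G and --jobid option names are not given in this
# article; confirm them with `jobsub_fetchlog --help`.
fetch_production_logs() {
    if [ "$#" -ne 2 ]; then
        echo "usage: fetch_production_logs GROUP JOBID" >&2
        return 1
    fi
    run jobsub_fetchlog -G "$1" --jobid "$2" --role=Production
}
```

After sourcing the file on an interactive node, calling register_proxy with DRY_RUN=1 set prints "+ cigetcert -s fifebatch.fnal.gov" without contacting the server, which is a cheap way to check what would be run before doing it for real.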

 


See Also:

Fifemon

Certificates at Fermilab

Jobsub Client User Guides