
Computational Techniques for Life Sciences

Part of the TACC Institute Series, Immersive Training in Advanced Computation

Best Practices in Data Management and Collaboration

Course Objectives

This course is designed as an introduction to managing data on high-performance computing (HPC) clusters. It is intended to teach users to be self-sufficient and proactive in managing their own data in order to (1) increase their productivity, (2) maximize usage of the existing storage, (3) reduce accidental or unintentional data loss, and (4) prevent disruption of the file system or compute nodes to the point where it affects other users. This course is intended for people who have beginner-to-intermediate familiarity with a command line interface and have active accounts on a TACC HPC cluster.

This course is divided into four modules:

  1. Overview of HPC Data Management
  2. Navigating a File System
  3. Moving and Backing up Data
  4. Tips and Tricks for Maximizing File System Usage

Instructional Objectives

This course should be taught in a room equipped with computers (or attendees with laptops) and internet access. Attendees should also have an existing allocation on a TACC resource. Attendees without an allocation can still participate in most components if they have a Mac/Linux laptop, or a Windows laptop with PuTTY installed and access to a Linux server.

Specific Learning Objectives

Module 1: Overview of HPC Data Management
Topics covered in this module:
  • Why do we need best practices in data management?
  • Types of file systems used in HPC (NFS, GPFS/Lustre, LTFS, RAID).
  • Specific examples of storage infrastructures (TACC Stockyard, Ranch, Corral, WORK, SCRATCH).
  • Active vs. inactive data.
  • Staging data for compute, analysis, long term storage.
Attendees should be able to...
  • List multiple reasons for good data management.
  • Describe the similarities and differences between distributed and parallel file systems.
  • Determine whether data is backed up or vulnerable.
  • Differentiate between active and inactive data.
  • Identify the appropriate storage spaces for data at different stages of its life cycle.


Module 2: Navigating a File System
Topics covered in this module:
  • Introduction to file system limitations (file size, number of files / inodes).
  • Determining the size and age of a file.
  • Determining the size of a directory and local disk usage (`du`, `du -h`).
  • Total amount of free disk space (`df`, `df -h`).
  • Checking your quota (`quota`, `lfs quota`).
Attendees should be able to...
  • Recognize file size and number limitations for a given file system.
  • Measure and report the size and age of a file and directory.
  • Measure and report disk usage.
  • Measure and report the total amount of free disk space.
  • Find the different storage systems mounted on an HPC cluster.
  • Determine their disk quota for various file systems.
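
The commands above can be tried in a short, self-contained session. The sketch below creates a throwaway directory so the examples have something to measure; the `lfs quota` line is Lustre-specific and left commented out since it only applies on systems (such as TACC clusters) where Lustre is mounted:

```shell
# Create a scratch directory so the commands have something to measure.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/sample.txt"

# Size and last-modified time of a single file
ls -lh "$tmpdir/sample.txt"

# Total size of a directory tree, human-readable
du -sh "$tmpdir"

# Free space on every mounted file system
df -h

# Quota on a Lustre file system (uncomment on a TACC cluster)
# lfs quota -u $USER $SCRATCH

# Clean up
rm -rf "$tmpdir"
```

Comparing the output of `df -h` with the paths of your `$HOME`, `$WORK`, and `$SCRATCH` directories is a quick way to see which storage systems are mounted where.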


Module 3: Moving and Backing up Data
Topics covered in this module:
  • How old is your data and when is it time to archive it?
  • Zipping and archiving files and directories (`gzip`, `gzip -r9`, `gunzip`, `tar -czf`, `tar -xzf`).
  • Transferring data from point to point (`rsync`, `scp`, `sftp`, WinSCP).
  • Staging data on Ranch tape filesystem for archiving.
Attendees should be able to...
  • Judge whether it is appropriate to keep active or archive data.
  • Zip and unzip files with gzip / gunzip.
  • Compress and extract archives with tar.
  • Transfer data efficiently between systems.
  • Stage data on a tape file system for transfers to and from archival storage.
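
A minimal sketch of the compression, archiving, and transfer commands above, run against throwaway files. The remote host and paths in the transfer lines are placeholders and are commented out:

```shell
# Work in a throwaway directory.
tmpdir=$(mktemp -d) && cd "$tmpdir"
printf 'some log data\n' > logs.txt

# Compress a single file at maximum level (produces logs.txt.gz), then restore it
gzip -9 logs.txt
gunzip logs.txt.gz

# Bundle and compress a whole directory, then extract it
mkdir -p results && cp logs.txt results/
tar -czf results.tar.gz results/
tar -xzf results.tar.gz

# Transfer to a remote system (placeholder host and paths)
# rsync -avP results.tar.gz username@remotehost:/path/to/destination/
# scp results.tar.gz username@remotehost:/path/to/destination/

cd / && rm -rf "$tmpdir"
```

Compressing many small files into a single archive before transfer or archiving is especially important on tape systems, which handle one large file far better than thousands of small ones.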


Module 4: Tips and Tricks for Maximizing File System Usage
Topics covered in this module:
  • Best practices in directory organization; dates and file names.
  • Remove duplicate copies of data and duplication within data (dedupe).
  • Share files with colleagues instead of duplicating them (permissions - `chmod`, `chown`, `chgrp`; access control lists; `ln`).
  • Don’t install common software in your home directory.
Attendees should be able to...
  • Differentiate organized vs. unorganized data.
  • Describe strategies to keep data well organized, especially in the context of job submission.
  • Remove duplicate copies of data where appropriate.
  • Set the correct permissions for sharing data where appropriate.
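
A short sketch of sharing data via permissions and symbolic links rather than copying, using a throwaway file; the group name in the comment is a placeholder, not a real TACC group:

```shell
# Work in a throwaway directory with a sample file to share.
tmpdir=$(mktemp -d) && cd "$tmpdir"
printf 'shared data\n' > dataset.txt

# Grant group read access instead of handing out copies
chmod g+r dataset.txt
# chgrp myproject dataset.txt   # placeholder group name

# A symbolic link references the data without duplicating it
ln -s "$PWD/dataset.txt" dataset_link.txt
cat dataset_link.txt   # reads the original file through the link

cd / && rm -rf "$tmpdir"
```

One shared, group-readable copy plus links costs a single file's worth of disk and inodes, no matter how many collaborators use it.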




© 2017 Texas Advanced Computing Center