Using gzipped Sequential SAS Datasets


An important new feature in SAS 6.12 for UNIX is the ability to work with gzipped sequential SAS datasets. Prior to version 6.12, SAS datasets could only be used in their uncompressed form. Now they can be created in a form suitable for use with gzip and gzcat.

If you are working with large SAS datasets, you should read and understand this bulletin.

You should also know how to use gzipped raw data files with SAS. See the bulletin SAS_PIPES.

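Three examples are given below: converting an existing gzipped SAS dataset into a gzipped sequential SAS dataset, creating a gzipped sequential SAS dataset from raw data, and reading a gzipped sequential SAS dataset.

SAS datasets can be very large and take up a lot of disk space. We have found that gzip compression frequently shrinks a SAS dataset by 85% to 95%. Working with gzipped SAS datasets therefore makes it possible to handle much larger SAS datasets than before.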
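To get a feel for how much space gzip saves, you can compare file sizes before and after compression at the UNIX prompt. The following is a minimal sketch, not a benchmark: the sample file is a hypothetical stand-in for a SAS dataset, and because its records are identical it compresses far better than real data would. Substitute your own file to see a realistic figure.

```shell
# Build a repetitive sample file (a hypothetical stand-in for a SAS dataset).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.dat"
i=0
while [ "$i" -lt 5000 ]; do
    echo "P0001 45 1 170.5"
    i=$((i + 1))
done > "$sample"

# Compress to a separate file, leaving the original in place.
gzip -c < "$sample" > "$sample.gz"

# Compare the sizes in bytes and report the percentage saved.
orig_bytes=$(wc -c < "$sample" | tr -d ' ')
gz_bytes=$(wc -c < "$sample.gz" | tr -d ' ')
saved=$(( (orig_bytes - gz_bytes) * 100 / orig_bytes ))
echo "original: $orig_bytes bytes, gzipped: $gz_bytes bytes, saved: ${saved}%"
```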

The key to this new feature is to use sequential SAS datasets. By default, if you don't do anything special, SAS creates datasets that are designed to be used non-sequentially. You can program SAS to jump around your dataset, accessing cases in a non-sequential manner.

Most statistical applications work in a sequential manner, and so can use sequential SAS datasets without any special consideration. The non-sequential access feature is not needed.

The gzip and gzcat programs process files sequentially. To compress a file, gzip reads the file from beginning to end, outputting compressed records one after another. To uncompress a file, gzcat reverses the process, and writes uncompressed records in a way that re-creates the file from beginning to end. Neither program is able to skip around in a file, compressing or decompressing randomly specified parts of the file.
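The sequential behaviour can be seen directly at the UNIX prompt. In this small sketch the file names and record contents are made up for illustration, and `gzip -dc` is used as the equivalent of gzcat. To reach the seventh record, the reader has to let the first six stream past; there is no option to jump straight to it.

```shell
# Ten numbered records standing in for the rows of a dataset.
tmpdir=$(mktemp -d)
printf 'record %d\n' 1 2 3 4 5 6 7 8 9 10 > "$tmpdir/records.txt"
gzip -c < "$tmpdir/records.txt" > "$tmpdir/records.txt.gz"

# Decompress as a stream and pick out record 7. gzip emits records 1-6
# first; sed simply discards them as they go by.
seventh=$(gzip -dc < "$tmpdir/records.txt.gz" | sed -n '7p')
echo "$seventh"

# Decompressing from beginning to end re-creates the original file exactly.
gzip -dc < "$tmpdir/records.txt.gz" | cmp - "$tmpdir/records.txt" && echo "round trip OK"
```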

Convert a gzipped SAS Dataset into a gzipped Sequential SAS Dataset

Most people keep their SAS datasets in gzipped form, and gunzip them to work on them. When finished, they gzip them up again. This example is designed to show you how to convert those original gzipped SAS datasets into new sequential SAS datasets that are always stored in gzipped form.

A key step is the construction of the FILENAME statement. Your gzip command MUST END with an ampersand, or your SAS job will hang up, doing nothing.
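The need for the ampersand can be illustrated outside SAS with a named pipe at the UNIX prompt. Opening a named pipe blocks until some other process opens the opposite end, so a foreground gzip reading from the pipe would wait forever for a writer that never gets the chance to start. This is a likely explanation for the hang, offered as a sketch rather than a description of SAS internals. The file names are hypothetical, and `printf` plays the part of SAS writing the dataset.

```shell
tmpdir=$(mktemp -d)
fifo="$tmpdir/newds"
mkfifo "$fifo"

# Start gzip in the background. It blocks on opening the pipe until a
# writer appears; without the & the script would stall on this line.
gzip -c < "$fifo" > "$tmpdir/newds.gz" &
gzip_pid=$!

# Write to the pipe. This stands in for SAS writing the sequential dataset.
printf 'row %d\n' 1 2 3 > "$fifo"

# Wait for gzip to see end-of-file and finish, then remove the pipe.
wait "$gzip_pid"
rm "$fifo"

# Check that the compressed file holds what was written.
restored=$(gzip -dc < "$tmpdir/newds.gz")
echo "$restored"
```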

Adapt the following SAS commands to your needs. The non-sequential SAS dataset ~/revenue/prindata.ssd01.gz is converted into sequential ~/revenue/prndata.ssd.gz.

/* SAS Example to CONVERT a gzipped dataset */

/* First, decompress the non-sequential dataset */

LIBNAME revenue '~/revenue';

X "gzcat ~/revenue/prindata.ssd01.gz > ~/revenue/prindata.ssd01";

/* Make the named pipe and assign the libref new to the pipe */

X "mkfifo ~/revenue/newds" ;

LIBNAME new '~/revenue/newds' ;

/* Associate the fileref "newgz" with the gzip command */
/* Note that the gzip command MUST END with & */

FILENAME newgz PIPE 'gzip -c < ~/revenue/newds > ~/revenue/prndata.ssd.gz &';

/* Start the gzip process in the background */

DATA _null_;
INFILE newgz;
RUN;

/* Copy from lib revenue to lib new, selecting prindata */

PROC COPY IN = revenue OUT = new ;
SELECT prindata;
RUN;

/* Clean up, deleting temporary files */

PROC DATASETS LIBRARY=revenue;
DELETE prindata;
RUN;
QUIT;

X 'rm ~/revenue/newds';


Creating a gzipped Sequential SAS Dataset from Raw Data

This example is designed to show you how to create a sequential SAS dataset in gzipped form, starting with raw data on the file named "cancer.dat". The sequential SAS dataset will be named "newds.gz" at the end of the SAS job.

A key step is the construction of the FILENAME statement. Your gzip command MUST END with an ampersand, or your SAS job will hang up, doing nothing.

A slightly more complicated variation of this example would read the raw data in gzip compressed form. That exercise is left to the reader. More information can be found in the bulletin SAS_PIPES.

Adapt the following SAS commands to your needs:

/* SAS Example to CREATE a gzipped Sequential Dataset */

/* Create the named pipe "newds" */

X 'mkfifo newds' ;

/* Assign a libref to the named pipe */

LIBNAME fargo 'newds' ;

/* Use "filename pipe .." to associate a fileref with the UNIX gzip command */
/* Your gzip command MUST end with an & so it is run in the background */

FILENAME nwrpipe pipe 'gzip -c < newds > newds.gz &' ;

/* This data step starts the process associated with "nwrpipe". */
/* That process will run in the background at the same time SAS is running.*/

DATA _null_;
INFILE nwrpipe;
RUN;

/* Create a sequential SAS dataset that is written to the pipe "newds" */

DATA fargo.a;
INFILE 'cancer.dat' ;
INPUT pid $ age sex weight ;
RUN;

/* Remove the named pipe */

X 'rm newds' ;

/* The file newds.gz has been written as a sequential gzipped SAS dataset */

Reading a gzipped Sequential SAS Dataset

Once you've converted your SAS datasets into gzipped sequential form, you will have to continue to use named pipes and gzcat to access your datasets. This example is a variation on the two previous examples. Again, the key is to make sure to terminate your gzcat command with an ampersand, so that it runs in the background. If you do not, your SAS job will hang, doing nothing.
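The reading direction can be sketched at the UNIX prompt in the same way. Here the decompressor is the background process that feeds the named pipe, and the foreground reader (`wc -l`, standing in for SAS) consumes it from beginning to end. The file names and contents are hypothetical, and `gzip -dc` is used as the equivalent of gzcat.

```shell
# Prepare a small gzipped file to read back (hypothetical contents).
tmpdir=$(mktemp -d)
printf 'row %d\n' 1 2 3 | gzip -c > "$tmpdir/newds.gz"

fifo="$tmpdir/newds"
mkfifo "$fifo"

# Background writer: decompress into the pipe. It blocks until a reader
# opens the other end, which is why it must not run in the foreground.
gzip -dc < "$tmpdir/newds.gz" > "$fifo" &
writer_pid=$!

# Foreground reader: counts the rows as they stream through the pipe.
rows=$(wc -l < "$fifo" | tr -d ' ')

wait "$writer_pid"
rm "$fifo"
echo "read $rows rows"
```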

Adapt the following SAS commands to your needs:

/* SAS Example to READ a gzipped Sequential SAS Dataset */

/* Create the named pipe "newds" */

X 'mkfifo newds' ;

/* Assign a libref "fargo" to the named pipe */

LIBNAME fargo 'newds' ;

/* Use "filename pipe .." to associate a fileref with the gzcat command */
/* Your gzcat command MUST end with an & so it is run in the background */

FILENAME nwrpipe pipe 'gzcat newds.gz > newds &' ;

/* This data step starts the process associated with "nwrpipe". */
/* It will run in the background at the same time SAS is running.*/

DATA _null_;
INFILE nwrpipe;
RUN;

/* Read the sequential SAS dataset that is written to the pipe "newds" */

/* fargo.a is the member written by the CREATE example above */
DATA new;
SET fargo.a ;
RUN;

PROC REG;
MODEL weight = age sex;
RUN;

/* Remove the named pipe */

X 'rm newds' ;



For further information


See the "SAS Companion for the UNIX Environment and Derivatives" in the section "Reading from and Writing to UNIX Commands" on page 119 of the Version 6 First Edition.

And see the man page for gzip, gunzip and gzcat, 'man gzip'.
