Data Policing Using SAS

Ubaid Darwaish

Be it analytics, data science, or MI: data is at the heart of everything we do. There is no denying how important data is, but what is more important is the quality of that data. The quality of data contributes directly to the quality of work in the analytical space. How true your data is, and how consistent it is across the organization, will eventually decide the authenticity of your reports and the veracity of your predictions.

Apart from data quality, data lineage, and data integrity, there is one common problem that finds its presence in almost every industry: data duplication. It is not to be confused with data redundancy in a database, where identical records sit across different tables. The problem here is at the dataset level: datasets that contain the same data and are actually copies of one another. It starts when every team wants a copy of its own; not because they are selfish, but often because teams work in isolation and are unaware of their common data requirements. You would be surprised how many copies of ‘customer’ you can find across your organization.

This kind of duplication drives an ever-increasing appetite for storage, and if it is not kept in check it can bury you under a data avalanche. That might not be a problem on cheap commodity storage such as Hadoop, but for every other kind of storage out there, matching that requirement comes with a nightmarish price tag.

Data policing is the process where duplicate datasets are hunted down and eventually eliminated. It aims at creating single-instance storage on your platform, enabling you to use disk space more efficiently by maintaining only one copy of each file and replacing every redundant copy with a pointer to the original (sketched after the list below). Some of the benefits you get out of this process:

  • No more forever-running backups: if de-duplication is done at source, you can reduce the backup data footprint significantly.
  • Data movement across the network is reduced for disaster recovery.
  • The storage requirements for the platform don’t grow exponentially with data growth.
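
For a concrete picture of the pointer idea, here is a minimal sketch, assuming a UNIX-style platform, hypothetical file paths, and that the XCMD system option is enabled so SAS may issue shell commands:

DATA _NULL_;
    /*Sketch only, hypothetical paths: replace the duplicate dataset file
      with a symbolic link to the retained original (-f overwrites in place)*/
    X "ln -sf /home/ubaid/data/_1cars.sas7bdat /home/ubaid/logs/_1speed.sas7bdat";
RUN;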

Data policing can be quite a tedious task, considering the complexities involved, the most common being the naming itself. People can name these copies anything: their fiancée’s name, their favorite car, the movie they watched last weekend. It can literally be anything.

So how do we hunt the duplicates? The following macro will do it for you. It works by collecting metadata about the datasets along with a few other parameters, such as the observation count, and it can handle datasets with different names and still spot the duplicates.
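
Before diving into the full macro, here is a minimal sketch of that metadata-fingerprint idea, assuming a single pre-assigned libref LIB; the DICTIONARY tables used here are the PROC SQL counterparts of the VTABLE and VCOLUMN views the macro reads:

PROC SQL;
    /*Sketch only: datasets that agree on NOBS, NVAR, and total variable
      length are duplicate candidates, whatever they are named*/
    CREATE TABLE FINGERPRINT AS
    SELECT T.MEMNAME, T.NOBS, T.NVAR, SUM(C.LENGTH) AS S_LEN
    FROM DICTIONARY.TABLES AS T
         INNER JOIN DICTIONARY.COLUMNS AS C
         ON T.LIBNAME=C.LIBNAME AND T.MEMNAME=C.MEMNAME
    WHERE T.LIBNAME='LIB' AND T.MEMTYPE='DATA'
    GROUP BY T.MEMNAME, T.NOBS, T.NVAR;
QUIT;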

%MACRO DataPolice(root/*Path from where to scan*/);

/*Macro Starts*/

/*Dummy macro to restore editor color coding*/
%macro dummy; %mend dummy;

/*Wipe the work*/

PROC DATASETS LIBRARY=WORK KILL NOLIST;
QUIT;

/*List of folders and subfolders*/

DATA _NULL_;
    /*Requires the XCMD option; writes a recursive directory listing to list.txt*/
    X "find &root -type d -ls > &root/list.txt";
RUN;

/*Import directory list into SAS*/

DATA X;
    LENGTH VAR1 $300;
    infile "&root./list.txt" truncover;
    INPUT VAR1 $300.;
RUN;

/*Extract the 'path' from the imported file*/

DATA Y;
    SET X;
    _LOC1=FIND(VAR1,"&root");
    _LOC2=(LENGTH(VAR1)-_LOC1)+1;
    PATH=SUBSTR(VAR1,_LOC1,_LOC2);
RUN;

/*Count of observations(no of paths) for the loop*/

DATA _NULL_;
    SET Y;
    CALL SYMPUTX('OBSR',_N_);    /*last iteration leaves the total path count in OBSR*/
RUN;

%PUT OBSR=&obsr.;

/*For every single path*/

%do u=1 %to &obsr.;

/*One path at a time*/

DATA _NULL_;
    SET Y;
    IF _N_=&U;
    CALL SYMPUTX('PATH',PATH);    /*SYMPUTX strips the trailing blanks of the $300 value*/
RUN;

/*Libname the Path*/

LIBNAME x "&PATH.";

/*Extract info about dataset from vtable vcolumn for all datasets in that path*/

DATA VTABLE;
    SET SASHELP.VTABLE;
    IF LIBNAME='X';
RUN;

DATA VCOLUMN;
    SET SASHELP.VCOLUMN;
    IF LIBNAME='X';
RUN;

/*For first dataset*/

%if &u=1 %then %do;

/*Corresponding details from vtable,vcolumn*/

DATA DETAILS_L1A;
    LENGTH PATH $300;    /*fixed length so later PROC APPEND steps do not truncate*/
    SET VTABLE(KEEP=MEMNAME NOBS NVAR);
    PATH="&PATH";
RUN;

DATA DETAILS_L1B;
    LENGTH PATH $300;
    SET VCOLUMN(KEEP=MEMNAME NAME TYPE LENGTH);
    PATH="&PATH";
RUN;

%end;

/*If not the first path, extract and append*/

%else %do;

DATA ADD1;
    LENGTH PATH $300;    /*match the length in DETAILS_L1A so PROC APPEND works*/
    SET VTABLE(KEEP=MEMNAME NOBS NVAR);
    PATH="&PATH";
RUN;

DATA ADD2;
    LENGTH PATH $300;
    SET VCOLUMN(KEEP=MEMNAME NAME TYPE LENGTH);
    PATH="&PATH";
RUN;

PROC APPEND BASE=DETAILS_L1A DATA=ADD1;
RUN;

PROC APPEND BASE=DETAILS_L1B DATA=ADD2;
RUN;

%end;
    
/*Combine info from Vtable,Vcolumn for all datasets*/

PROC SORT DATA=DETAILS_L1A;
    BY MEMNAME PATH;
RUN;

PROC SORT DATA=DETAILS_L1B;
    BY MEMNAME PATH;
RUN;


DATA DETAILS_L1;
    MERGE DETAILS_L1B(IN=A) DETAILS_L1A(IN=B);    /*one row per column, carrying the table-level NOBS and NVAR*/
    BY MEMNAME PATH;
RUN;


/*Matching algorithm starts*/

/*Count of character variables per dataset*/

PROC SQL;
    CREATE TABLE SUMMARY_1 AS
    SELECT PATH, MEMNAME, COUNT(*) AS C_CHAR
    FROM DETAILS_L1 WHERE TYPE='char'
    GROUP BY MEMNAME, PATH;
QUIT;

/*Count of numeric variables per dataset*/

PROC SQL;
    CREATE TABLE SUMMARY_2 AS
    SELECT PATH, MEMNAME, COUNT(*) AS N_CHAR
    FROM DETAILS_L1 WHERE TYPE='num'
    GROUP BY MEMNAME, PATH;
QUIT;

/*Sum of variable lengths per dataset*/

PROC SQL;
    CREATE TABLE SUMMARY_3 AS
    SELECT PATH, MEMNAME, SUM(LENGTH) AS S_LEN
    FROM DETAILS_L1
    GROUP BY MEMNAME, PATH;
QUIT;

/*NOBS summed over column rows = observations x variables*/

PROC SQL;
    CREATE TABLE SUMMARY_4 AS
    SELECT PATH, MEMNAME, SUM(NOBS) AS S_NOBS
    FROM DETAILS_L1
    GROUP BY MEMNAME, PATH;
QUIT;

/*NVAR summed over column rows = NVAR squared*/

PROC SQL;
    CREATE TABLE SUMMARY_5 AS
    SELECT PATH, MEMNAME, SUM(NVAR) AS S_NVAR
    FROM DETAILS_L1
    GROUP BY MEMNAME, PATH;
QUIT;

/*All summary together*/

DATA SUMMARY;
    MERGE SUMMARY_1 SUMMARY_2 SUMMARY_3 SUMMARY_4 SUMMARY_5;    /*SQL GROUP BY output is already ordered by MEMNAME, PATH*/
    BY MEMNAME PATH;
RUN;


%end;  /*For u*/


/*Split into Unique,Duplicate based on Scores from algorithm above*/
 
PROC SORT DATA=SUMMARY NODUPKEY DUPOUT=DUP OUT=UNIQ;
    BY C_CHAR N_CHAR S_LEN S_NOBS S_NVAR;
RUN;

/*Create report with details*/

PROC SQL;
CREATE TABLE REPORT AS 
SELECT A.MEMNAME AS COPY,A.PATH AS COPY_PATH,B.MEMNAME AS RETAIN,B.PATH AS RETAIN_PATH FROM DUP A 
LEFT JOIN UNIQ B ON 
A.C_CHAR=B.C_CHAR AND 
A.N_CHAR=B.N_CHAR AND 
A.S_LEN=B.S_LEN AND 
A.S_NOBS=B.S_NOBS AND 
A.S_NVAR=B.S_NVAR;
QUIT;

/*Algorithm ends*/

/*Suffix for the filename*/

DATA _NULL_;
    /*Replace the colons from DATETIME16. so the value is file-name safe*/
    CALL SYMPUTX('DT',TRANSLATE(PUT(DATETIME(),DATETIME16.),'-',':'));
RUN;
 

/*Create report in PDF*/

OPTIONS NONUMBER;

ODS PDF FILE="/home/ubaid/DDD_Report_&DT..pdf" DPI=100 COLOR=GRAY STYLE=SASWEB;

FOOTNOTE1 "Source Code/Algorithm by Ubaid Darwaish.";
FOOTNOTE2 "Generated (&_SASSERVERNAME, &SYSSCPL) on %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) at %TRIM(%SYSFUNC(TIME(), TIMEAMPM12.))";

PROC REPORT DATA=REPORT NOWD;
    COLUMN COPY COPY_PATH RETAIN RETAIN_PATH;
    DEFINE COPY/DISPLAY 'COPY';
    DEFINE COPY_PATH/DISPLAY 'COPY_PATH';
    DEFINE RETAIN/DISPLAY 'ORIGINAL';
    DEFINE RETAIN_PATH/DISPLAY 'ORIGINAL_PATH';

    /*LOGO inclusion*/

    COMPUTE BEFORE _PAGE_ / LEFT
        STYLE=[PREIMAGE='/home/ubaid/ud.png'
               FONT_WEIGHT=BOLD
               FONT_SIZE=5
               FOREGROUND=cx993300];
        LINE "DataPolice ©.";
    ENDCOMP;
RUN;


FOOTNOTE1;
FOOTNOTE2;

/*Close the PDF destination*/
ODS PDF CLOSE;

%mend;

/*Macro ends*/

I used the following code to test the macro:

/*Demo*/

LIBNAME X '/home/ubaid/data';
LIBNAME Y '/home/ubaid/logs';
LIBNAME Z '/home/ubaid/data/in_data';

DATA X._1CARS X._1MOTOR Y._1SPEED Z._1TEMP;
    SET SASHELP.CARS;
RUN;

DATA Y._2CLASS Z._2SCHOOL;
    SET SASHELP.CLASS;
RUN;

DATA Z.MAP;
    SET SASHELP.ORSALES;
RUN;

%DataPolice(/home/ubaid);

Important notes:

  • This is a bare-basic version of data policing; it can be customized and optimized further.
  • The identification methods are limited to the metadata level plus a couple of extra parameters; metadata equivalence does not translate to data equivalence, so confirm candidates before deleting anything (see the verification sketch below).
  • VTABLE and VCOLUMN are actually views. If you intend to run this macro over a significant number of directories and sub-directories, it is more efficient to assign all the libnames first and then materialize the VTABLE and VCOLUMN views into physical tables.
  • This macro has been designed for SAS datasets only.
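
Because metadata equivalence does not guarantee data equivalence, a candidate pair from the report should be verified before anything is deleted. A minimal verification sketch, using two of the datasets created in the demo above:

PROC COMPARE BASE=X._1CARS COMPARE=Y._1SPEED;
RUN;

/*PROC COMPARE sets the automatic macro variable SYSINFO;
  a value of 0 means the two datasets match exactly*/
%PUT SYSINFO=&SYSINFO;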

Sample Output:

[Figure: Data_Police sample report]

Contributed by: Ubaid Darwaish