# Author: Xiaoqin (Alex) Wei
# Language: Python
import pandas as pd
import numpy as np
import time
from tqdm.auto import tqdm
# Hook tqdm into pandas so .progress_apply calls report progress.
tqdm.pandas(desc="Loading Data")
# Read the raw Compustat extract; low_memory=False avoids mixed-dtype chunked parsing.
compustat = pd.read_csv(r"your input path", low_memory=False)
# Keep only primary ('P') and co-primary ('C') security links.
compustat = compustat[compustat['LINKPRIM'].isin(['P', 'C'])]
# Coerce the date and fiscal-year columns to numeric; unparseable entries become NaN.
for col in ('datadate', 'fyear'):
    compustat[col] = pd.to_numeric(compustat[col], errors='coerce')
# Backfill missing 'fyear' from 'datadate' (a YYYYMMDD-style number), vectorised
# instead of a slow row-wise progress_apply. Compustat convention: fiscal years
# ending in June or later (month > 5) belong to that calendar year, otherwise
# to the previous one. A NaN 'datadate' still propagates to a NaN 'fyear',
# matching the original row-wise logic.
year = compustat['datadate'] // 10000
month = (compustat['datadate'] % 10000) // 100
compustat['fyear'] = compustat['fyear'].fillna(year.where(month > 5, year - 1))
# --- Stage 1: within each (GVKEY, LPERMNO, fyear) group keep the record(s)
# --- with the fewest missing fields (ties all survive, as before).
firm_year = ['GVKEY', 'LPERMNO', 'fyear']
compustat['missing'] = compustat.isnull().sum(axis=1)
compustat['duplicates'] = compustat.groupby(firm_year)['GVKEY'].transform('count')
compustat['minmis'] = compustat.groupby(firm_year)['missing'].transform('min')
is_unique = compustat['duplicates'] == 1
fewest_missing = compustat['missing'] == compustat['minmis']
compustat = compustat.loc[is_unique | fewest_missing]
# --- Stage 2: within each (LPERMNO, fyear) pair prefer the most recent 'datadate'.
permno_year = ['LPERMNO', 'fyear']
compustat['duplicates'] = compustat.groupby(permno_year)['LPERMNO'].transform('count')
compustat['maxdat'] = compustat.groupby(permno_year)['datadate'].transform('max')
latest = compustat['datadate'] == compustat['maxdat']
compustat = compustat.loc[(compustat['duplicates'] == 1) | latest]
# Drop the helper columns used only for de-duplication bookkeeping.
# Reassignment (rather than inplace=True) avoids SettingWithCopyWarning on the
# filtered frame.
compustat = compustat.drop(columns=['missing', 'duplicates', 'minmis', 'maxdat'])
# Persist the cleaned panel. index=False keeps the stale, non-contiguous row
# index (left over from the filtering above) out of the output file, which the
# original call wrote as a spurious unnamed first column.
compustat.to_csv(r"your output path", index=False)
# References:
# - Clean CRSP/Compustat Merged database, 钰媛沁风 (2022)
# - How to remove duplicate GVKEY-DATADATE when using Compustat Annual (FUNDA) and Quarterly (FUNDQ)?, Kai Chen (2015)
# - Look into CRSP/Compustat link table, Kai Chen (2015)