If you want to keep a secret, you must also hide it from yourself.

In order to learn (or teach) data science you need data (surprise!). The best libraries often come with a toy dataset to illustrate examples of how the code works. However, nothing can replace an actual, non-trivial dataset for a tutorial or lesson, because only real data allows deep and meaningful exploration. Unfortunately, non-trivial datasets can be hard to find for a few reasons, one of which is that many contain personally identifying information (PII).

A possible solution to dealing with PII is to anonymize the dataset by replacing information that would identify a real individual with information about a fake (but similarly behaving or sounding) individual. Unfortunately, this is not as easy as it sounds. A simple mapping of real data to randomized data is not enough, because in order to be used as a stand-in for analytical purposes, anonymization must preserve the semantics of the original data. As a result, issues related to entity resolution, like managing duplicates or producing linkable results, frequently come into play.

The good news is that we can take a cue from the database community, who routinely generate simulated data to evaluate the performance of a database system. This community has developed plenty of tools for generating very realistic data for a variety of information types. For this post, I'll explore using the Faker library to generate a realistic, anonymized dataset that can be used for downstream analysis.

The goal: given a target dataset (for example, a CSV file with multiple columns), produce a new dataset such that for each row in the target, the anonymized dataset does not contain any personally identifying information.
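To make the "preserve the semantics" point concrete, here is a minimal sketch of one way to keep replacements consistent, so that a person who appears in several rows still appears as one (fake) person after anonymization. Everything in it is an assumption for illustration: the helper names `make_mapper`, `numbered`, and `anonymize`, the column names, and the sample rows are hypothetical, and the placeholder generators only use the standard library so the snippet runs on its own.

```python
import csv
import io
import itertools


def make_mapper(generate):
    """Wrap a zero-argument generator so each real value always maps to the same fake value."""
    seen = {}     # real value -> fake value already assigned to it
    used = set()  # fake values handed out so far

    def mapper(real):
        if real not in seen:
            fake = generate()
            while fake in used:  # don't collapse two real people into one fake one
                fake = generate()
            used.add(fake)
            seen[real] = fake
        return seen[real]

    return mapper


def numbered(template):
    """Hypothetical placeholder generator: 'Person 1', 'Person 2', ..."""
    counter = itertools.count(1)
    return lambda: template.format(next(counter))


def anonymize(source, target, mappers):
    """Copy CSV rows from source to target, replacing only the columns named in mappers."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column, mapper in mappers.items():
            row[column] = mapper(row[column])
        writer.writerow(row)


# Made-up sample data: the first and third rows describe the same person.
raw = (
    "name,email,score\n"
    "Ada Smith,ada@example.com,91\n"
    "Bob Jones,bob@example.com,78\n"
    "Ada Smith,ada@example.com,85\n"
)

out = io.StringIO()
anonymize(io.StringIO(raw), out, {
    "name": make_mapper(numbered("Person {}")),
    "email": make_mapper(numbered("user{}@example.org")),
})
print(out.getvalue())
```

With Faker installed, the placeholder generators could be swapped for Faker's own providers, for example `make_mapper(Faker().name)` and `make_mapper(Faker().email)`, to get realistic-looking values while keeping the same consistency guarantee. Note that the mapping dictionary is itself sensitive, since it links real values to fake ones, so it should not be shipped alongside the anonymized file.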