Now you can share your data without worrying about data security

Namaste,

We all know that data is the new oil (and insights are the new king). With data being generated from innumerable sources (Facebook, Twitter, YouTube, Snapchat, Uber, web traffic, Google searches), data security should not be limited to “contractual terms”.

Here are a few facts about data:

“We create as much information in two days now as we did from the dawn of man through 2003.” – Eric Schmidt

90% of the world’s data was generated over the last two years.

Every second, 40,000 searches are performed on Google.

Every minute, 4.1 million YouTube videos are watched.

I must say, data never sleeps!

With data at the center of everything, securing private and confidential information is a must. We are not trying to solve the world’s data security problem. Instead, through this series of blog posts, we will show techniques to anonymize and secure our customers’ data (while preserving analytic utility; re-read this line).

But wait a second. What are we trying to solve? Through data security techniques, we want to protect end users’ and end customers’ personal information (name, email, phone, national ID), as well as other confidential and sensitive information such as revenue, salary, internal data, patient health information, trip routes, personal chat messages, etc.

Removing or encrypting such attributes is not a solution, since it would destroy the data’s analytic utility. Handing over all this information without any controls is not a solution either, since it would lead to data privacy issues. What should we do then? Ideally, we want to be in the middle of this curve: the privacy risk should be at an acceptable level while analytic utility is preserved.

[Figure: privacy curve]

Are there any methods that can help maintain an appropriate balance between privacy protection and analytic utility? This is what we will learn in this series of blog posts.

Here’s how we will structure the next set of posts:

  1. Types of identifiers (direct identifiers and quasi-identifiers)
  2. Methods and techniques to protect these identifiers
    1. Suppression
    2. Generalization
    3. Randomization
    4. Pseudonymization
  3. Methods to protect:
    1. Cross-sectional data
    2. Longitudinal data
    3. Unstructured data
  4. Other data protection methods
    1. Mapping
    2. Rounding
    3. Top and Bottom coding
    4. Data synthesis
  5. Data sharing options
    1. Sub-sampling
    2. VPN – Protected infrastructure
  6. Conclusion
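
As a rough preview of the four techniques listed under item 2, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the method the later posts will necessarily use: the record, its field names, the secret key, the noise range, and every helper function are hypothetical and made up for this example.

```python
import hashlib
import hmac
import random

# Hypothetical record; every field name and value is made up for illustration.
record = {
    "name": "Asha Rao",
    "email": "asha@example.com",
    "age": 37,
    "salary": 82000,
}

# Assumption: the key is kept secret and stored outside the shared data set.
SECRET_KEY = b"replace-with-a-real-secret"


def suppress(rec, fields):
    """Suppression: drop the listed fields entirely."""
    return {k: v for k, v in rec.items() if k not in fields}


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def randomize(value, spread, rng):
    """Randomization: add bounded uniform integer noise to a numeric value."""
    return value + rng.randint(-spread, spread)


def pseudonymize(value, key=SECRET_KEY):
    """Pseudonymization: replace a value with a keyed hash (HMAC-SHA256).

    The same input and key always give the same token, so records can
    still be linked without exposing the original value.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated only to keep the example readable


def anonymize(rec, rng):
    """Apply one technique per field; the field-to-technique choice is illustrative."""
    out = suppress(rec, {"name"})
    out["email"] = pseudonymize(rec["email"])
    out["age"] = generalize_age(rec["age"])
    out["salary"] = randomize(rec["salary"], spread=1000, rng=rng)
    return out


if __name__ == "__main__":
    print(anonymize(record, random.Random(0)))
```

In this sketch the name is removed, the email becomes a stable token, the age becomes a ten-year band, and the salary stays within 1,000 of its true value, so an analyst could still group by age band or compare salaries approximately. Which technique suits which attribute, and how much noise is acceptable, depends on the data and the risk involved.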

The purpose of this post was to set the stage for how we protect customers’ data and to suggest practical approaches to data anonymization and sharing.

How do you protect this information? What tools and techniques do you use? We would love to know.

Thanks,

R

References:

https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#f56f49460ba9

Eric Schmidt: Every 2 Days We Create As Much Information As We Did Up To 2003

https://www.marketingprofs.com/charts/2017/32531/the-incredible-amount-of-data-generated-online-every-minute-infographic

http://www.ehealthinformation.ca/media/