AIRCC PUBLISHING CORPORATION
A COMPARATIVE STUDY OF THE APPROACH PROVIDED FOR PREVENTING THE DATA LEAKAGE
Kamaljeet Kaur1, Ishu Gupta2 and Ashutosh Kumar Singh2
1Govt. Sr. Sec. School, Ambala, Haryana, India
2Department of Computer Applications, National Institute of Technology, Kurukshetra, Haryana, India
Data are among the most valuable assets of an organization and need to be secured. Due to limited computational resources, customers outsource their workload to the cloud and economically enjoy massive computational power, bandwidth, storage, and even appropriate software that can be shared in a pay-per-use manner. Despite the tremendous benefits of cloud computing, protection of customers’ confidential data is a major concern. Data leakage involves the intentional or unintentional release of secure or confidential information to a non-trusted environment. Data leakage poses a serious issue for companies as the number of incidents and the cost to those experiencing them continue to increase. Data leakage is aggravated by the fact that transmitted data (both inbound and outbound), including emails, instant messaging, website forms and file transfers, are largely unregulated and unmonitored on the way to their destination. A data leakage prevention system (DLPS) is a strategy for making sure that end users do not send confidential data or information outside the corporate network. This review paper studies data leakage prevention by examining its challenges and data protection approaches, as well as their limitations. This survey of DLPS can benefit academics as well as professionals.
Data Access & Protection, Data Leakage Prevention (DLP), Information Security, Insider Attacks, Sensitive Data
Data leakage is defined as the accidental or intentional distribution of confidential data to an unauthorized entity. Confidential data of companies and organizations include intellectual property, financial information, personal credit card data, information about their sanctions and other information depending upon the business. Data leakage is a serious threat to organizations as the number of incidents and the cost to those experiencing them continue to increase. Data leakage is magnified by the fact that transmitted data are not regulated and monitored on the way to their destination. Data can be leaked through digital media as well as by the company’s own personnel.
As shown in Fig. 1, data leakage is more severe when it is carried out by insiders. The researchers discovered that despite the security policies, procedures, and tools currently in place, employees around the world are engaging in risky behaviours that put corporate and personal data at risk. Organizations provide easy access to databases for information sharing, and storage and compression technology has allowed for more powerful (high-risk) endpoints. An 80-MB mobile device now holds 6000 Microsoft Word documents or 720,000 emails, and new 64-GB removable devices allow an entire hard drive to be copied onto a device the size of a pack of gum. These devices make it easier for employees, partners, or data thieves to access, move, or lose intellectual property or customer data. Mitigating data leakage from insider threats is a difficult challenge. Data leakage can occur in many forms and in any place. A 2014 survey of cybercrime in the United States emphasizes the seriousness of attacks caused by company insiders. According to the survey report, companies experienced 37% internal attacks caused by insiders, and the researchers noted that insider attacks are more destructive than attacks performed from outside the company. Private information was accidentally exposed in 82% of cases, and confidential accounts were stolen in 76% of cases.
Figure 1: Main Cause of Data Leakage
According to statistics, insider attacks account for a high share of the attacks that cause data leakage. By using Deep Content Analysis (DCA) techniques such as rule-based matching, regular expressions, database fingerprinting, exact file matching and statistical analysis, a DLPS determines the ‘sensitivity’ of information and detects ‘sensitive’ information within traffic. This can be done either to classify the information into categories (e.g. ‘confidential’, ‘secret’) or to detect sensitive information within (outgoing) data. When a sensitive piece of information is found leaving the company, the DLPS triggers the appropriate alert and action. DLP controls and supporting information security controls need to be implemented in time so that their effectiveness can be monitored over time. This helps to improve the management of data with minimum risk. The aim of designing and developing DLPSs is to prevent data breaches.
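The rule-based and regular-expression techniques mentioned above can be illustrated with a minimal Python sketch. It is not taken from any surveyed system: the keyword rules, the category labels, the function names and the card-number pattern are assumptions chosen purely for illustration. The Luhn checksum is added only to reduce false positives on arbitrary digit runs.

```python
import re

# Hypothetical rules (assumptions for illustration only).
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13-16 digits, optional separators
KEYWORD_RULES = {
    "secret": re.compile(r"\bproject\s+falcon\b", re.IGNORECASE),        # made-up code name
    "confidential": re.compile(r"\binternal\s+use\s+only\b", re.IGNORECASE),
}


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify(text: str) -> set:
    """Return the set of sensitivity labels whose rule matches `text`."""
    labels = {label for label, rx in KEYWORD_RULES.items() if rx.search(text)}
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        labels.add("card-number")
    return labels


def should_alert(outgoing: str) -> bool:
    """A DLPS would raise an alert when any rule fires on outgoing data."""
    return bool(classify(outgoing))
```

For example, `classify("Card 4111 1111 1111 1111 attached")` yields the `card-number` label, while an ordinary message yields an empty set. Real deployments would combine such rules with fingerprinting and statistical analysis, which this sketch does not attempt.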
The data leakage problem can be addressed by using a Data Leakage/Loss Prevention System (DLPS), shown in Fig. 2. Generally, a DLPS as represented in Fig. 3 is used to discover, monitor, and protect data at rest, data in motion and data in use.
Figure 2: Data Leakage Prevention System
Figure 3: Data leakage prevention system (DLPS)
2. RELATED WORK
This section analyses the various works that have been proposed in the area of data leakage prevention. Tahboub et al. described the importance of information to companies and the seriousness of corporate data leakage. Their paper studied the systems currently used to protect data and the DLP system in terms of their components, the methods used and the differences between them. It explained the difference between existing systems and DLP systems and described the importance of covering the shortcomings of current DLP systems, such as developing policies and integrating with other systems and content types such as encryption, audio, video and images.
Raman et al. described the importance of the DLP research area. They discussed common DLP approaches and the associated problems. In addition, they suggested new directions for future work and introduced text clustering and social network analysis as future solutions for the data leakage problem.
Wu et al. presented an active DLP model to resolve the issue of parsing different file formats that current commercial DLP systems encounter. A user’s key stroke behaviour is analyzed by directly observing the different control keys and inferring the actual user typing based on such keys. After recording and analyzing the typing frequency, the actual data creator can be identified as well. This combined approach provides high visibility into the content of files without parsing them.
Alneyadi et al. presented a DLP classification model based on the well-known information retrieval function TF-IDF to define term weights. The classification is based on measuring the similarity between the documents and the category centroids. The model was tested against different scenarios such as known, partially known and unknown data, which helps a DLPS to deal with data leakage.
Costante et al. presented a hybrid approach for data leakage prevention. Existing approaches focus either on prevention, e.g. by applying signature-based techniques that are unable to detect zero-day attacks, or on detection, e.g. by applying anomaly-based techniques that suffer from a high false positive rate and thus have high operational costs. The hybrid approach overcomes these limitations by combining a white-box anomaly-based detection technique, which is able to raise alerts for any previously unseen transaction, with a rule-based prevention technique. The rule-based technique blocks transactions that an operator has previously flagged as malicious. The operator determines whether an anomaly raised by the detection engine is an actual attack or not.
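The idea of classifying a document by its TF-IDF similarity to category centroids can be sketched as follows. This is a minimal illustration and not the model of Alneyadi et al.: the whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency over the training documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in training are ignored."""
    tf = Counter(tokenize(text))
    return {t: f * idf[t] for t, f in tf.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def train(labelled):
    """`labelled` maps a category name to a list of example documents."""
    all_docs = [d for docs in labelled.values() for d in docs]
    idf = build_idf(all_docs)
    centroids = {c: centroid([tfidf(d, idf) for d in docs])
                 for c, docs in labelled.items()}
    return idf, centroids


def classify(text, idf, centroids):
    """Return the category whose centroid is most similar to `text`."""
    vec = tfidf(text, idf)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))
```

A document that shares no terms with the training data has zero similarity to every centroid, so a practical system would need a similarity threshold or an explicit "unknown" category; the sketch simply returns the first category in that case.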
3. CHALLENGES IN DLPS
Common behaviours that create a potential risk of data leakage include weak physical and logical access control, accessing unauthorized websites, leaving passwords unprotected, and many more. This section illustrates the current challenges, shown in Fig. 4, that DLP has to solve:
1. Encryption Challenge- Encryption is only one approach to securing data; security also requires access control, data integrity, system availability and auditing. Moreover, it is difficult to detect and intercept encrypted confidential data and to recognize data leakage occurring over encrypted channels.
2. Access Control Challenge- In the field of information security, access control is a way of limiting access to a system, or to physical or virtual resources. In a corporate environment, it is not easy to control employees’ access to data repositories. For example, an employee who wants to access data from a project he/she is not involved in can steal information if the access control system grants all employees full access to all code repositories.
3. New Data and Customization Challenge- It is difficult to customize a DLP system for a particular employee if the system relies on older methods of data protection such as regular expressions, keywords, or digital fingerprints. Creating regular expressions and manual keywords for a new customization can take a long time. Moreover, this process has to be repeated whenever a new type of confidential data appears.
Figure 4: DLPS challenges
4. Social Network Challenge- It is difficult to capture heterogeneous communication groups in which people belong to more than one group, even more so when new communication groups form and old ones disappear. In this situation, it is difficult to reveal a person who leaks the data (an outsider) in a communication or to detect persons having limited access to data.
4. CURRENT APPROACHES FOR DLP
This section categorizes current approaches for Data Leakage Prevention as represented in Fig. 5 and identifies their main benefits and shortcomings:
A Learning and specification based system for Data leakage Prevention- This hybrid model combines signature-based and anomaly-based solutions and supports both detection and prevention. Two main dimensions are used to characterize the model: i) the filtering approach, which describes whether users are permitted or not, and ii) model construction, which describes how a model is constructed. In filtering, a blacklist is used for well-known threats or undesired behaviours, while a whitelist identifies the permissible activities; under whitelisting, only transactions that match the model are considered legitimate. Two main approaches are used to build the model: i) the specification approach and ii) the learning approach.
• Specification-approach: This approach relies on an expert’s knowledge of the transactions, which leads to very accurate models. For instance, specification-based blacklisting systems, also known as signature-based systems, find known attacks, while a specification-based whitelisting system can detect unknown attacks.
• Learning-approach: This approach automatically learns a model of behaviour using techniques such as machine learning and statistical modelling. Shortcomings: Learned models are less accurate than manually specified ones and are consequently prone to a high false positive rate. To check whether a transaction is legitimate, a large number of alerts must be analyzed by a human operator, which leads to high operational cost.
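The interplay of a learned whitelist, an operator-maintained blacklist and alerts for unseen transactions can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the framework from the literature: the `Transaction` fields, the `HybridMonitor` class and the exact-match notion of "normal" behaviour are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    """Hypothetical transaction features (assumption for illustration)."""
    user: str
    table: str
    action: str


@dataclass
class HybridMonitor:
    normal: set = field(default_factory=set)    # learned whitelist (detection model)
    blocked: set = field(default_factory=set)   # operator-specified blacklist (prevention rules)

    def learn(self, history):
        """Learning phase: treat every transaction seen in training as normal."""
        self.normal.update(history)

    def check(self, tx):
        """Block known-bad, allow known-good, raise an alert for anything unseen."""
        if tx in self.blocked:
            return "block"
        if tx in self.normal:
            return "allow"
        return "alert"

    def feedback(self, tx, malicious):
        """Operator verdict on an alert: extend the blacklist or the whitelist."""
        (self.blocked if malicious else self.normal).add(tx)
```

Each operator verdict removes one source of future alerts, which is how a hybrid design aims to reduce the operational cost of a purely learning-based detector.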
Figure 5: Approaches for Data Leakage Prevention
Secure Key Stream Analyzer for Data Leakage Prevention- Many data leakage prevention solutions depend on scanning file content, which requires parsing different file formats, so the risk of data leakage still exists for unsupported file formats. To address this, the approach proposes a new DLP model named Secure Key Stream Analyzer (SKA), which profiles key strokes instead of parsing file content.
Shortcomings: Some issues with the keyboard API remain to be solved. If a user uses the mouse instead of the keyboard to modify text, for example by copying text or picking information from auto-filled forms, SKA does not work. It only detects text typed linearly.
A Result based Approach for Data Leakage Prevention- This approach considers an information flow between one origin and many destinations (receivers). The Partially Observable Markov Decision Process (POMDP) method is used over fixed periods called decision epochs where:
Shortcomings: As the ratio of leaked packets increases, the tolerance at the origin side increases, which affects the expected incentive of its optimal strategy. The POMDP also requires a huge amount of computation and suffers from scalability limitations.
There is a need for a DLP solution that allows secure sharing of confidential information in companies.
A Turkish Language Based Data Leakage Prevention System- This approach proposes a data leakage prevention system for the Turkish language consisting of two phases: i) a training phase and ii) a detection phase. Two algorithms are used: the Boyer-Moore (BM) algorithm is used to search for exact sensitive strings exposed to whitespace attacks, and the Smith-Waterman (SW) sequence alignment algorithm is used to detect modified-string attacks.
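The limitation to linearly typed text can be made concrete with a small sketch that rebuilds text from a stream of key events. This is not the SKA implementation: the event names (`"BACKSPACE"`, `"ENTER"`), the function names and the keyword check are assumptions for illustration only.

```python
def reconstruct(keys):
    """Rebuild typed text from key events, assuming strictly linear typing."""
    buf = []
    for k in keys:
        if k == "BACKSPACE":
            if buf:
                buf.pop()          # undo the most recently typed character
        elif k == "ENTER":
            buf.append("\n")
        elif len(k) == 1:
            buf.append(k)          # ordinary printable key
        # Other control keys (SHIFT, CTRL, arrows, ...) are ignored here.
    return "".join(buf)


def contains_sensitive(keys, keywords):
    """Return the keywords that appear in the reconstructed text."""
    text = reconstruct(keys).lower()
    return [w for w in keywords if w.lower() in text]
```

Because the buffer is only ever changed by key events, text inserted with the mouse (paste, auto-fill) or edited after moving the cursor never reaches it, which mirrors the shortcoming described above.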
The TF-IDF method is used to extract the sensitive words of sensitive documents. Latent Semantic Indexing (LSI) is used to model document topics. The approach uses the Zemberek tool for extracting and analyzing Turkish text.
Shortcomings: The system was designed around attacks such as adding, deleting and changing characters in a ‘sensitive’ word, deleting white spaces from both sides of a ‘sensitive’ word and adding white space to the middle of a ‘sensitive’ word. Such a tool is required not only for Turkish/English, but also for other languages.
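To show how sequence alignment can catch a modified sensitive string, the following sketch computes a Smith-Waterman local alignment score in Python. Only the alignment part is covered; the Boyer-Moore search, the scoring values (match 2, mismatch -1, gap -1), the 0.8 threshold and the helper name `looks_like` are assumptions for illustration, not parameters from the cited system.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def looks_like(sensitive, text, threshold=0.8, match=2):
    """Flag `text` if some region aligns with `sensitive` closely enough."""
    score = smith_waterman(sensitive.lower(), text.lower(), match=match)
    return score >= threshold * match * len(sensitive)
```

With these assumed scores, a perfect occurrence of an 8-character word scores 16, so a single inserted space or a single changed character still clears the 0.8 threshold, while unrelated text does not.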
5. DATA LEAKAGE PROTECTION TECHNIQUES
Data protection for the various data states is represented in Fig. 6, and Fig. 7 shows the activities performed by a DLPS to protect data in these states.
Figure 6: Data leakage protection for different data states.
Figure 7: DLPS activities
Safety measures for Data-at-Rest: To protect against data leakage, content discovery solutions are required. They help to detect sensitive data residing in separate locations by scanning laptops, FTP servers, SMTP servers and databases. Techniques for content discovery are as follows:
Disadvantage: Agents on the target system have low processing power and little memory.
Disadvantage: Scanning performed from a remote computer results in increased network traffic and low performance.
Safety measures for Data-in-Motion: Network-based solutions are deployed on the company’s gateway. The gateway computer searches for sensitive content and immediately blocks malicious activities that violate the policy. These solutions capture the full data and perform content analysis in real time.
Safety measures for Data-in-Use: Local agents on host machines regularly monitor operations on sensitive data, such as copying data from one location and pasting it into another, print-screen captures, unauthorized data transmission and copying data to USB/CD/DVD media.
A DLP solution helps organizations to control sensitive data, but it also has some significant limitations.
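A content discovery scan for data at rest can be sketched as a directory walk that applies a sensitivity pattern to each file. This is a minimal local example under stated assumptions: the `SENSITIVE` pattern, the `scan_tree` name and the character limit are hypothetical, and only plain-text files are handled.

```python
import os
import re

# Hypothetical sensitivity pattern (assumption for illustration).
SENSITIVE = re.compile(r"\b(?:confidential|internal use only)\b", re.IGNORECASE)


def scan_tree(root, pattern=SENSITIVE, max_chars=1_000_000):
    """Return sorted paths under `root` whose text content matches `pattern`."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                # Read at most `max_chars` characters to bound memory use.
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    text = fh.read(max_chars)
            except OSError:
                continue            # unreadable file: skip rather than abort the scan
            if pattern.search(text):
                hits.append(path)
    return sorted(hits)
```

Running such a scan on the machine that stores the data consumes that machine's CPU and memory, whereas running it from a remote computer means every file must cross the network, which corresponds to the two disadvantages listed above.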
7. FUTURE ANALYSIS FOR DLPS
In future, the following activities will be pursued to prevent a company’s data from leakage.
These are the major factors that contribute to the growing data leakage prevention market. DLP solutions focus on helping organizations meet regulatory and compliance requirements and protect data stored in public and private clouds.
8. CONCLUSIONS AND FUTURE DIRECTIONS
In this paper, we discussed the challenges in DLPS and current approaches for data leakage prevention. We described how a company’s confidential information can be protected from unauthorized access. We explained various techniques such as the learning and specification based system, the secure key stream analyzer and the result based approach for data leakage prevention, but there are still various ways in which a company’s data can leak. Data leakage happens through social media, cybercrime and insider attacks. All these factors have a great impact on a company’s reputation. Companies need to know which data are important to their business, where they are located and how they are sent to the outside network. Companies should enforce policies, rules and regulations to protect their data from unauthorized access.
A Data Leakage Prevention System is a solution for these problems that helps to discover, monitor and protect the company’s important data. Some challenges still need to be solved. Cluster analysis algorithms can group data into clusters for further analysis, which will help to cope with the access control challenge and the social network challenge.
Hence, there is a need for research that takes a balanced approach to data leakage in cloud computing and involves not only end-users but also cloud providers and cloud customers.
 J. Kim and H. J. Kim, “A Study on Privacy Preserving Data Leakage Prevention System,” Recent Progress in Data Engineering and Internet Technology, 2012, pp. 191-196.
 M. Samanta, P. Pal and A. Mukherjee, “Prevention of information leakage by modulating the trust uncertainty in Ego-Network,” 2017 9th International Conference on Communication Systems and Networks (COMSNETS), Bengaluru, 2017, pp. 377-378.
 I. Gupta and A. K. Singh, “Data Leakage Detection in Cloud Environment,” technical report, NIT Kurukshetra, India, 2017.
 H. Li, Z. Peng, X. Feng and H. Ma, “Leakage Prevention Method for Unstructured Data Based on Classification,” in International Conference on Applications and Techniques in Information Security, Springer, Berlin, Heidelberg, vol. 557, pp. 337-343, Nov. 2015.
 C.-J. Chae, Y. J. Shin, K. Choi, K.-B. Kim, and K.-N. Choi, “A privacy data leakage prevention method in P2P networks,” Peer-to-Peer Networking and Applications, vol. 9, no. 3, 2016, pp. 508-519.
 H. Taneja, Kapil and A. K. Singh, “Preserving Privacy of Patients based on Re-identification Risk,” Fourth International Conference on Eco-friendly Computing and Communication Systems (ICECCS), 2015, pp. 448-454.
 K. W. Kongsgård, N. A. Nordbotten, F. Mancini, and P. E. Engelstad, “Data loss prevention based on text classification in controlled environments,” in Information Systems Security, pp. 131-150, Springer International Publishing, Dec. 2016.
 Y. Wu, “A practical approach for risk assessment of data leakage prevention in telecommunication industry,” in E-Business and E-Government (ICEE), 2010 International Conference on, pp. 3552-3555, IEEE, 2010.
 J. Kumar and A. K. Singh, “Dynamic resource scaling in cloud using neural network and black hole algorithm,” 2016 Fifth International Conference on Eco-friendly Computing and Communication Systems (ICECCS), Bhopal, 2016, pp. 63-67.
 A. Harel, A. Shabtai, L. Rokach and Y. Elovici, “Dynamic Sensitivity-Based Access Control,” Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics, Beijing, 2011, pp. 201-203.
 T. Takebayashi, H. Tsuda, T. Hasebe, and R. Masuoka, “Data loss prevention technologies,” Fujitsu Scientific and Technical Journal, vol. 46, no. 1, 2010, pp. 47-55.
 S. Alneyadi, E. Sithirasenan and V. Muthukkumarasamy, “Detecting Data Semantic: A Data Leakage Prevention Approach,” 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, 2015, pp. 910-917.
 S.Verma and A. Singh, “Data theft prevention & endpoint protection from unauthorized USB devices — Implementation,” 2012 Fourth International Conference on Advanced Computing (ICoAC), Chennai, 2012, pp. 1-4.
 Ernst & Young, “Data loss prevention: Keeping your sensitive data out of the public domain,” Insights on governance, risk and compliance, October 2011.
 J. M. Gomez-Hidalgo, J. M. Martin-Abreu, J. Nieves, I. Santos, F. Brezo and P. G. Bringas, “Data Leak Prevention through Named Entity Recognition,” 2010 IEEE Second International Conference on Social Computing, Minneapolis, MN, 2010, pp. 1129-1134.
 I. Gupta and A. K. Singh, “Privacy and Security Architecture for Cloud Data,” technical report, NIT Kurukshetra, India, 2016.
 S. Chhabra and A. K. Singh, “Dynamic data leakage detection model based approach for MapReduce computational security in cloud,” 2016 Fifth International Conference on Eco-friendly Computing and Communication Systems (ICECCS), Bhopal, 2016, pp. 13-19.
 S. Alneyadi, E. Sithirasenan and V. Muthukkumarasamy, “Detecting Data Semantic: A Data Leakage Prevention Approach,” 2015 IEEE Trustcom/BigDataSE/ISPA, Helsinki, 2015, pp. 910-917.
 S. Alneyadi, E. Sithirasenan and V. Muthukkumarasamy, “Discovery of potential data leaks in email communications,” 2016 10th International Conference on Signal Processing and Communication Systems (ICSPCS), Gold Coast, QLD, 2016, pp. 1-10.
 B. M. Babu and M. S. Bhanu, “Prevention of Insider Attacks by Integrating Behavior Analysis with Risk based Access Control Model to Protect Cloud,” Procedia Computer Science, vol. 54, pp. 157-166, 2015.
 D. Kolevski and K. Michael, “Cloud computing data breaches a socio-technical review of literature,” 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), Noida, 2015, pp. 1486-1495.
 S. Mathew and M. Petropoulos, “A data-centric approach to insider attack detection in database systems,” in Recent Advances in Intrusion Detection, ser. LNCS 6307, Springer, pp. 382-401, 2010.
 Frost and Sullivan, “World Data Leakage Prevention Market,” Technical Report ND34D-74, United States, 2008.
 B. Hauer, “Data and Information Leakage Prevention Within the Scope of Information Security,” in IEEE Access, vol. 3, pp. 2554-2565, 2015.
 R. Tahboub and Y. Saleh, “Data Leakage/Loss Prevention Systems (DLP),” 2014 World Congress on Computer Applications and Information Systems (WCCAIS), Hammamet, 2014, pp. 1-6.
 P. Raman, H. G. Kayacık, and A. Somayaji, “Understanding Data Leak Prevention,” in 6th Annual Symposium on Information Assurance (ASIA’11), pp. 27, 2011.
 J. S. Wu, Y. J. Lee, S. K. Chong, C. T. Lin and J. L. Hsu, “Key Stroke Profiling for Data Loss Prevention,” 2013 Conference on Technologies and Applications of Artificial Intelligence, Taipei, 2013, pp. 7-12.
 E. Costante, D. Fauri, S. Etalle, J. D. Hartog and N. Zannone, “A Hybrid Framework for Data Loss Prevention and Detection,” 2016 IEEE Security and Privacy Workshops (SPW), San Jose, CA, 2016, pp. 324-333.
 S. Alneyadi, E. Sithirasenan and V. Muthukkumarasamy, “A survey on data leakage prevention systems,” Journal of Network and Computer Applications, vol. 62, pp. 137-152, February 2016.
 DLP Technologies, Challenges and Future Directions 268462340_ [accessed Jun 23, 2017].
 A. Shabtai, Y. Elovici and L. Rokach, “A survey of data leakage detection and prevention solutions,” ser. Springer Briefs in Computer Science, Springer-Verlag, New York, 2012.
 K. Revett, F. Gorunescu, M. Gorunescu, M. Ene, S. T. de Magalhães and H. M. D. Santos, “A machine learning approach to keystroke dynamics based user authentication,” Int. J. Electronic Security and Digital Forensics, vol. 1, no. 1, pp. 55-70, 2007.
 J. Marecki, M. Srivatsa and P. Varakantham, “A Decision Theoretic Approach to Data Leakage Prevention,” 2010 IEEE Second International Conference on Social Computing, Minneapolis, MN, 2010, pp. 776-784.
 M. Srivatsa, P. Rohatgi, S. Balfe and S. Reidt, “Securing information flows: A metadata framework,” in Proceedings of 1st IEEE Workshop on Quality of Information for Sensor Networks (QoISN), 2008.
 Y. Jeong, M. Lee, D. Nam, J.-S. Kim, and S. Hwang, “High performance parallelization of Boyer-Moore algorithm on many-core accelerators,” Cluster Computing, vol. 18, pp. 1087-1098, 2015.
 Y. Canbay, H. Yazici and S. Sagiroglu, “A Turkish language based data leakage prevention system,” 2017 5th International Symposium on Digital Forensic and Security (ISDFS), Tirgu Mures, 2017, pp. 1-6.
 Y. Liu, C. Corbett, K. Chiang, R. Archibald, B. Mukherjee, and D. Ghosal, “Detecting sensitive data exfiltration by an insider attack,” in Proceedings of the 4th annual workshop on Cyber security and information intelligence research: developing strategies to meet the cyber security and information intelligence challenges ahead, pp. 16, 2008.
 S. Liu and R. Kuhn, “Data Loss Prevention,” in IT Professional, vol. 12, no. 2, pp. 10-13, March-April 2010.
 G. Lawton, “New Technology Prevents Data Leakage,” in Computer, vol. 41, no. 9, pp. 14-17, Sept. 2008.
 “Data leak prevention,” Information Systems Audit and Control Association, Technical Report, 2010.
 K. Kaur, I. Gupta and A. K. Singh, “Data Leakage Prevention: E-Mail Protection via Gateway,” 10th International Conference on Computer and Electrical Engineering, Canada, 2017.