Project Name: Interreg (DecRIPT) Project – Detecting Various Representations of Information to Identify Personal Data Contained in Texts
Identifying personal data in texts, along with other data that carry relevant information for users, is a non-trivial and highly useful task today, as the GDPR guidelines show. Protecting the data contained in text documents, personal data in particular, is a major concern for companies. Ensuring data security has become a prerequisite for collecting and using data, including texts that may contain personal data.
The European Union’s “General Data Protection Regulation” (GDPR) stipulates that a company that processes (collects, stores and/or uses) personal data of European nationals must be able to prove, at any time, that the personal data it holds (IBANs, telephone numbers, various identifiers, etc.) were collected with the prior consent of the persons concerned. Processing must comply with the principles of the GDPR and the rights it grants to individuals; in particular, the data must be protected against any breach (theft, copying, erasure, modification) for as long as they are stored.
The GDPR harmonises the regulations on the protection of personal data in the EU. It concerns all organisations that collect or hold personal data on European citizens, and imposes new obligations in terms of data processing, information security and transparency between companies and data subjects. This change in regulation means that companies must adapt their practices regarding the transmission of documents and data to service providers, the collection and storage of customer notices, and the aggregation and analysis of private data. For example, banks and credit card companies analyse transactions and expenditures to prevent fraud and identity theft.
To meet these new needs for identifying personal data, a new profession is emerging, the “Data Protection Officer”, whose mission is to bring companies into compliance with the legislation, and in particular to delete or mask/obfuscate the data of persons identified in documents. The problem is that these operations are costly and can be very time-consuming. The idea is to have them performed automatically by a computer. To govern, delete or hide/obfuscate people’s data automatically, three steps are needed: first, tell the machine how these data are represented in documents; second, have the machine find them; and finally, govern, delete or hide/obfuscate them. The hardest problem is finding the data automatically. This retrieval problem is not trivial: some data can refer to a person or a place without naming them, without using expressions that refer to them explicitly, and without tangible, concrete clues. It is not easy even to distinguish a name from an abbreviation or acronym with which it could be confused (LiSe, Linguistics and Security), or to identify a synonym, a homonym, etc.
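The three steps described above (declare how the data are represented, find them, then mask them) can be sketched with plain pattern matching. This is only an illustrative baseline, not the project's semantic metamodel: the two regular expressions, the `Finding` class and the function names are hypothetical and deliberately simplified, and the sample sentence is invented.

```python
import re
from dataclasses import dataclass

# Step 1: declare how each kind of data is represented.
# Both patterns are simplified assumptions made for this sketch.
PATTERNS = {
    # Country code, two check digits, then groups of four characters.
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
    # International numbers only: "+", a prefix, then groups of digits.
    "PHONE": re.compile(r"(?<![\w+])\+\d{1,3}(?:[ .-]?\d{1,4}){2,6}(?!\w)"),
}

@dataclass(frozen=True)
class Finding:
    kind: str   # which pattern matched
    start: int  # offset of the match in the original text
    end: int
    text: str   # the matched string, kept for traceability

def find_personal_data(text):
    """Step 2: locate every declared representation in the text."""
    candidates = [
        Finding(kind, m.start(), m.end(), m.group())
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Keep the leftmost, longest match and drop anything overlapping it.
    candidates.sort(key=lambda f: (f.start, -(f.end - f.start)))
    kept, last_end = [], 0
    for f in candidates:
        if f.start >= last_end:
            kept.append(f)
            last_end = f.end
    return kept

def mask(text, findings):
    """Step 3: replace each finding with a placeholder naming its kind."""
    parts, cursor = [], 0
    for f in findings:
        parts.append(text[cursor:f.start])
        parts.append(f"[{f.kind}]")
        cursor = f.end
    parts.append(text[cursor:])
    return "".join(parts)

sample = "Contact Marie at +33 6 12 34 56 78, IBAN FR76 3000 6000 0112 3456 7890 189."
findings = find_personal_data(sample)
masked = mask(sample, findings)
print(masked)  # Contact Marie at [PHONE], IBAN [IBAN].
```

The name “Marie” survives the masking: surface patterns only catch data with a regular written form, which is exactly why names, indirect references, synonyms and homonyms call for a semantic approach.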
Semantic metamodel and its computing kernel:
To carry out the project, an Artificial Intelligence model, a semantic-based metamodel, will be developed to identify personal data or data that have a particular value depending on the domain and the user. This semantic-based metamodel will allow us to propose an algorithmic grammar which, together with its computing kernel for executing algorithms, will automatically and traceably identify the meaning of parts of speech and of compound textual expressions. The metamodel will be used to identify various types of personal data, represented in various ways, so that they can be governed and/or obfuscated, making it possible to transmit texts to the company’s partners or service providers. This will provide a reliable basis for pilot applications.
A market study will be carried out during the project to take stock of the demand in our cross-border region. Our industrial partners have already reported the needs of their client companies in the area of personal data retrieval and have already received numerous requests.
The objective of the project is to automatically process the semantics of natural language texts in order to identify people’s data for their governance, security and use in the field of foresight. The main problem is to find out how these data are represented in natural language texts, in order to propose a semantic model and to provide tools that locate and process (govern, delete, mask/obfuscate) them automatically.