University and Industry: Talent Is Out There (III)
Our supervision of students in different areas of cybersecurity continues to bear fruit. Research, development, and innovation efforts made jointly with students always produce better results than expected, as mutual benefits arise from each collaboration.
On the one hand, the student learns from the tutor's expertise, which steers their work towards a market reality that would otherwise be out of reach. On the other hand, the tutor who follows the project's progress benefits from the student's support to improve their own skills, since an academic project such as a Bachelor's or Master's Degree project motivates reskilling and upskilling.
This time we present two projects from the 3rd edition of the Master in Cybersecurity of UCAM, run in collaboration with Telefónica. They are very different projects that show the wide variety of disciplines coexisting in cybersecurity.
The first, by Javier García Cambronel, is a proposal for an educational space called Ciberaprende, which emerged as a Master's Degree project and has become a fully operational educational platform for free training in digital skills. The second, developed by Santiago Vallés, is a Software of Detection and Classification of Private Content. The authors describe their own projects below.
Ciberaprende was conceived as a project designed to last. It is a virtual space whose content can be accessed for free, and it grew out of the idea of creating opportunities for a better life through training in digital skills. The space is made up of two clearly distinct parts.
The first is a free learning platform based on Moodle, with a website developed in WordPress as its front page. This part covered the installation, configuration and hardening of the server that hosts both the platform and the website. The same was done for the website and the platform themselves, with a strong focus on hardening, plus other aspects such as design, positioning, performance and accessibility.
The second part of the project deals with content. The course developed is "Computer Security: Malware" (in Spanish, Seguridad Informática: Malware), a low-difficulty course lasting 50 hours. Its objectives are:
- Identify and analyse the existing risks in the field of information security.
- Learn the main types of malware.
- Understand the threat posed by malware in all its variants.
- Discover the consequences of an infection.
- Learn how to use the available protection methods.
Within the course, you will find a great deal of content, structured in 5 modules:
- Module 1: Introduction to security in information systems.
- Module 2: Introduction to malware.
- Module 3: Malware, a real and current threat.
- Module 4: Security strategies against threats.
- Module 5: Security tools.
The contents include:
- Over 80 pages of theory.
- More than 25 learning pills.
- Over 100 questions.
- More than 10 interactive activities.
- More than 5 case studies.
- 1 satisfaction survey.
- 1 certificate issued at the end of the course.
A lot of effort and commitment has been invested in Ciberaprende. Developing it meant acquiring many skills and learning many lessons, which led not only to additional cybersecurity knowledge but also to a better way of disseminating it.
Among the most technical were hardening a Linux server with different types of tools, a WordPress website and a Moodle-based platform. On top of this came the challenge of developing an interactive course that combines engaging theory with different types of activities and videos to motivate the student.
Ciberaprende is now a functional platform, and the plan is to keep improving it at every step. New content covering other digital skills will be added, along with new types of activities and games that allow skills to be acquired through gamification. It will also be possible to add multimedia content created specifically for each course and its subject matter.
Software of Detection and Classification of Private Content
Companies own, manage, and offer numerous services that process their own information, in many cases automatically. On other occasions, documents are transferred manually, by sending, receiving, and checking various contents.
Frequently, this information ends up on a web portal, a public repository or some other place used to publish content. It is therefore possible to crawl these public sites, for example with Scrapy, and download all public documents.
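As a rough illustration of the filtering step a crawler performs, the sketch below picks out links to documents from a page's HTML. It deliberately uses only the Python standard library instead of Scrapy, so it is a stand-in and not the project's crawler. The names `DocumentLinkParser`, `extract_document_links` and the list of extensions are assumptions made for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of file types worth downloading; adjust as needed.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".txt")


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that point to documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.document_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(DOCUMENT_EXTENSIONS):
            self.document_urls.append(absolute)


def extract_document_links(html, base_url):
    """Return the document URLs found in one HTML page."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.document_urls
```

A full crawler such as Scrapy would additionally follow links between pages, respect crawl limits and download the files; this sketch only shows how candidate documents could be identified on a single page.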
However, how do we know whether sensitive or confidential company or employee information is being leaked? Is there a single tool that addresses this problem? We can use free software to try to solve it, relying on two technologies:
- Regular Expressions
- Machine Learning
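To make the regular expression side concrete, here is a minimal sketch of static detection for two data types. The patterns, the `PATTERNS` table and the function names are assumptions made for illustration, not the rules used in the project. Card number candidates are additionally filtered with the Luhn checksum to discard random digit runs.

```python
import re

# Hypothetical patterns for illustration; a real deployment would need
# locale-specific rules (national ID formats, phone numbers, etc.).
PATTERNS = {
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # 13 to 19 digits, optionally separated by single spaces or hyphens.
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}


def luhn_valid(number):
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_private_data(text):
    """Return a dict mapping each data type to the list of matches in text."""
    findings = {}
    for data_type, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if data_type == "CREDIT_CARD":
            matches = [m for m in matches if luhn_valid(m)]
        findings[data_type] = matches
    return findings
```

For instance, calling `find_private_data` on a text that contains two email addresses and the well-known test number 4111 1111 1111 1111 would report both addresses under `EMAIL` and the card number under `CREDIT_CARD`.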
The idea behind this project is to generate an indicator that we will call "File Risk", a score that represents the amount of private information being leaked in a document. Let's look at some examples:
- Downloading a file containing a single email address is not the same as finding a list of 200 company email addresses. So one factor that affects risk is the number of occurrences of a given type of personal data.
- Now suppose the data type we are looking for is a number that could match a credit card. A single finding is enough to consider the risk high. We therefore propose that the user assign a numerical value to each type of data. We will call this factor "Impact", and it is configurable by the user.
We take as an example a PDF medical report and upload it to the system. After the analysis, we obtain the following table of results:
You can see that the system found several occurrences of the data type PERSON, which refers to people's names. In addition, a bar chart makes it quicker to check each of the risks by type of data:
Our system then computes the file risk as a simple weighted count: for each data type n, the number of findings is multiplied by the impact assigned to that type, and the products are summed.
$$\text{File risk} = \sum_{n} \text{Impact}_n \times \text{no. of findings}_n$$
The system will show the result as "Total File Risk" with a numerical value that corresponds to the previous sum. This value can be used to filter and order hundreds or thousands of files from a website in order to focus on those that have a greater chance of containing private data.
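The calculation and the ranking it enables can be sketched in a few lines. The impact values and file names below are invented for illustration, and `file_risk` and `rank_files` are hypothetical helper names; the only part taken from the text is the rule itself, the sum of impact times number of findings per data type.

```python
# Hypothetical impact values; in the described system the user configures them.
IMPACT = {"EMAIL": 1, "PERSON": 2, "CREDIT_CARD": 50}


def file_risk(findings, impact):
    """Sum of impact * number of findings over all data types.

    Data types without a configured impact contribute 0 (an assumption).
    """
    return sum(impact.get(data_type, 0) * count
               for data_type, count in findings.items())


def rank_files(findings_by_file, impact):
    """Return (filename, risk) pairs sorted from highest to lowest risk."""
    scored = [(name, file_risk(findings, impact))
              for name, findings in findings_by_file.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up findings for three files, as counts per data type.
example = {
    "newsletter.pdf": {"EMAIL": 1},
    "staff_list.pdf": {"EMAIL": 200},
    "invoice.pdf": {"CREDIT_CARD": 1, "PERSON": 3},
}
ranking = rank_files(example, IMPACT)
```

With these assumed impacts, the list of 200 addresses scores 200, the single card number plus three names scores 56, and the lone email address scores 1, which matches the intuition behind the two examples given earlier.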
This system therefore combines regular expressions, a static approach that covers the most common data types (ID, credit card number, etc.), with natural language processing, which detects other kinds of words by comparing them against a trained machine learning model (using spaCy and scikit-learn).
It also makes it possible to determine whether the set of words in a document places it within a certain category. After analysing the document, the system generates output of the following type:
In this example, if we combine the category predicted by the model (medicine) with the names that our system found in the word analysis, there is no doubt that the file is a candidate for review. We are aware that machine learning models can be improved, but in this case we can verify that, taking the prediction with the highest percentage, the model gets it right and indicates a file with medical content.
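One possible way to express this reasoning in code is shown below. It is only a sketch of a decision rule, not the project's logic: the set of sensitive categories, the confidence threshold, the function names and the sample probabilities are all assumptions made for this example.

```python
# Hypothetical settings, chosen only for illustration.
SENSITIVE_CATEGORIES = {"medicine", "finance", "legal"}
MIN_CONFIDENCE = 0.5


def top_category(probabilities):
    """Return the (category, probability) pair with the highest probability."""
    return max(probabilities.items(), key=lambda item: item[1])


def is_review_candidate(probabilities, findings):
    """Flag a file when a sensitive category is predicted with enough
    confidence and at least one piece of personal data was found."""
    category, confidence = top_category(probabilities)
    has_personal_data = any(count > 0 for count in findings.values())
    return (
        category in SENSITIVE_CATEGORIES
        and confidence >= MIN_CONFIDENCE
        and has_personal_data
    )


# Made-up classifier output and findings for a medical report.
predicted = {"medicine": 0.82, "sports": 0.11, "politics": 0.07}
found = {"PERSON": 4}
flagged = is_review_candidate(predicted, found)
```

Requiring both signals keeps a document with a sensitive topic but no personal data, or names in a harmless context, from being flagged on a single indicator.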
Thanks to this work we were able to verify the benefits of text classification methods using natural language processing and trained machine learning models. However, in a numerical scoring system such as the one implemented here, false positives must be taken into account, since they can inflate the risk value of a file.