A background check is a process that a person or company uses to verify that someone is who they claim to be. It provides an opportunity to review a person’s criminal record, education, employment history, and other past activities in order to confirm their validity.
In the majority of cases where background screening is needed, the speed of the process plays a crucial role. When done manually or with outdated tools, checking a candidate’s background can take days or weeks, especially when many candidates are involved and a thorough check is required.
There are many background screening services on the market, but none of them satisfied all functional requirements for this project. We were tasked with building a new solution making use of Machine Learning and other advanced technologies to make the process faster and more accurate.
The main goal of this project was to implement a scalable SaaS background check solution that covers all relevant data sources: social media (Facebook, Instagram, etc.), portals (Google, Bloomberg), and more “official” types of data, including criminal or civil records provided by courts or other government institutions. All of this information needs to be analyzed and aggregated to provide a comprehensive background check of a person or a business.
Services that provide this type of data can be public or paid, and they deliver information in a myriad of formats: PDF, Word documents, JSON, HTML, etc. This mixture of unstructured, semi-structured, and structured data makes it very difficult to collect, clean, parse, analyze, rate, and categorize information, and a lot of manual work is required to make sense of it.
Clients need comprehensive and complex reports on their subjects, and the existing products and services required a lot of copying and pasting between various data sources to produce deliverable documents. This process had to be automated using Natural Language Understanding techniques and Robotic Process Automation tools.
An interactive and intuitive UI was developed to enable investigators to quickly comprehend the status of each subject and perform their day-to-day tasks in the most efficient manner.
We implemented a complete multi-tenant background screening SaaS system that can handle hundreds of candidates per day. It automates labor-intensive tasks common to standard workflows, resulting in a faster and more accurate process.
The coverage of the dataset is much wider than it was before. Any suspicious activity now comes into focus more readily, which helps an organization assess all relevant risks involved in the hiring process.
Our solution was developed using the .NET Core stack.
The solution is designed around a microservices architecture and targets cloud environments. It consists of multiple services, including:
The system uses different information providers to collect data. Providers are categorized into different groups:
Derogatory News Search
Our background check system is meant to be deployed in a cloud environment (Azure, Amazon...) and supports both vertical and horizontal scalability. It uses cloud database and cloud storage solutions. Data at rest is secured on all levels.
Access to the system is restricted by IP address and user credentials. The system supports multi-factor authentication.
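As a rough illustration of the access rule described above, the following Python sketch combines an IP allowlist with credential and multi-factor checks. The network ranges, the function names and the way the password and MFA results are passed in are all assumptions made for the example; the source does not describe how the real system stores or evaluates these rules.

```python
import ipaddress

# Hypothetical allowlist (reserved documentation ranges, not real client networks).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def ip_allowed(client_ip: str) -> bool:
    """Return True if client_ip parses and falls inside an allowed network."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are rejected rather than raising.
        return False
    return any(address in network for network in ALLOWED_NETWORKS)


def may_sign_in(client_ip: str, password_ok: bool, mfa_ok: bool) -> bool:
    """All three checks must pass: source IP, credentials, second factor."""
    return ip_allowed(client_ip) and password_ok and mfa_ok
```

Checking the IP address first means that requests from unknown networks can be rejected before any credential handling takes place.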
AWS Lambda (or an alternative cloud service) can be used to execute service operations. This is useful in asynchronous scenarios where providers don't return a response immediately. Scheduled jobs perform status checks at specified intervals. That way, we use resources only when we need them instead of keeping dedicated resources idle.
Communication between service instances is done through the MQTT protocol, with RabbitMQ used as the message broker.
To ensure easy maintenance, each service instance can be configured for a specific service provider. Multiple proxies are supported per instance to avoid usage throttling and IP restrictions imposed by source sites. A lot of work was invested in optimizing memory usage and proxy scheduling.
We have developed advanced techniques for handling CAPTCHA challenges while making sure that none of the target data sources are subject to overload or any other kind of unethical behavior. Multiple provider endpoints can be accessed in parallel via proxies, resulting in very fast data collection times. Operations that take hours with standard web crawlers are compressed to minutes. Advanced system configuration allows administrators to set the number of parallel tasks per domain and per proxy and to limit the total number of tasks, all depending on the hardware configuration and other infrastructure used.
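One way to picture the per-domain, per-proxy and total task limits is with nested semaphores. The Python sketch below is only an illustration under assumed names (TaskLimiter, fake_fetch, the proxy labels and the limit values are all hypothetical); it is not the scheduler used in the actual system, which is not described in detail here.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


class TaskLimiter:
    """Caps concurrent tasks per domain, per proxy, and in total."""

    def __init__(self, total: int, per_domain: int, per_proxy: int):
        self._total = asyncio.Semaphore(total)
        self._per_domain = per_domain
        self._per_proxy = per_proxy
        self._domains = {}
        self._proxies = {}

    @staticmethod
    def _semaphore(table, key, limit):
        # Lazily create one semaphore per domain or proxy.
        if key not in table:
            table[key] = asyncio.Semaphore(limit)
        return table[key]

    async def run(self, url, proxy, fetch):
        domain = urlsplit(url).hostname or ""
        # Always acquire in the same order (domain, proxy, total) so that
        # tasks cannot deadlock waiting on each other.
        async with self._semaphore(self._domains, domain, self._per_domain):
            async with self._semaphore(self._proxies, proxy, self._per_proxy):
                async with self._total:
                    return await fetch(url, proxy)


async def demo():
    # Hypothetical limits; real values would depend on hardware and providers.
    limiter = TaskLimiter(total=4, per_domain=2, per_proxy=3)
    active = defaultdict(int)
    peak = defaultdict(int)

    async def fake_fetch(url, proxy):
        # Stand-in for a real download; records peak concurrency per domain.
        domain = urlsplit(url).hostname
        active[domain] += 1
        peak[domain] = max(peak[domain], active[domain])
        await asyncio.sleep(0.01)
        active[domain] -= 1
        return url

    urls = [f"https://courts.example/{i}" for i in range(6)]
    proxies = ["proxy-a", "proxy-b"]
    jobs = [limiter.run(u, proxies[i % 2], fake_fetch) for i, u in enumerate(urls)]
    results = await asyncio.gather(*jobs)
    return results, dict(peak)
```

Running asyncio.run(demo()) returns the fetched URLs together with the peak concurrency observed per domain, which should never exceed the per-domain limit. Keeping the per-domain cap low is what protects a source site from overload even when many proxies are available.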
The scraping service is the main part of this solution and is responsible for collecting and rating background check information.
A background check starts with the user submitting information about the search subject. The user submits all the information they have access to: name, SSN, DOB, known addresses and aliases, etc.
The system contacts data providers from the Id Verification category and confirms the subject's identity. It tries to find a unique match for the subject of the screening and, if it does, automatically selects that subject for later searches and collects aliases, addresses, and other information that is reused in later searches.
If multiple subjects are returned and the system cannot determine the subject's identity with 100% confidence, user interaction is required.
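The decision between automatic selection and user review could look roughly like the Python sketch below. The matching rule (exactly one returned record agreeing on both SSN and DOB) is an assumption chosen for illustration, as are the Candidate and Resolution types; the source does not state how the real system decides that a match is unique.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    full_name: str
    ssn: Optional[str] = None
    dob: Optional[str] = None
    aliases: List[str] = field(default_factory=list)
    addresses: List[str] = field(default_factory=list)


@dataclass
class Resolution:
    subject: Optional[Candidate]   # set only when a unique match was found
    needs_review: bool             # True when a user has to pick the subject
    candidates: List[Candidate]    # records to show to the user, if any


def resolve_identity(query: Candidate, returned: List[Candidate]) -> Resolution:
    """Auto-select only when exactly one record matches on SSN and DOB."""
    exact = [
        c for c in returned
        if query.ssn and query.dob and c.ssn == query.ssn and c.dob == query.dob
    ]
    if len(exact) == 1:
        # Unique match: its aliases and addresses can feed later searches.
        return Resolution(subject=exact[0], needs_review=False, candidates=exact)
    # Zero or several plausible records: defer to the user.
    return Resolution(subject=None, needs_review=True, candidates=returned)
```

When a unique match is found, the aliases and addresses on the selected record are the values that later searches would reuse.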
After all available data is collected, the user can select which data (aliases, addresses, etc.) will be used in later searches.
After all searches are performed, the user can review the collected data, select content that needs to be included in the final report, or alter the automatic categorization. The system tracks all user actions and uses its previous decisions, and the user's actions on them, to make more accurate decisions in the future.
In some cases, collecting data can take a long time (more than a few minutes). This happens mostly with providers of credit and court records, because some of them prepare search results manually. In those cases, the system periodically checks for status updates and notifies the user when the status changes.
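A single pass of such a periodic status check might be sketched in Python as follows. The PendingSearch type, the status strings and the fetch_status and notify callbacks are hypothetical names introduced for the example; in practice the pass would be triggered by a scheduler, and the provider call would be a real API request.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PendingSearch:
    search_id: str
    provider: str
    status: str = "pending"


# Statuses after which no further polling is needed (assumed values).
FINAL_STATUSES = ("completed", "failed")


def run_status_check(
    pending: List[PendingSearch],
    fetch_status: Callable[[PendingSearch], str],
    notify: Callable[[PendingSearch], None],
) -> List[PendingSearch]:
    """Poll each pending search once; return the ones still unfinished."""
    still_pending = []
    for search in pending:
        new_status = fetch_status(search)
        if new_status != search.status:
            # Notify only on an actual change to avoid repeated messages.
            search.status = new_status
            notify(search)
        if search.status not in FINAL_STATUSES:
            still_pending.append(search)
    return still_pending
```

Because each pass is short and stateless apart from the list of pending searches, it suits an on-demand execution model where no dedicated worker sits idle between checks.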
Formatted Provider Responses
Some providers return formatted results. In such cases, the system can parse the complete response, extract all data automatically, and include that information in the final report. Formatted responses can contain information about aliases, addresses (current and historical), companies, business partners, court records, credit reports on various credit cards, fees, etc. A custom rating algorithm is used to determine the relevance of all formatted content.
Articles, blogs, social media...
Our background check system also collects information from sources that don't provide a formatted response: articles, blog posts, social media stories, PDF, Word, and other document types.
Instead of just collecting this unstructured content, the system uses advanced text filtering and content rating techniques to determine which content is relevant and then categorizes results by relevance.
We use a combination of natural language processing services, such as Amazon Comprehend, and our own custom solutions for named entity recognition. Named entity recognition is used to discover entities in downloaded content and calculate their relevance. This way, only relevant information is presented to users, eliminating the huge amounts of irrelevant information typically found in manual and semi-automatic searches.
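To make the idea of entity-based relevance concrete, here is a small Python sketch that scores a document from a list of already detected entities. The entity dictionaries use the Text, Type and Score fields that Amazon Comprehend returns for entity detection, but the scoring rule, the weights, the threshold and the function names are purely hypothetical and stand in for the custom rating logic, which is not described in the source.

```python
from typing import Dict, Iterable, List


def relevance(
    entities: Iterable[Dict],
    subject_names: List[str],
    known_locations: List[str],
    min_score: float = 0.8,
) -> float:
    """Return a score in [0, 1] for how likely the text concerns the subject."""
    names = {n.casefold() for n in subject_names}
    places = {p.casefold() for p in known_locations}
    person_hit = False
    place_hit = False
    for entity in entities:
        if entity["Score"] < min_score:
            continue  # ignore low-confidence detections
        text = entity["Text"].casefold()
        if entity["Type"] == "PERSON" and text in names:
            person_hit = True
        elif entity["Type"] == "LOCATION" and text in places:
            place_hit = True
    if not person_hit:
        return 0.0  # the subject (or an alias) is never mentioned
    # Assumed weighting: a known location alongside the name raises confidence.
    return 1.0 if place_hit else 0.6


def rank_documents(docs: Dict[str, List[Dict]], names, locations) -> List[str]:
    """Order document ids by descending relevance, dropping zero scores."""
    scored = {doc_id: relevance(ents, names, locations) for doc_id, ents in docs.items()}
    return sorted((d for d, s in scored.items() if s > 0), key=lambda d: -scored[d])
```

Passing the aliases and addresses collected during identity verification as subject_names and known_locations is one way the earlier steps could feed this filter.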