DAY 1:
The week commenced with downloading the preliminary design of the Resume Analyzer built by a previous intern, Charvi Dave.
An important point to note when downloading sample analyzers or creating a new one: do not do this in the common area where the example analyzers live. Otherwise, whenever a newer version of NLP++ arrives and we take the update, the files in the common area are deleted, which can lead to the loss of analyzer files (I learned this the hard way).
I executed the analyzer on the gathered dataset of resumes and spent time studying the first few passes to understand how it works and which pieces of text it gathers.
I also collected all the possible words that can be used as headers in a resume. This is essential because it lets us divide the whole resume into a series of textual zones, each of which corresponds to one part of the resume (skills, certifications, etc.).
I compiled them alongside the existing dictionary of headers.
Dictionaries are an essential part of NLP++: they assign meaning to specific words in the text, which helps us categorize or group text.
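For example, a pass that marks header words might look like the following rough sketch (the word list is just an illustrative subset of the compiled dictionary, and here the words are matched literally rather than through a knowledge-base lookup):

@NODES _ROOT

@POST
single()

@RULES
# Mark words that commonly begin a resume section.
_header <-
    _xWILD [s one match=(
        objective
        summary
        education
        skills
        experience
        projects
        certifications
        )]
    @@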
DAY 2:
Continuing the analysis of the Resume Analyzer, I made further observations and noted points to improve my approach to the project. The text zones were perfectly categorized according to their headers.
Passes can be made for analyzing small pieces of text such as email addresses and GitHub or LinkedIn links, as these follow a fixed pattern. The same approach works for any text with a fixed format and can be extended to personal blogs, publications, etc., as sketched below.
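For instance, a LinkedIn profile link always has the shape linkedin.com/in/<username>, so one literal rule suffices. This is a minimal sketch, assuming the default NLP++ tokenization that splits words and punctuation into separate tokens:

@NODES _ROOT

@POST
single()

@RULES
# linkedin.com/in/username — a fixed, literal pattern.
_linkedin <-
    linkedin
    \.
    com
    \/
    in
    \/
    _xWILD [plus match=(_xALPHA _xNUM \-)]
    @@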
I finished looking at all the passes and also referred to her presentation video for further reference.
DAY 3:
On this day, I brainstormed about different industrial applications of resumes. I worked on counting the number of skills/projects per person, so that if a company required people with more than 5 skills, this count would help. Although I was able to do it (by counting the number of bullets under selected headers), it felt like tedious and, frankly, unnecessary work. In my weekly meeting with my mentor, I discussed this, and that is where I learned about the difference between NLP++ and general querying (natural-language querying), which focuses on specifics such as counts. That's when I abandoned the idea.
Then I started rebuilding the analyzer files from scratch, borrowing some elements from the predecessor. I completed isolating the text zones (this was done by finding the headers in the text and grouping the text between consecutive headers as one zone; see the sketch below).
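A rough sketch of that grouping, assuming an earlier pass has already built _header nodes:

@NODES _ROOT

@POST
single()

@RULES
# A zone is a header plus everything up to the next header (or end of text).
_zone <-
    _header
    _xWILD [plus fail=(_header _xEND)]
    @@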
DAYS 4 and 5:
On these days, I focused on two specific elements of the resume (if present): links (email, LinkedIn, GitHub, any website, blog, etc.) and phone numbers. Although GitHub and LinkedIn links had already been handled, those passes were too specific and there was no general method. Also, for email addresses, if the person used unusual characters or any domain (Google, Yahoo, Outlook, etc.), this also had to be identified. So I came up with an analyzer that covers all types of links.
The basic idea behind the code is that any link, whatever its type, is bound to have a domain extension and is not allowed to contain whitespace. The most commonly used domain extensions are .com, .org, .net, .edu, .gov, .io, and .in.
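Here is a rough sketch of the idea in NLP++ (the nonterminal names _email and _url and the short extension list are just illustrative):

@NODES _ROOT

@POST
single()

@RULES
# An email address: non-whitespace characters around an @ sign.
# Listing this rule first so emails are not half-captured as links.
_email <-
    _xWILD [plus match=(_xALPHA _xNUM \. \- \_ \+)]
    \@
    _xWILD [plus match=(_xALPHA _xNUM \. \-)]
    @@

# A generic link: an optional www prefix, a name ending in a known
# domain extension, then an optional path with no whitespace in it.
# (Multi-part domains such as mail.google.com would need an extra pass.)
_url <-
    www [opt]
    \. [opt]
    _xWILD [plus match=(_xALPHA _xNUM \-)]
    \.
    _xWILD [s one match=(com org net edu gov io in)]
    _xWILD [star match=(\/ \. \- \_ \? \= _xALPHA _xNUM)]
    @@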
A similar idea was followed for phone numbers. Phone numbers can appear in different formats, for example 9876543210, 123-456-7890, (123) 456-7890, or +91 98765 43210. But the common thing I observed is that a number has either 10 to 12 digits (a mobile number, possibly with a country code) or 7 (a landline). That was the basis for my idea here. After a lot of trial and error, the following analyzer sequence was created:
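A minimal sketch of the idea, covering just two of the formats above (it leans on NLP++'s strlength and N functions to check digit counts):

@NODES _ROOT

# Case 1: the whole number is a single token, e.g. 9876543210 or 1234567.
@CHECK
if (strlength(N("$text",1)) != 10 && strlength(N("$text",1)) != 7)
    fail();
@POST
single()
@RULES
_phone <-
    _xNUM [s]
    @@

# Case 2: three separated groups, e.g. 123-456-7890 or (123) 456-7890.
@CHECK
if (strlength(N("$text",2)) != 3)  # the first group is the 3-digit area code
    fail();
@POST
single()
@RULES
_phone <-
    \( [opt]
    _xNUM [s]
    \) [opt]
    _xWILD [s opt match=(\- \. _xWHITE)]
    _xNUM [s]
    _xWILD [s opt match=(\- \. _xWHITE)]
    _xNUM [s]
    @@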