DAY 1:
I started this week by looking for sentence enders so that I could isolate prose. But I noticed that a period ("\.") is not always a full stop; it can also appear in company names such as Company ABC Ltd. Therefore I started by isolating company names.
I had a meeting with my mentor today where I discussed this, and a lot of things were clarified. Firstly, I had been focusing too deeply on my example resume, which is not a good basis for creating a generic analyzer. Secondly, if words have to be compared against a list, it is more advisable to use a dictionary.
Therefore I started collecting actual resumes from my friends and people I know, as no data is as good as real data. I used these to check whether the analyzer sequence worked the way it was supposed to. As expected, it failed in a few areas and I started working on those. I started with the text partitioning again, isolating headers as best I could.
Next, I created dictionaries for both domain extensions (link analysis) and company abbreviations (company name analysis).
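To make the idea concrete, here is a minimal sketch in Python (not NLP++, and not the actual analyzer) of how a company abbreviation dictionary can stop a period from being treated as a sentence ender. The `COMPANY_ABBREVIATIONS` entries and the `split_sentences` name are made up for illustration.

```python
import re

# Hypothetical abbreviation dictionary; a real one would be much larger.
COMPANY_ABBREVIATIONS = {"ltd", "inc", "co", "corp", "pvt", "llc"}

def split_sentences(text):
    """Split on periods, except when the period follows a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"\.", text):
        # Word immediately before the period, if there is one.
        preceding = re.findall(r"[A-Za-z]+$", text[start:match.start()])
        if preceding and preceding[0].lower() in COMPANY_ABBREVIATIONS:
            continue  # Part of a company name, not a sentence end.
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("I worked at Company ABC Ltd. for two years. Then I moved."))
```

One known weakness of this sketch: a sentence that genuinely ends with "Ltd." will not be split, which is exactly the kind of border case that needs extra rules.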
DAY 2:
On this day, I started refining all that I did in the previous week.
NLP++ programming is like that. There are times when we may feel that a rule is proper and that it covers all the sample examples we have worked on. But border cases where the analyzer fails will always arise, which is why it is so important to retrace your steps and rectify things. There were times when I had to scrap all that I did and rebuild.
Next, I also refined the text zones as best as I could and the text partitioning worked well for all the test files. This was done by focusing on the dictionary for the most common resume headers.
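As a rough illustration of header-driven partitioning, here is a small Python sketch (again, not the NLP++ passes themselves). The header list in `RESUME_HEADERS` and the function name `partition_by_headers` are assumptions made up for this example.

```python
# Hypothetical header dictionary; the entries are illustrative only.
RESUME_HEADERS = {"education", "experience", "work experience", "skills", "projects"}

def partition_by_headers(lines):
    """Group lines into zones, starting a new zone at each known header."""
    zones = []
    current = {"header": None, "lines": []}
    for line in lines:
        # Normalize so "EDUCATION:" and "Education" match the same entry.
        key = line.strip().rstrip(":").lower()
        if key in RESUME_HEADERS:
            if current["header"] is not None or current["lines"]:
                zones.append(current)
            current = {"header": key, "lines": []}
        elif line.strip():
            current["lines"].append(line.strip())
    if current["header"] is not None or current["lines"]:
        zones.append(current)
    return zones

resume = ["Jane Doe", "EDUCATION:", "B.Tech, 2020", "", "Skills", "Python"]
for zone in partition_by_headers(resume):
    print(zone["header"], zone["lines"])
```

Text that appears before the first header (usually the name and contact details) ends up in a zone whose header is `None`.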
I also built a dictionary for the most common Indian names.
DAY 3:
Today, I specifically worked on dates. Dates and date ranges can be expressed in a number of ways, and I worked on the analyzer sequences to recognize them.
The first format was the one containing only numbers:
- dd-mm-yyyy
- dd-mm-yy
- yyyy-mm-dd
- yy-m-d
- mm-yyyy
- dd/mm/yyyy
- dd/mm/yy
- yyyy/mm/dd
- yy/m/d
- mm/yyyy
There were some more variations as well.
The analyzer sequence also checks whether the date values fall in the correct range: Day (1-31), Month (1-12), Year (1900-2030). For now, I did not account for leap years or for the difference between months having 30 days and 31 days.
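The range check can be sketched in Python roughly as follows. This only covers the dd-mm-yyyy and dd/mm/yyyy shapes, and the regular expression and the name `parse_dd_mm_yyyy` are my own illustration rather than the NLP++ rules.

```python
import re

# Day and month of 1-2 digits, a 4-digit year, and the same separator twice.
NUMERIC_DATE = re.compile(
    r"^(?P<day>\d{1,2})(?P<sep>[-/])(?P<month>\d{1,2})(?P=sep)(?P<year>\d{4})$"
)

def parse_dd_mm_yyyy(text):
    """Return the date parts if they fall in the allowed ranges, else None."""
    match = NUMERIC_DATE.match(text.strip())
    if not match:
        return None
    day = int(match.group("day"))
    month = int(match.group("month"))
    year = int(match.group("year"))
    # Same simplification as above: no leap years, no 30/31-day months.
    if 1 <= day <= 31 and 1 <= month <= 12 and 1900 <= year <= 2030:
        return {"day": day, "month": month, "year": year}
    return None

print(parse_dd_mm_yyyy("12-11-2012"))
print(parse_dd_mm_yyyy("32/01/2020"))
```

Because month lengths are ignored, something like 31/02/2020 is still accepted by this sketch.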
The next format was the one that contained month names, like:
- 12 January 2021
- Dec 2,2011 etc.
I created a dictionary of month names (having both March and Mar) and used this accordingly in the sequence.
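Here is a small Python sketch of the month dictionary idea, handling only the two example shapes above. The names `MONTHS` and `parse_named_month_date` and both regular expressions are assumptions for illustration, not the actual NLP++ dictionary or passes.

```python
import re

FULL_NAMES = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]

# Map both the full name and the 3-letter form to the month number.
MONTHS = {}
for number, name in enumerate(FULL_NAMES, start=1):
    MONTHS[name.lower()] = number
    MONTHS[name[:3].lower()] = number

DAY_MONTH_YEAR = re.compile(r"^(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})$")    # 12 January 2021
MONTH_DAY_YEAR = re.compile(r"^([A-Za-z]+)\s+(\d{1,2})\s*,\s*(\d{4})$")  # Dec 2,2011

def parse_named_month_date(text):
    """Parse a date containing a month name; return None if it is not valid."""
    text = text.strip()
    match = DAY_MONTH_YEAR.match(text)
    if match:
        day, name, year = match.groups()
    else:
        match = MONTH_DAY_YEAR.match(text)
        if not match:
            return None
        name, day, year = match.groups()
    month = MONTHS.get(name.lower())
    if month is None:
        return None  # The word is not in the month dictionary.
    day, year = int(day), int(year)
    if not (1 <= day <= 31 and 1900 <= year <= 2030):
        return None
    return {"day": day, "month": month, "year": year}

print(parse_named_month_date("12 January 2021"))
print(parse_named_month_date("Dec 2,2011"))
```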
The last thing I worked on was date ranges of the following types:
- 2012-2013
- 2013 - present
- 12/11/12- 30/11/14
- from 2020 till 2023
- from 2020 till present
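The five range shapes above can be captured with one pattern in a Python sketch like the one below. It leaves the endpoints as raw strings (they would still need the date checks described earlier), and the pattern and the name `parse_date_range` are assumptions made for this example.

```python
import re

# Optional "from", a numeric start, a "-" or "till" separator, then a
# numeric end or the word "present". Hyphenated endpoints such as
# 12-11-12 are deliberately not handled, since "-" is the separator here.
DATE_RANGE = re.compile(
    r"^(?:from\s+)?(?P<start>[\d/]+)\s*(?:-|till)\s*(?P<end>[\d/]+|present)$",
    re.IGNORECASE,
)

def parse_date_range(text):
    """Split a date range into start and end; end is None when ongoing."""
    match = DATE_RANGE.match(text.strip())
    if not match:
        return None
    end = match.group("end")
    ongoing = end.lower() == "present"
    return {
        "start": match.group("start"),
        "end": None if ongoing else end,
        "ongoing": ongoing,
    }

for sample in ["2012-2013", "2013 - present", "12/11/12- 30/11/14",
               "from 2020 till 2023", "from 2020 till present"]:
    print(sample, "->", parse_date_range(sample))
```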
The analyzer sequence worked well on 90% of the date types found in the resume dataset, but I found that it failed on the most basic one: 01/02/07. I plan to work on this tomorrow.
DAY 4&5:
I found further shortcomings in the date analyzer, such as dates like 01/02. I worked on all of them and improved it.
Later, I had a meeting with my mentor where we discussed the dates analyzer. He suggested an improvement: I could create a function that looks at numbers in general and then assigns an attribute depending on whether the number could be a day, a month, or a year.
Based on this, I created a new function and called it early on in the analyzer.
This method helped me vastly shorten and simplify the code.
Here is an example of the difference:
After the conversion:
This is a very useful way to simplify and generalize the code.
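The general idea of tagging numbers first can be sketched in Python like this. It is only an illustration of the approach, not my NLP++ function; the name `classify_number`, the role labels, and the decision to treat any two-digit number as a possible year are all assumptions.

```python
def classify_number(token):
    """Return the set of date roles a numeric token could play."""
    if not (token.isascii() and token.isdigit()):
        return set()
    value = int(token)
    roles = set()
    if len(token) <= 2 and 1 <= value <= 31:
        roles.add("day")
    if len(token) <= 2 and 1 <= value <= 12:
        roles.add("month")
    if len(token) == 4 and 1900 <= value <= 2030:
        roles.add("year")
    if len(token) == 2:
        # Assumption: any two-digit number may be a short year, as in 01/02/07.
        roles.add("year")
    return roles

for token in ["07", "25", "2021", "1850"]:
    print(token, sorted(classify_number(token)))
```

Once every number carries these roles, the later date passes only need to ask "is this a possible day followed by a possible month?" instead of repeating the range checks in each rule.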
I did this for all the remaining passes of the dates analyzer sequence and also dealt with discrepancies such as dates of the types "Month-year" and "Month, year", as well as single year numbers (only in the range 1900-2030).