Day 1
On the first day, I met my mentor for an Intern Kickoff Session.
He gave us some valuable insights into NLP++ and its importance in the modern world. This was followed by a short session on the syntax and workings of NLP++, where a few pointers were given to get started with building an analyzer for the USPS address format.
Having already worked on it a little, I started making the major changes suggested by my mentor, which would make the analyzer much better at processing addresses.
The fields common to addresses of all formats are things like the name, street name, state, ZIP code, etc. So I first focused on retrieving the names and states from the address. The procedure to create the analyzer for one such element, the name, is explained below, with a rough sketch of the matching logic after the list.
- Generally, names start with designations like Mr, Miss, Mrs, etc. I therefore built a dictionary of these designations and gave them attributes (such as mister and miss, which can be used as alternatives) so that they can be located in the parse tree.
- Any number of words following the designation, but within the same line, generally qualifies as the full name in a postal address according to the USPS format.
- There is an inbuilt dictionary of first names and last names. Based on this, even when there is no designation, any first name followed by any number of words on that line and ending with a last name qualifies as the full name (e.g., Walter R Witherspoon Jr).
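The actual NLP++ passes and dictionaries are not reproduced here, but the matching logic can be sketched roughly in Python. The designation/first-name/last-name sets and the extract_full_name helper below are small illustrative stand-ins, not the real dictionaries or pass code:

```python
# Rough Python sketch of the name-matching logic described above.
# The sets below are tiny stand-ins for the real NLP++ dictionaries.
DESIGNATIONS = {"mr", "mr.", "mister", "mrs", "mrs.", "miss", "ms", "ms."}
FIRST_NAMES = {"walter", "john", "mary"}
LAST_NAMES = {"witherspoon", "smith", "jones"}
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii"}

def extract_full_name(line):
    """Return the full name found on one address line, or None."""
    tokens = line.split()
    lowered = [t.lower().strip(",") for t in tokens]
    if not tokens:
        return None

    # Rule 1: a designation followed by the rest of the line is the full name.
    if lowered[0] in DESIGNATIONS:
        return " ".join(tokens[1:]) or None

    # Rule 2: a known first name, any words in between, and a known last name
    # (optionally followed by a suffix such as Jr) also qualify as a full name.
    if lowered[0] in FIRST_NAMES and len(lowered) >= 2:
        last = lowered[-2] if lowered[-1] in SUFFIXES and len(lowered) >= 3 else lowered[-1]
        if last in LAST_NAMES:
            return " ".join(tokens)
    return None

print(extract_full_name("Mr John Smith"))            # -> John Smith
print(extract_full_name("Walter R Witherspoon Jr"))  # -> Walter R Witherspoon Jr
```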
Similarly, I worked on the state and ZIP code of the address.
The analyzer passes that extract these fields also put the information into the knowledge base.
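The passes themselves and the knowledge-base calls are not shown in this post; a minimal Python sketch of the equivalent logic, with a plain nested dictionary standing in for the NLP++ knowledge base, would look something like this (the state list is only a small sample):

```python
import re

# Small sample; the real pass uses a dictionary of all state codes and names.
STATE_CODES = {"NY", "CA", "TX", "FL", "WA"}
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")  # ZIP or ZIP+4

def parse_state_zip(last_line):
    """Pull the state and ZIP code out of the city/state/ZIP line."""
    fields = {}
    zip_match = ZIP_RE.search(last_line)
    if zip_match:
        fields["zip"] = zip_match.group(0)
    for token in last_line.replace(",", " ").split():
        if token.upper() in STATE_CODES:
            fields["state"] = token.upper()
    return fields

# Plain dictionary standing in for the NLP++ knowledge base.
knowledge_base = {"addresses": []}

def add_to_kb(name, last_line):
    record = {"name": name}
    record.update(parse_state_zip(last_line))
    knowledge_base["addresses"].append(record)

add_to_kb("Walter R Witherspoon Jr", "Albany, NY 12207")
print(knowledge_base)
# {'addresses': [{'name': 'Walter R Witherspoon Jr', 'state': 'NY', 'zip': '12207'}]}
```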
Day 2
On this day, I continued building the address analyzer file and came across some new and interesting NLP++ functions, which I have incorporated into my NLP++ code.
I further read through the USPS address format documentation and found specific elements like Rural Route, Highway Contract, and military addresses, each of which has its own format. I also found that some addresses may contain company names, which were conflicting with the name field. All of these items were therefore isolated and added to the knowledge base by writing NLP++ code and functions.
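The NLP++ code for these cases is not included here; the kinds of line formats involved can be sketched with simplified regular expressions in Python (the patterns and the company-suffix check are rough approximations, not the rules actually used in the analyzer):

```python
import re

# Simplified patterns for the special USPS delivery-line formats mentioned above.
SPECIAL_FORMATS = {
    "rural_route":      re.compile(r"^RR\s+\d+\s+BOX\s+\S+$", re.IGNORECASE),
    "highway_contract": re.compile(r"^HC\s+\d+\s+BOX\s+\S+$", re.IGNORECASE),
    # Military (overseas) addresses use APO/FPO/DPO with AA/AE/AP state codes.
    "military":         re.compile(r"\b(APO|FPO|DPO)\s+(AA|AE|AP)\b", re.IGNORECASE),
}

# Crude suffix check to keep company names from being tagged as person names.
COMPANY_SUFFIXES = {"inc", "inc.", "llc", "corp", "corp.", "co", "co."}

def classify_line(line):
    """Return a label for a special-format address line, or None."""
    stripped = line.strip()
    for label, pattern in SPECIAL_FORMATS.items():
        if pattern.search(stripped):
            return label
    if stripped and stripped.split()[-1].lower() in COMPANY_SUFFIXES:
        return "company"
    return None

print(classify_line("RR 2 BOX 152"))                 # rural_route
print(classify_line("HC 68 BOX 23A"))                # highway_contract
print(classify_line("PSC 802 BOX 74 APO AE 09499"))  # military
print(classify_line("Acme Widgets Inc"))             # company
```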
Additionally, I went through my mentor's NLP++ tutorial videos pertaining to my Resume Analyzer project, as well as some presentations by previous interns who have worked with this technology, just to get an idea of how they approached things.
Day 3
This day marked the continued building and refining of the address analyzer file.
The outcome of this is a knowledge base containing all the information gathered from the individual addresses.
Next, I started focusing on the specifics of my project, gathering words that are commonly used as headers in resumes. I also started looking for resume datasets that I can use in the project, and began writing Python code to convert the text in .pdf/.docx files into a .txt file, which is the format NLP++ generally processes.
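The converter itself is straightforward. A minimal sketch, assuming pdfminer.six for PDFs and python-docx for Word documents (the actual libraries used may differ), looks like this:

```python
from pathlib import Path

from docx import Document                      # python-docx
from pdfminer.high_level import extract_text   # pdfminer.six

def to_txt(path):
    """Convert a .pdf or .docx resume into a plain .txt file for NLP++."""
    src = Path(path)
    if src.suffix.lower() == ".pdf":
        text = extract_text(str(src))
    elif src.suffix.lower() == ".docx":
        text = "\n".join(p.text for p in Document(str(src)).paragraphs)
    else:
        raise ValueError("Unsupported file type: " + src.suffix)
    out_path = src.with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path

# Usage: to_txt("resume.pdf") writes resume.txt next to the original file.
```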