WEEK 1


Day 1:

On this first day, I met my mentor for an Intern Kickoff Session.

He gave us some valuable insights into NLP++ and its importance in the modern world. Following this, we had a short session on the syntax and workings of NLP++, where a few pointers were given to get started on building an analyzer for the USPS address format.

Having worked on it a little previously, I started making the major changes suggested by my mentor, which would make the analyzer much better at processing addresses.

The fields common to addresses of all formats are the name, street name, state, ZIP code, etc. So I first focused on retrieving the names and states from the address. The procedure for creating the analyzer for one such element, the name, is explained here, with a rough sketch of the logic after the list.

  • Generally, names start with designations like Mr, Miss, Mrs, etc. I therefore built a dictionary of these designations and gave them attributes (covering alternate forms such as Mister and Miss as well) so that they can be located in the parse tree.
  • Any number of words following the designation, but within the same line, generally qualifies as the full name in a postal address according to the USPS format.
  • There is a built-in dictionary of first names and last names. Based on this, even if the designation is absent, a first name followed by any number of words on that line and ending with a last name qualifies as the full name (e.g., Walter R Witherspoon Jr).
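The passes themselves are written in NLP++; purely as an illustration of the logic above, here is a minimal Python sketch, where DESIGNATIONS, FIRST_NAMES, and LAST_NAMES are tiny stand-ins for the real dictionaries:

```python
import re

# Tiny stand-ins for the real dictionaries (designations with their
# alternate forms, plus samples of the built-in first/last name lists).
DESIGNATIONS = {"mr", "mister", "mrs", "miss", "ms"}
FIRST_NAMES = {"walter", "mary", "john"}
LAST_NAMES = {"witherspoon", "smith", "jones"}

def extract_name(line):
    """Apply the two rules: designation-led name, else first ... last name."""
    words = re.findall(r"[A-Za-z]+", line)
    lower = [w.lower() for w in words]

    # Rule 1: a designation followed by the remaining words on the line.
    if lower and lower[0] in DESIGNATIONS:
        return " ".join(words[1:])

    # Rule 2: a first name, any words in between, ending at a last name
    # (a suffix like "Jr" would need extra handling).
    if lower and lower[0] in FIRST_NAMES:
        for i in range(len(lower) - 1, 0, -1):
            if lower[i] in LAST_NAMES:
                return " ".join(words[:i + 1])
    return None

print(extract_name("Mrs Mary Smith"))           # Mary Smith
print(extract_name("Walter R Witherspoon Jr"))  # Walter R Witherspoon
```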

Similarly, I worked on the state and ZIP code of the address.
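Again, the actual matching happens in NLP++ passes; a minimal Python sketch of the idea (with only a handful of the USPS state abbreviations listed) would be:

```python
import re

# Only a handful of the two-letter USPS state abbreviations, for brevity.
STATES = {"NY", "CA", "TX", "FL", "WA", "IL"}

# A ZIP code is five digits, optionally followed by the ZIP+4 extension.
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")

def find_state_and_zip(line):
    """Return the (state, zip) pair found on an address line, if any."""
    tokens = line.replace(",", " ").upper().split()
    state = next((t for t in tokens if t in STATES), None)
    match = ZIP_RE.search(line)
    return state, match.group(0) if match else None

print(find_state_and_zip("Anytown, NY 12345-6789"))  # ('NY', '12345-6789')
```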

The analyzer passes create these matches and write the resulting knowledge into the knowledge base.
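NLP++ builds the knowledge base with its own KB functions inside the passes; as a rough Python stand-in only, each address ends up as a record whose attributes hold the extracted fields:

```python
# Rough stand-in for the knowledge base: one record per address, with an
# attribute for each field the passes extracted (names here are illustrative).
knowledge_base = {}

def add_to_kb(kb, addr_id, **fields):
    """Store the extracted fields for one address under its own record."""
    kb.setdefault(addr_id, {}).update(fields)

add_to_kb(knowledge_base, "address1",
          name="Walter R Witherspoon", state="NY", zip="12345-6789")
add_to_kb(knowledge_base, "address1", street="123 Main St")

print(knowledge_base["address1"]["state"])  # NY
```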


Day 2:

On this day, I continued building the address analyzer file and came across some new and interesting NLP++ functions, which I have incorporated into my NLP++ code.

I read further through the USPS address format documentation and found special elements like Rural Route, Highway Contract, and military addresses, each of which has a specific format. I also found that some addresses may contain company names, which conflicted with the name field. All of these items were therefore isolated and added to the knowledge base by writing NLP++ code and functions.
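For illustration, these special lines follow fixed shapes that can be written out as patterns; the real handling is done with NLP++ rules, and this Python sketch only shows the shapes themselves:

```python
import re

# Fixed shapes of the special USPS address lines, as rough patterns.
SPECIAL_FORMATS = {
    "rural_route":      re.compile(r"^RR\s+\d+\s+BOX\s+\d+", re.I),
    "highway_contract": re.compile(r"^HC\s+\d+\s+BOX\s+\d+", re.I),
    # Military mail: a PSC or UNIT line, delivered through APO/FPO/DPO
    # "cities" with the AA/AE/AP "state" codes.
    "military_unit":    re.compile(r"^(PSC|UNIT)\s+\d+\s+BOX\s+\d+", re.I),
    "military_city":    re.compile(r"^(APO|FPO|DPO)\s+(AA|AE|AP)\b", re.I),
}

def classify_line(line):
    """Return the first special format this address line matches, if any."""
    for label, pattern in SPECIAL_FORMATS.items():
        if pattern.match(line.strip()):
            return label
    return None

print(classify_line("RR 2 BOX 152"))        # rural_route
print(classify_line("PSC 1234 BOX 12345"))  # military_unit
```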

Additionally, I went through my mentor's NLP++ tutorial videos pertaining to my Resume Analyzer project, as well as some presentations by previous interns who have worked with this technology, to get an idea of how they approached things.



Day 3:

This day marked the continuation of building and refining the address analyzer file.

The outcome of this is a knowledge base containing all the information gathered from each address individually.

Next, I started focusing on the specifics of my project, gathering words that are commonly used as headers in resumes. I also began looking for resume datasets to use in the project, and started on Python code to convert the text in .pdf/.docx files into .txt files, the format NLP++ generally processes.
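A minimal sketch of the conversion step for .docx files, assuming the docx2txt package (the folder names are placeholders; the PDF side is covered on Day 4):

```python
import os
import docx2txt  # assumed here; other extraction libraries work too

def docx_to_txt(src_path, out_dir):
    """Extract the text of a .docx resume and save it as a .txt file."""
    text = docx2txt.process(src_path)
    base = os.path.splitext(os.path.basename(src_path))[0]
    out_path = os.path.join(out_dir, base + ".txt")
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return out_path

# Convert every .docx resume in a folder (placeholder paths).
for fname in os.listdir("resumes"):
    if fname.lower().endswith(".docx"):
        docx_to_txt(os.path.join("resumes", fname), "txt_out")
```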


Day 4:

Today I was successful in finding a Python library that converts PDF files to text properly. Although libraries like PyPdf and Doc2txt are known for doing this, I noticed that any resume with a two-column format (a very popular layout) was being converted incorrectly: these libraries convert line by line, so the lines of the two columns were merged into single lines. After searching, trying out different libraries, and writing some code, I was able to extract text correctly from different PDFs.
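The trick is to extract text by layout blocks rather than by lines; a minimal sketch, assuming PyMuPDF as one library that exposes layout blocks:

```python
import fitz  # PyMuPDF, assumed here; exposes per-block text with coordinates

def pdf_to_text(path):
    """Extract text block by block, reading the left column before the right."""
    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            mid = page.rect.width / 2
            # Each block is (x0, y0, x1, y1, text, block_no, block_type);
            # keep only text blocks (type 0).
            blocks = [b for b in page.get_text("blocks") if b[6] == 0]
            # Left-column blocks first, top to bottom, then the right column,
            # so two-column resumes are not merged line by line.
            blocks.sort(key=lambda b: (b[0] >= mid, b[1], b[0]))
            pages.append("\n".join(b[4].strip() for b in blocks))
    return "\n\n".join(pages)

with open("resume.txt", "w", encoding="utf-8") as out:
    out.write(pdf_to_text("resume.pdf"))
```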

From the resume datasets collected yesterday, I converted all of the resumes into .txt files. A total of approximately 2,500 resumes have been converted and stored so far.


Day 5:

On this day, I spent some more time on NLP++. I went through the example analyzers created by the developer and looked at the different functions used and how they behave during execution. I also went through the Knowledge Base functions, which, although difficult to grasp, are very important for building or manipulating the knowledge base.

I also went through some tutorial videos again to clear up some doubts and prepare myself for working with the software in the coming weeks.

