WEEK 5

 

DAY 1:

Today we had a meeting with the HPCC Systems platform team, where we were introduced to Mr. Stuart Ort; the interns then introduced themselves and their projects to the team.

After this, in my meeting with my mentor, I discussed how the bullet and prose gathering had been incorporated into the rest of the analyzer.
As mentioned last week, some gaps are bridged once a prose block or sequence of lines has been gathered.

I therefore wanted to run the dates analyzer on text under both _LINE and _prose. Although @NODES seemingly fits this description (running the rule sequence on the text under all the nodes listed), it does not actually work that way and is not advisable here.

After looking through the documentation, I came across @MULTI, which works exactly as we want, although the computation time may be longer for large parse trees.
That is acceptable in this case, since resumes have limited length.

I applied this accordingly and got the desired results: the dates were all recognized appropriately.
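The difference matters: @MULTI applies the rule pass at every node in the subtree, not just at the nodes named directly. A rough Python analogue of that traversal (the Node class and the date pattern are my own illustrations, not part of NLP++):

```python
import re

# Minimal stand-in for a parse-tree node (illustrative, not the NLP++ API).
class Node:
    def __init__(self, name, text="", children=None):
        self.name = name
        self.text = text
        self.children = children or []

# A simple date pattern for illustration (e.g. "May 2021", "2017-2021").
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\s+\d{4}\b"
    r"|\b\d{4}\s*-\s*\d{4}\b"
)

def find_dates_multi(node, targets=("_LINE", "_prose")):
    """Visit every node in the subtree (like @MULTI), matching dates
    under any node whose name is in `targets`, however deep it sits."""
    hits = []
    if node.name in targets:
        hits += DATE_RE.findall(node.text)
    for child in node.children:
        hits += find_dates_multi(child, targets)
    return hits

tree = Node("_ROOT", children=[
    Node("_LINE", "Internship May 2021"),
    Node("_prose", "Studied from 2017-2021 at ..."),
])
print(find_dates_multi(tree))  # -> ['May 2021', '2017-2021']
```

Because the traversal is fully recursive, it finds _LINE and _prose nodes wherever gathering placed them, which is exactly why it costs more on large trees.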

DAY 2&3:

Next, I worked on Skills, Hobbies (if mentioned), Programming Languages, Languages spoken, etc.
  1. Since I had a dictionary for the languages, I simply used @MULTI to look under _lang and stored this information in a text file.
  2. I did the same for programming languages. This way, even if a person has not listed a language per se in the Skills section but has used it elsewhere (for example, in a project description), that information is still analyzed.
  3. Hobbies are isolated by taking the bullets, lines, or prose under the header. But a person may not necessarily use the word “Hobbies” alone; they may write “My Hobby”, “My Hobbies”, “My Interest”, “Interests”, “General Interests”, etc. The logic used here is: if the header text in a header zone contains “hobb” or “interest”, we look under that header for hobbies/interests.
  4. The same was done for Skills and Languages.
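The substring-based header check described in point 3 can be sketched in Python (the key-to-section mapping is my own illustration of the idea):

```python
# Map a substring of the header text to the section it introduces, so that
# "My Hobbies", "Interests", and "General Interests" all resolve the same way.
SECTION_KEYS = {
    "hobb": "hobbies",
    "interest": "hobbies",   # interests are treated as hobbies here
    "skill": "skills",
    "language": "languages",
}

def classify_header(header_text):
    """Return the section a header introduces, or None if unrecognized."""
    text = header_text.lower()
    for key, section in SECTION_KEYS.items():
        if key in text:
            return section
    return None

print(classify_header("My Hobbies"))         # hobbies
print(classify_header("General Interests"))  # hobbies
print(classify_header("Technical Skills"))   # skills
```

Matching on a lowercase stem like "hobb" covers singular, plural, and possessive variants without enumerating them.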

DAY 4&5:

I was in a slump over how to handle education.
  • First I worked on the grades (%, CGPA, SGPA, GPA, percentile, etc.) and the corresponding numeric figures, and I was able to isolate the values.
  • The real problem I faced was with educational institutions. General machine-learning algorithms learn names from a dataset. Even language models like ChatGPT isolate names based on the knowledge they already have, which is almost like having a database of names.
  • But that is not a good basis for linguistic analysis. Although a dictionary of all names could be created, I wanted to try something different.
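The grade isolation in the first bullet pairs a number with a grade keyword. A minimal sketch of that pairing (the regex is my own illustration, not the analyzer's actual rule):

```python
import re

# Illustrative pattern: a number followed by a grade keyword (92%, 8.7 CGPA),
# or a keyword followed by a number (CGPA: 8.7).
GRADE_RE = re.compile(
    r"(\d+(?:\.\d+)?)\s*(%|cgpa|sgpa|gpa|percentile)"
    r"|(cgpa|sgpa|gpa|percentile)\s*[:\-]?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)

def extract_grades(text):
    """Return (keyword, value) pairs for every grade mention in `text`."""
    grades = []
    for m in GRADE_RE.finditer(text):
        if m.group(1):  # number-first form
            grades.append((m.group(2).lower(), float(m.group(1))))
        else:           # keyword-first form
            grades.append((m.group(3).lower(), float(m.group(4))))
    return grades

print(extract_grades("Scored 92% in class XII; CGPA: 8.7"))
# -> [('%', 92.0), ('cgpa', 8.7)]
```

Listing "cgpa" and "sgpa" before "gpa" in the alternation matters, since the regex engine tries alternatives in order and "gpa" is a suffix of both.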

After some brainstorming, I got a good idea for isolating the name of the educational institute. To know that it is an educational place, we have a limited vocabulary:
 

So I created a dictionary for that vocabulary.
Here was the real ideation: we know that name words are capitalized, so we look at the words before “_edu” one by one and assign the attribute “ispartof” to each.

After “_edu”, the vocabulary should contain capitalized words plus words like “of”, “and”, “for”, etc. This is a conclusion I came to after looking at 2,000 college names.

Before _edu we can also have initials, as in “R.V.” College. So another check had to be done: if we find a “.”, it needs to be preceded by a capitalized letter (length 1). I made all of this part of a function that assigns attributes to nodes accordingly.

We then look for a plus match of wildcards that all carry the “ispartof” attribute, so that we can group the full name of the institution.
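The mark-then-group idea can be sketched in Python (the word sets and function names are hypothetical stand-ins for the NLP++ dictionary and rule):

```python
# Stand-ins for the _edu dictionary and the allowed lowercase "glue" words.
EDU_WORDS = {"college", "university", "institute", "school", "academy"}
GLUE_WORDS = {"of", "and", "for", "the"}

def is_part_of(token):
    """Crude analogue of the "ispartof" attribute: the _edu anchor itself,
    a glue word, or any capitalized token (which also covers initials
    like "R.V.")."""
    low = token.lower()
    return low in EDU_WORDS or low in GLUE_WORDS or token[:1].isupper()

def extract_institution(tokens):
    """Find the _edu anchor, then expand left and right over the maximal
    run of marked tokens (the "plus match") and join it into one name."""
    marks = [is_part_of(t) for t in tokens]
    for i, t in enumerate(tokens):
        if t.lower() in EDU_WORDS:
            lo = i
            while lo > 0 and marks[lo - 1]:
                lo -= 1
            hi = i
            while hi + 1 < len(tokens) and marks[hi + 1]:
                hi += 1
            return " ".join(tokens[lo:hi + 1])
    return None

tokens = "Studied at R.V. College of Engineering , Bengaluru".split()
print(extract_institution(tokens))  # -> R.V. College of Engineering
```

This sketch over-extends if an unrelated capitalized word sits adjacent to the name; the actual analyzer's extra check that a “.” must follow a single capitalized letter tightens exactly such cases.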

I ran this analyzer on many pieces of text, and it isolated the names almost perfectly. Here is the output:




I also built a dictionary of all the cities in India. This prevents mix-ups between the college name and the city name, since the city is also capitalized. It also paves the way for further information extraction: if the city name appears in the same line/prose/bullet or in the succeeding line, it indicates the location of the school.
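A small sketch of how such a city dictionary can serve both roles at once, trimming the institution span and reporting a location (the city set here is a tiny hypothetical sample, not the full dictionary):

```python
# Tiny sample standing in for the full dictionary of Indian cities.
INDIAN_CITIES = {"bengaluru", "mumbai", "delhi", "chennai", "pune"}

def split_city(name_tokens):
    """Drop a trailing city token from an institution span and report it
    separately as the school's location."""
    if name_tokens and name_tokens[-1].lower() in INDIAN_CITIES:
        return name_tokens[:-1], name_tokens[-1]
    return name_tokens, None

tokens = ["National", "Institute", "of", "Technology", "Pune"]
print(split_city(tokens))
# -> (['National', 'Institute', 'of', 'Technology'], 'Pune')
```

The same lookup applied to the succeeding line gives the location signal described above without any extra machinery.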

