Shallow Question Answering using NLP

Khuzaima Pishori
2 min read · Apr 14, 2022

Generally speaking, a thesaurus can be viewed as a list of words, each paired with other words related to it. For example, car has the following synonyms: auto, automobile, machine, motor, motorcar, motor vehicle; and its related words are: bus, coach, minibus; beach buggy, brougham, compact, convertible, coupe, dune buggy, fastback, gas-guzzler, hardtop, hatchback, hot rod, jeep, limousine, roadster, sedan, sports car, station wagon, stock car, subcompact, van; flivver, jalopy.

Related words are generally collected in one of two ways:

• Manually built: compiled by people who collect the words they consider related to a given word.

• Automatically built (distributional thesaurus): derived automatically from a selected corpus, usually by comparing the contexts in which words are used. For example, given the phrases drink juice, drink lemonade, make juice, make lemonade, delicious juice, and delicious lemonade, there may be enough evidence to conclude that juice and lemonade are related words.

Thesauri can also be classified by their domain:

• Specific domain: thesauri (manual or distributional) built around a particular subject.
• General domain: thesauri that draw words from many different areas and cover a general vocabulary.
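To make the distributional idea concrete, here is a minimal sketch that compares the contexts of words in the juice and lemonade phrases above. It is only an illustration under simple assumptions: the "context" of a word is taken to be the other words in the same phrase, similarity is measured with the Jaccard index, and the extra phrase "drive car" is added purely as a contrast. None of these choices are claimed to match how the thesauri discussed in this post were actually built.

```python
from collections import defaultdict

# Phrases from the example above, plus one unrelated phrase
# ("drive car") added here only as a contrast case.
phrases = [
    "drink juice", "drink lemonade",
    "make juice", "make lemonade",
    "delicious juice", "delicious lemonade",
    "drive car",
]


def build_contexts(phrases):
    """Map each word to the set of other words it appears with in a phrase."""
    contexts = defaultdict(set)
    for phrase in phrases:
        tokens = phrase.split()
        for i, token in enumerate(tokens):
            contexts[token].update(tokens[:i] + tokens[i + 1:])
    return contexts


def jaccard(a, b):
    """Share of contexts two words have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


contexts = build_contexts(phrases)
sim_juice_lemonade = jaccard(contexts["juice"], contexts["lemonade"])
sim_juice_car = jaccard(contexts["juice"], contexts["car"])
print(sim_juice_lemonade, sim_juice_car)
```

Here juice and lemonade share all of their contexts (drink, make, delicious), while juice and car share none, so the first pair would be proposed as related words and the second would not. A real distributional thesaurus would work over a large corpus and typically use weighted context counts rather than plain sets.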

Thesauri have been used extensively for query expansion, so we expected our system to benefit from them. We used two thesauri for these experiments:

1. A distributional thesaurus built from the Encarta Encyclopedia.

2. A manual thesaurus based on the Anaya dictionary, using the synonyms from that dictionary.

We expand the query (after some words have already been omitted): go procedure representative election alumni council scholar becomes, for example: go transport take guide drive use procedure representative alumni student council meeting scholar educational. Some words have more related words than others, and words that are not present in the thesaurus are not expanded.
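The query expansion step can be sketched roughly as follows. This is a hypothetical illustration, not the system's actual code: the function name `expand_query` and the contents of `toy_thesaurus` are assumptions, with the entries simply read off the example query above. In this sketch every original query word is kept, and a word with no thesaurus entry is passed through unchanged.

```python
# Toy thesaurus (an assumption for illustration): each word maps to
# the related words that should be appended after it in the query.
toy_thesaurus = {
    "go": ["transport", "take", "guide", "drive", "use"],
    "alumni": ["student"],
    "council": ["meeting"],
    "scholar": ["educational"],
}


def expand_query(query, thesaurus):
    """Append each word's related words right after it; keep unknown words as-is."""
    expanded = []
    for word in query.split():
        expanded.append(word)
        expanded.extend(thesaurus.get(word, []))
    return " ".join(expanded)


query = "go procedure representative election alumni council scholar"
expanded = expand_query(query, toy_thesaurus)
print(expanded)
```

In this toy setup, "go" receives five related words while "procedure", "representative", and "election" receive none, which mirrors the observation that some words expand more than others and some are not in the thesaurus at all.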

Another NLP technique used in this work was lemmatization. Although a lemmatizer is a language-dependent resource, lemmatizers are available for many languages; moreover, it is also possible to lemmatize with unsupervised methods.
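As a rough illustration of what lemmatizing does, the sketch below maps inflected forms to their dictionary form with a small lookup table. The table `TOY_LEMMAS` and the function `lemmatize` are hypothetical and far simpler than a real lemmatizer, which would rely on a full lexicon or a learned model for the target language.

```python
# Tiny hand-written lookup table (an assumption for illustration only).
TOY_LEMMAS = {
    "went": "go",
    "goes": "go",
    "going": "go",
    "elections": "election",
    "representatives": "representative",
    "councils": "council",
}


def lemmatize(text, lemmas=TOY_LEMMAS):
    """Lowercase the text, split on whitespace, and replace known forms by their lemma."""
    return [lemmas.get(token, token) for token in text.lower().split()]


lemmatized = lemmatize("Representatives went to the council elections")
print(lemmatized)
```

Reducing words to their lemma in this way means that a query word such as "elections" can be matched against a thesaurus or document entry stored as "election".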
