Assigned
Status Update
Comments
gs...@google.com <gs...@google.com> #2
Hello Donguk,
It depends on which definition of "lemma" you refer to. As an example, dictionary.com has it as: "(linguistics) a word considered as its citation form together with all the inflected forms. For example, the lemma go consists of go together with goes, going, went, and gone". Oxford Dictionaries: "A word or phrase defined in a dictionary or entered in a word list."
The Natural Language API implementation currently provides a lemma for each part of speech, so one for each of: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article. In your example, the lemma "industry" is provided as a noun lemma, "industrial" as an adjective lemma, and "passionately" as an adverb lemma.
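For reference, a minimal sketch of how these per-part-of-speech lemmas surface through the Python client (assuming the google-cloud-language package is installed and credentials are configured; the example sentence is illustrative):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The industry industrialized passionately.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_syntax(request={"document": document})

for token in response.tokens:
    # Each token carries a single lemma tied to its part-of-speech tag,
    # e.g. "industry" (NOUN), "passionately" (ADV).
    print(token.text.content, token.part_of_speech.tag, token.lemma)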
If you would like your feature request to enjoy improved chances of implementation, you are encouraged to provide a good use case, described in detail and showing the benefits you would derive from the feature's implementation.
[Deleted User] <[Deleted User]> #3
In my case, I am extracting simpler lemmas for topic modeling with the Latent Dirichlet Allocation (LDA) algorithm.
This algorithm analyzes input documents to generate a given number of topics.
Each topic is a probability distribution over words: the likelihood that a word would appear if a document is written on that topic.
Each document is considered to consist of the topics in certain percentages.
For example, say I have some documents that only contain the words dog and cat, and I run LDA with the number of topics set to two.
Then let's assume the resulting topics are T1, which consists 100% of dog, and T2, which consists 100% of cat.
If someone writes a document that is 100% T1, then each word in the document will be dog with 100% probability.
If someone writes a document that is 50% T1 and 50% T2, then each word will be dog or cat with 50/50 odds.
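A toy sketch of this setup, assuming scikit-learn is available (LDA is stochastic, so this is illustrative rather than exact):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog dog dog dog", "cat cat cat cat"]

# Bag-of-words counts over the two-word vocabulary {cat, dog}.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Two topics; ideally each topic concentrates its weight on one word.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(vectorizer.get_feature_names_out())  # ['cat' 'dog']
print(lda.components_)   # per-topic word weights
print(doc_topics)        # per-document topic mixture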
Basically, a topic is built from the input words in the documents, and I do not think that differentiating words which share the same meaning but have different parts of speech is good for topic analysis.
For example, let's say there are two short documents like below.
document 1.
industrial revolution.
industrial change.
document 2.
history of industrialization.
revolution in industrialization.
If topic analysis is done with the current lemmas provided by the Google NL API, then these two documents overlap only on the word "revolution", even though a person can easily recognize that the two documents are similar.
However, if the analysis is done with the simpler lemmas I mentioned, then the two documents would also share the lemma 'industri', which makes up almost 50% of the content in each document.
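To make the comparison concrete, a small sketch assuming NLTK is installed; its Snowball stemmer stands in here for the simpler lemma forms I mean:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

doc1 = "industrial revolution industrial change".split()
doc2 = "history of industrialization revolution in industrialization".split()

# Raw word forms overlap only on "revolution".
print(set(doc1) & set(doc2))

# After stemming, the documents also share "industri".
print({stemmer.stem(w) for w in doc1} & {stemmer.stem(w) for w in doc2})
# {'revolut', 'industri'}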
For topic analysis, or some other form of document classification algorithm, I think providing the feature that I am requesting would be helpful.
gs...@google.com <gs...@google.com> #4
Hello Donguk,
Your detailed use case is much appreciated. Developers have been made aware of your feature request and will decide on the course of action. You can follow developments by watching this thread.
Description
The Google Natural Language API syntactic analysis returns the word itself as the lemma for each of the following words:
passion -> passion
passionate -> passionate
passionately -> passionately
industry -> industry
industrial -> industrial
industrialize -> industrialize
industrialization -> industrialization
while the NLTK WordNet lemmatizer gives simpler lemmas, as shown below:
passion -> passion
passionate -> passion
passionately -> passion
industry -> industri
industrial -> industri
industrialize -> industri
industrialization -> industri
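A minimal sketch reproducing the reduced forms above (they match the output of NLTK's Snowball stemmer), assuming NLTK is installed:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["passion", "passionate", "passionately",
             "industry", "industrial", "industrialize", "industrialization"]:
    print(word, "->", stemmer.stem(word))
# passion -> passion, passionate -> passion, passionately -> passion,
# industry -> industri, industrial -> industri,
# industrialize -> industri, industrialization -> industri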
This is similar for Japanese.
For example, given the past tense and the present tense of a Japanese verb meaning "to change", the API returns different lemmas.
変わる -> 変わる
変わった -> 変わっ and た (変わった is treated as two separate tokens, 変わっ and た, by the Google Natural Language API)
while the MeCab Japanese morphological analyzer gives the same lemma for both forms, as shown below:
変わる -> 変わる
変わった -> 変わる
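A minimal sketch of the MeCab side, assuming mecab-python3 with an IPAdic-format dictionary (other dictionaries lay out the feature fields differently):

import MeCab

tagger = MeCab.Tagger()

for text in ["変わる", "変わった"]:
    for line in tagger.parse(text).splitlines():
        if line == "EOS":
            continue
        surface, feature = line.split("\t")
        # With an IPAdic-format dictionary, feature field 6 is the base form (原形).
        print(surface, "->", feature.split(",")[6])
# 変わる -> 変わる
# 変わっ -> 変わる
# た -> た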
I hope the syntactic analysis API will provide an option to return simpler lemma forms, like those produced by NLTK's WordNet lemmatizer or MeCab.