Assigned
Status Update
Comments
gs...@google.com <gs...@google.com> #2
Hello Donguk,
It depends on which definition of "lemma" you refer to. As an example, dictionary.com has it as: "(linguistics) a word considered as its citation form together with all the inflected forms. For example, the lemma go consists of go together with goes, going, went, and gone". Oxford Dictionaries: "A word or phrase defined in a dictionary or entered in a word list."
The Natural Language API implementation currently provides a lemma for each part of speech, so one for each of: noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article. In your example, the lemma "industry" is provided as a noun lemma, "industrial" as an adjective lemma, and "passionately" as an adverb lemma.
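For reference, a minimal sketch of how these per-part-of-speech lemmas surface through the Python client (assuming the google-cloud-language package is installed and credentials are configured; the example sentence is illustrative):

from google.cloud import language_v1

client = language_v1.LanguageServiceClient()
document = language_v1.Document(
    content="The industry industrialized passionately.",
    type_=language_v1.Document.Type.PLAIN_TEXT,
)
response = client.analyze_syntax(request={"document": document})

for token in response.tokens:
    # Each token carries a single lemma tied to its part-of-speech tag,
    # e.g. "industry" (NOUN), "passionately" (ADV).
    print(token.text.content, token.part_of_speech.tag, token.lemma)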
If you would like your feature request to enjoy improved chances of implementation, you are encouraged to provide a good use case, described in detail and showing the benefits you would derive from the feature's implementation.
[Deleted User] <[Deleted User]> #3
In my case, I am extracting simpler lemmas for topic modeling with the Latent Dirichlet Allocation (LDA) algorithm.
This algorithm analyzes input documents to generate a given number of topics.
Each topic is a probability distribution over words: the likelihood that a word would appear if a document is written on that topic.
Each document is considered to consist of the topics in certain percentages.
For example, say I have some documents that only contain the words dog and cat, and I run LDA with the number of topics set to two.
Then let's assume the resulting topics are T1, which consists 100% of dog, and T2, which consists 100% of cat.
If someone writes a document that is 100% T1, then each word in the document will be dog with 100% probability.
If someone writes a document that is 50% T1 and 50% T2, then each word will be dog or cat with 50/50 odds.
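A toy sketch of this setup, assuming scikit-learn is available (LDA is stochastic, so this is illustrative rather than exact):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["dog dog dog dog", "cat cat cat cat"]

# Bag-of-words counts over the two-word vocabulary {cat, dog}.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Two topics; ideally each topic concentrates its weight on one word.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)

print(vectorizer.get_feature_names_out())  # ['cat' 'dog']
print(lda.components_)   # per-topic word weights
print(doc_topics)        # per-document topic mixture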
Basically, a topic is built from the input words in the documents, and I do not think that differentiating words which share the same meaning but have different parts of speech is good for topic analysis.
For example, let's say there are two short documents like below.
document 1.
industrial revolution.
industrial change.
document 2.
history of industrialization.
revolution in industrialization.
If topic analysis is done with the current lemmas provided by the Google NL API, then these two documents overlap only on the word "revolution", even though a person can easily recognize that the two documents are similar.
However, if the analysis is done with the simpler lemmas I mentioned, then the two documents would also share the lemma 'industri', which makes up almost 50% of the content in each document.
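To make the comparison concrete, a small sketch assuming NLTK is installed; its Snowball stemmer stands in here for the simpler lemma forms I mean:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

doc1 = "industrial revolution industrial change".split()
doc2 = "history of industrialization revolution in industrialization".split()

# Raw word forms overlap only on "revolution".
print(set(doc1) & set(doc2))

# After stemming, the documents also share "industri".
print({stemmer.stem(w) for w in doc1} & {stemmer.stem(w) for w in doc2})
# {'revolut', 'industri'}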
For topic analysis, or some other form of document classification algorithm, I think providing the feature that I am requesting would be helpful.
gs...@google.com <gs...@google.com> #4
Hello Donguk,
Your detailed use case is much appreciated. Developers have been made aware of your feature request and will decide on the course of action. You can follow developments by watching this thread.
Description
The Google Natural Language API syntactic analysis returns the word itself as the lemma for each of the following words:
passion -> passion
passionate -> passionate
passionately -> passionately
industry -> industry
industrial -> industrial
industrialize -> industrialize
industrialization -> industrialization
while the NLTK WordNet lemmatizer gives simpler lemmas, as shown below:
passion -> passion
passionate -> passion
passionately -> passion
industry -> industri
industrial -> industri
industrialize -> industri
industrialization -> industri
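A minimal sketch reproducing the reduced forms above (they match the output of NLTK's Snowball stemmer), assuming NLTK is installed:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["passion", "passionate", "passionately",
             "industry", "industrial", "industrialize", "industrialization"]:
    print(word, "->", stemmer.stem(word))
# passion -> passion, passionate -> passion, passionately -> passion,
# industry -> industri, industrial -> industri,
# industrialize -> industri, industrialization -> industri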
This is similar for Japanese.
For example, given the past tense and the present tense of a Japanese verb meaning "to change", the API returns different lemmas.
変わる -> 変わる
変わった -> 変わっ and た (変わった is treated as two separate tokens, 変わっ and た, by the Google Natural Language API)
while the MeCab Japanese morphological analyzer gives the same lemma for both forms, as shown below:
変わる -> 変わる
変わった -> 変わる
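A minimal sketch of the MeCab side, assuming mecab-python3 with an IPAdic-format dictionary (other dictionaries lay out the feature fields differently):

import MeCab

tagger = MeCab.Tagger()

for text in ["変わる", "変わった"]:
    for line in tagger.parse(text).splitlines():
        if line == "EOS":
            continue
        surface, feature = line.split("\t")
        # With an IPAdic-format dictionary, feature field 6 is the base form (原形).
        print(surface, "->", feature.split(",")[6])
# 変わる -> 変わる
# 変わっ -> 変わる
# た -> た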
I hope the syntactic analysis API will provide an option to return simpler lemma forms, like those produced by NLTK's WordNet lemmatizer or MeCab.