Nauchnyi dialog

Typological Differences of Natural and Neural Network-Generated Texts in a Quantitative Aspect

https://doi.org/10.24224/2227-1295-2023-12-7-47-65

Abstract

This article identifies features that distinguish human-written texts from texts generated by the GPT-3 neural network, which have not yet been studied systematically or in depth. The corpus comprises 160 texts on four topics (“Higher Education in My Eyes,” “How to Remain Human in Inhuman Conditions,” “How I Spent the Summer,” “Teacher of the Year”): 80 generated by the neural network and 80 written by humans. The texts were analyzed using the methods of quantitative linguistics. For each text, a concordance was compiled in the AntConc program, and the quantitative values used in the analysis were taken from it. The authors reached the following conclusions: (1) in the generated texts, the words that appear in the title occur with the highest frequency; (2) the relative frequency of title words in the generated texts is disproportionately high; (3) the list of the 20 most frequent words across all generated texts contains the largest number of content (full-meaning) words; (4) the lexical diversity coefficient of the natural texts examined is significantly higher than that of the generated texts. The findings may be useful to both educators and machine learning specialists.
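The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: they worked with AntConc concordances, and the abstract does not give the exact formula of their lexical diversity coefficient. The sketch assumes a plain type-token ratio, a simple regular-expression tokenizer, and no lemmatization (which would matter for an inflected language such as Russian). All function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters only, any script)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def top_words(tokens, n=20):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


def title_word_share(tokens, title):
    """Relative frequency of title words: share of tokens that also occur in the title."""
    if not tokens:
        return 0.0
    title_words = set(tokenize(title))
    hits = sum(1 for token in tokens if token in title_words)
    return hits / len(tokens)


def type_token_ratio(tokens):
    """Assumed stand-in for the lexical diversity coefficient: distinct words / all words."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy example (invented text, not taken from the study's corpus).
title = "How I Spent the Summer"
text = "I spent the summer at the sea. The summer was warm."
tokens = tokenize(text)

print(top_words(tokens, 3))
print(round(title_word_share(tokens, title), 3))
print(round(type_token_ratio(tokens), 3))
```

Because the type-token ratio depends on text length, a comparison of natural and generated texts along these lines would presumably need texts of similar size or a length-normalized measure.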

About the Authors

R. E. Telpov
Pushkin State Russian Language Institute
Russian Federation

Roman E. Telpov, PhD in Philology, Associate Professor, Department of General and Russian Linguistics

Moscow



S. V. Lartsina
Pushkin State Russian Language Institute
Russian Federation

Stanislava V. Lartsina, Master’s degree student, Department of General and Russian Linguistics

Moscow



For citations:


Telpov R.E., Lartsina S.V. Typological Differences of Natural and Neural Network-Generated Texts in a Quantitative Aspect. Nauchnyi dialog. 2023;12(7):47-65. (In Russ.) https://doi.org/10.24224/2227-1295-2023-12-7-47-65

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2225-756X (Print)
ISSN 2227-1295 (Online)