Introduction

In standard Russian, third person pronouns in prepositional constructions take an initial [n] after most prepositions. In some dialects this is not the case. The dialect of the village of Mikhalevskaya is characterized by forms without the initial [n]. However, the speech of its speakers is becoming standardized, and dialectal forms occur less and less often.

The data set used in this project consists of 1015 observations. The output variable is the absence or presence of the initial [n] (categorical). The input variables are the informant’s ID (categorical), year of birth (numerical), gender (categorical: male or female) and education level (categorical), as well as variables that characterize the prepositional construction: the type (categorical) and frequency (numerical) of the preposition, and the form (categorical) and case (categorical) of the pronoun.

Our hypothesis is that sociolinguistic factors (the informants’ age, gender and education) influence the proportion of dialectal forms. The main expectation is that the younger a speaker is, the more forms with [n] they produce. A second, weaker supposition is that a higher education level (i.e. more contact with the standard variety) also goes with more occurrences of initial [n]. The remaining variables serve as possible predictors, although we do not know to what degree and in which direction they influence the absence or presence of [n].

Description of the Phenomenon

The phenomenon examined in this paper is one of the linguistic variables that distinguish the dialect of Mikhalevskaya from the standard language. The initial [n] arose through the reanalysis of constructions in which the prepositions vъn ‘in’, kъn ‘to’ and sъn ‘with’ combined with third person pronouns (it later spread to other prepositions); this took place very early in the history of Russian. In modern standard Russian, the initial nasal is obligatory in most prepositional constructions (those with primary prepositions) and is optional or even impossible with some other prepositions. Examples of initial [n]: u n’ego ‘by him’, na n’ix ‘on them’, s n’im ‘with him’. In some Russian dialects, by contrast, the initial nasal after prepositions has been lost, and its absence has become a dialectal feature.

Data Collection and the Data Base

The data for this research come from the Ustja River Basin Corpus, which includes material collected from 2013 to 2016 during four field trips to Mikhalevskaya, a village in the Ustya district of Arkhangelskaya Oblast. The corpus consists of interviews transcribed in standard Russian orthography and aligned with the original audio (von Waldenfels et al. 2014). The data were collected through CQP queries of the form [lemma="pronoun"] ::match.utterance_spkr="speaker", where pronoun is replaced by a pronoun lemma (он, она or они) and speaker by the code of the selected speaker (пфп1928, авм1922 etc., where the letters abbreviate the speaker’s name and the digits give their year of birth). For example, the query [lemma="он"] ::match.utterance_spkr="пфп1928" finds all forms of the pronoun он (singular) used by the speaker PFP, born in 1928.

The data include third person pronouns, singular and plural, in oblique cases in prepositional constructions (1015 occurrences, 33 informants). Masculine and neuter pronouns were considered together because they are not differentiated in the corpus annotation. Each pronominal form was examined for the presence or absence of the initial nasal [n]. For pronouns without the nasal, we did not record the initial sound ([j] or a vowel, which may itself be another parameter of variation), because the quality of the anlaut is often very hard to determine. We only recorded whether the initial nasal [n] is present or not.
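
The query pattern described above can be generated for every combination of pronoun lemma and speaker. The following is only a minimal sketch: the helper build_queries is hypothetical (it is not part of the original workflow), and it only produces the query strings, which would then be submitted to the corpus interface by hand or by another tool.

```r
# Hypothetical helper, for illustration only: builds one CQP query string
# per combination of pronoun lemma and speaker code.
build_queries <- function(lemmas, speakers) {
  grid <- expand.grid(lemma = lemmas, speaker = speakers,
                      stringsAsFactors = FALSE)
  paste0('[lemma="', grid$lemma, '"] ::match.utterance_spkr="',
         grid$speaker, '"')
}

# Two of the speaker codes mentioned in the text, three pronoun lemmas.
queries <- build_queries(c("он", "она", "они"), c("пфп1928", "авм1922"))
print(queries)
```

With three lemmas and two speakers this yields six query strings, one per lemma and speaker pair.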

setwd("C:/Users/Василиса/Documents/MA_HSE/R Statistics")
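# stringsAsFactors = TRUE keeps text columns as factors, as in the output below
# (since R 4.0.0 the default is FALSE)
df <- read.csv("pronouns.csv", stringsAsFactors = TRUE)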
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Before the analysis, we inspect the variables available in the data set and decide which of them may be relevant for the research:

summary(df)
##     speaker         year         gender              lives    
##  mdn1933: 73   Min.   :1922   female:761   Bestuzhevo   : 16  
##  pfp1928: 67   1st Qu.:1933   male  :254   Mikhalevskaya:941  
##  npo1965: 65   Median :1949                Plosskoe     : 58  
##  avm1922: 60   Mean   :1947                                   
##  nnt1960: 60   3rd Qu.:1960                                   
##  lgp1947: 59   Max.   :1996                                   
##  (Other):631                                                  
##                      born        education       index       
##  Mikhalevskaya         :602   high    :352   Min.   :  1352  
##  Plosskoe              :178   high-mid:282   1st Qu.:103055  
##  Bestuzhevo            : 83   low     : 42   Median :204706  
##  Lobanovo-Mikhalevskaya: 56   low-mid :339   Mean   :254985  
##  Fomin Pochinok        : 39                  3rd Qu.:388082  
##  Akichkin pochinok     : 30                  Max.   :757070  
##  (Other)               : 27                                  
##   preposition    prep_type      st_form     case     form     consonant
##  у      :581   initial:273   него   :326   acc: 55   f :257   no :560  
##  с      :194   later  :742   них    :240   dat: 81   m :466   yes:455  
##  к      : 74                 ней    :137   gen:649   pl:292            
##  на     : 48                 ним    :123   ins:212                     
##  за     : 34                 нее    :120   loc: 18                     
##  от     : 25                 ними   : 35                               
##  (Other): 59                 (Other): 34
str(df)
## 'data.frame':    1015 obs. of  13 variables:
##  $ speaker    : Factor w/ 33 levels "ait1954","ans1925",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year       : int  1954 1954 1954 1954 1954 1954 1954 1954 1954 1954 ...
##  $ gender     : Factor w/ 2 levels "female","male": 1 1 1 1 1 1 1 1 1 1 ...
##  $ lives      : Factor w/ 3 levels "Bestuzhevo","Mikhalevskaya",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ born       : Factor w/ 8 levels "Akichkin pochinok",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ education  : Factor w/ 4 levels "high","high-mid",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ index      : int  197434 197446 216169 425772 228784 230299 230310 425760 672850 673064 ...
##  $ preposition: Factor w/ 25 levels "без","в","для",..: 24 24 24 24 23 8 24 24 24 24 ...
##  $ prep_type  : Factor w/ 2 levels "initial","later": 2 2 2 2 1 1 2 2 2 2 ...
##  $ st_form    : Factor w/ 8 levels "него","нее","ней",..: 8 1 1 1 7 6 8 8 1 1 ...
##  $ case       : Factor w/ 5 levels "acc","dat","gen",..: 3 3 3 3 4 2 3 3 3 3 ...
##  $ form       : Factor w/ 3 levels "f","m","pl": 3 2 2 2 3 3 3 3 2 2 ...
##  $ consonant  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 2 2 ...

To see whether the frequency of the preposition influences the pronunciation of the following pronoun, we need to compute this frequency and add it to the data. We build a separate table with the relative frequency of each preposition in our sample and then attach it to the data frame as a new numerical column with inner_join:

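# Relative frequency of each preposition among all observations
df %>%
  group_by(preposition) %>%
  summarise(prep_frequency = n() / nrow(df)) ->
  df_freq
# Attach the frequency to every observation (joined on the shared column preposition)
df <- inner_join(df, df_freq)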
## Joining, by = "preposition"
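
The frequency step can be checked on a small example. The sketch below uses invented toy data (not the corpus data) and base R only, so that it runs without the tidyverse; merge plays the role of the inner join here. The object names toy, toy_freq and toy_joined are assumptions made for this illustration.

```r
# Toy data, invented for illustration; not the corpus data.
toy <- data.frame(
  preposition = c("u", "u", "u", "s", "s", "k"),
  consonant   = c("no", "yes", "no", "yes", "yes", "no"),
  stringsAsFactors = TRUE
)

# Relative frequency of each preposition within the sample.
freq <- prop.table(table(toy$preposition))
toy_freq <- data.frame(
  preposition    = names(freq),
  prep_frequency = as.numeric(freq)
)

# Attach the frequency to every observation; merge() is the base R
# counterpart of an inner join on the shared column.
toy_joined <- merge(toy, toy_freq, by = "preposition")

# Sanity checks: no rows are lost, and the frequencies sum to 1.
stopifnot(nrow(toy_joined) == nrow(toy))
stopifnot(isTRUE(all.equal(sum(toy_freq$prep_frequency), 1)))
print(toy_joined)
```

Note that this frequency is relative to the sample itself, so it describes how often a preposition occurs with third person pronouns in these data, not its frequency in the language as a whole.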

Descriptive Statistics

To begin with, we visualize the relationship between the year of birth and the absence or presence of [n]. At first, we do not differentiate between speakers, in order to see the general tendency. The violin plot shows that, overall, observations without [n] are more common among older speakers and observations with [n] among younger ones. We must be careful, however: this kind of visualization does not take into account how many utterances each speaker contributes in total, so it may reflect an imbalance in the collected data rather than a real tendency.

df %>%
  ggplot(aes(consonant, year, fill = consonant, color = consonant)) +
  geom_violin(show.legend = FALSE) +
  labs(title = "Correlation between the year of birth and the absence / presence of [n]", x = "Initial [n]", y = "Year of birth") +
  theme_bw()

Therefore, we draw a scatter plot with one point per speaker to check the relationship between the year of birth and the observed proportion of forms with [n]. Because these proportions are based on different numbers of observations, we also display the number of observations per speaker (as point color) and keep it in mind when reading the plot.

df %>%
  group_by(year, speaker, education) %>%
  summarise(prop_consonant = mean(consonant == "yes"), all_consonant = n()) %>%
  ggplot(aes(year, prop_consonant, color = all_consonant, label = speaker)) +
  geom_text(nudge_y = 0.02, size = 3) +
  geom_point() +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth", x = "Year of birth", y = "Proportion of forms with initial [n]", color = "Number of observations") +
  theme_bw()
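
Why the number of observations matters can be illustrated with a confidence interval for a proportion. The sketch below is not part of the original analysis: the two speakers and their counts are invented, and the exact binomial interval from binom.test is just one possible choice of interval.

```r
# Hypothetical speakers (invented counts): same observed proportion,
# very different numbers of observations.
toy <- data.frame(
  speaker = c("spk_a", "spk_b"),
  inn     = c(3, 30),   # forms with initial [n]
  total   = c(6, 60)    # all forms recorded for the speaker
)

# Exact (Clopper-Pearson) 95% interval for each speaker's proportion.
ci <- t(mapply(function(x, n) binom.test(x, n)$conf.int,
               toy$inn, toy$total))

toy$prop  <- toy$inn / toy$total
toy$lower <- ci[, 1]
toy$upper <- ci[, 2]
toy$width <- toy$upper - toy$lower
print(toy)
```

Both hypothetical speakers have the same observed proportion, but the interval for the speaker with fewer observations is wider, which is why points based on few observations in the scatter plot should be read with caution.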

Since we also suppose that education level may affect how dialectal a speaker’s speech is, we first need to visualize it. We display the education level on the scatter plot. Education is probably not an independent variable: it appears to depend on the year of birth. In the analysis we should therefore keep in mind that these two variables may need to be considered together (for example, as an interaction).

df %>%
  group_by(year, speaker, education) %>%
  summarise(prop_consonant = mean(consonant == "yes")) %>%
  ggplot(aes(year, prop_consonant, colour = education, label = speaker)) +
  geom_text(nudge_y = 0.02, size = 3, show.legend = FALSE) +
  geom_point() +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth: with regard to the education level", x = "Year of birth", y = "Proportion of forms with initial [n]", color = "Education level") +
  theme_bw()
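
The dependence between education level and year of birth can also be inspected numerically at the level of speakers. The following is a sketch on invented data (one row per hypothetical speaker), not a result from the corpus. The ordering of the education levels from low to high is an assumption made here; in the data frame above, the factor levels are in alphabetical order. The Kruskal-Wallis test is one possible choice among several.

```r
# Invented speaker-level data (one row per speaker), for illustration only.
speakers <- data.frame(
  year      = c(1925, 1930, 1938, 1947, 1952, 1960, 1965, 1990),
  education = factor(c("low", "low", "low-mid", "low-mid",
                       "high-mid", "high-mid", "high", "high"),
                     levels = c("low", "low-mid", "high-mid", "high"))
)

# Median year of birth for each education level.
year_by_edu <- aggregate(year ~ education, data = speakers, FUN = median)
print(year_by_edu)

# Rank-based test of whether year of birth differs across education levels.
kw <- kruskal.test(year ~ education, data = speakers)
print(kw$p.value)
```

If the median year of birth rises steadily with education level, as it does in this toy example by construction, the two predictors carry overlapping information and their effects are hard to separate.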

We also expect an effect of gender, so we display this variable on the scatter plot. Here, too, the sample is unbalanced: most male speakers were born between 1950 and 1970, so gender is probably confounded with the year of birth. We should check this relationship with statistical methods.

df %>%
  group_by(year, speaker, gender) %>%
  summarise(prop_consonant = mean(consonant == "yes")) %>%
  ggplot(aes(year, prop_consonant, color = gender, label = speaker)) +
  geom_text(nudge_y = 0.02, size = 3, show.legend = FALSE) +
  geom_point() +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers", subtitle = "Correlation with the year of birth: with regard to the gender", x = "Year of birth", y = "Proportion of forms with initial [n]", color = "Gender") +
  theme_bw()
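
One way to check whether gender and year of birth are related is to compare the birth years of male and female speakers directly. The sketch below uses invented speaker-level data and a Wilcoxon rank-sum test. Both the data and the choice of test are assumptions made for illustration, not part of the original analysis.

```r
# Invented speaker-level data (one row per speaker), for illustration only.
speakers <- data.frame(
  year   = c(1922, 1928, 1933, 1947, 1949, 1954, 1958, 1960, 1965, 1996),
  gender = factor(c("female", "female", "female", "female", "female",
                    "male", "male", "male", "male", "female"))
)

# Range of birth years within each gender.
ranges <- tapply(speakers$year, speakers$gender, range)
print(ranges)

# Wilcoxon rank-sum test: do birth years differ between the two groups?
wt <- wilcox.test(year ~ gender, data = speakers)
print(wt$p.value)
```

A narrow range of birth years for one gender, as for the hypothetical male speakers here, means that any apparent gender effect in that group cannot be separated cleanly from an age effect.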

First, we check whether a linear regression model fits our data. To do that, we transform the data frame into a shorter format in which each observation is not a pronoun with a preposition but a speaker, with the number of dialectal pronunciations (dial, without [n]) and innovative pronunciations (inn, with [n]). Then we plot a linear regression with the year of birth as the predictor.

df %>%
  group_by(speaker, year, gender, education) %>%
  summarise(dial = sum(consonant=="no"), inn = sum(consonant=="yes")) ->
  num_df
num_df %>%
  mutate(perc = inn/(dial + inn)) %>%
  ggplot(aes(year, perc))+
  geom_point()+
  geom_smooth(method = "lm") +
  labs(title = "Proportion of 3rd person pronoun forms with initial [n]: different speakers&q