Corpus

GENRE-CONTROLLED COMPARABLE AND PARALLEL CORPUS

The corpus built for the project is a genre-controlled comparable-parallel corpus of Polish and English institutional texts (Biel 2016). The main focus corpus of Polish Eurolect comprises ca. 4,000 texts and 15 million words, while the main national administrative reference corpus covers over 4,000 texts and 12 million words.

https://eurolect.ils.uw.edu.pl/wp-content/uploads/sites/180/2020/12/Corpora-1024×398.png

The corpus comprises four genres, selected for the analysis as the most prototypical and representative of EU communication:

legal acts,
judgments,
administrative reports,
institutional websites.

The genres differ in terms of their function, discourse communities, modalities and translation arrangements:

Source: Biel, Koźbiał, and Wasilewska (2019).

The corpora cover the five-year period 2011-2015, except for the websites, which were collected as of 2015/2016. The Polish and corresponding English files were downloaded manually from November 2015 to early January 2016 and supplemented in June 2016.

The corpus of EU legislation is divided into two subcorpora: regulations (directly binding and applicable in member states) and directives (binding as to the objective to be achieved). Polish and corresponding English files were downloaded in HTML from the EUR-Lex database of legal acts with the following settings: links to the latest consolidated documents; browsing by subject and document type (directive, regulation). To ensure better comparability with Polish legal acts: (1) the files were sorted manually to exclude amending and repealing acts, corrigenda, and implementing and delegated directives; (2) only enacting (normative) terms were extracted for the analysis. The reference corpus of Polish legal acts includes Polish statutes (ustawa) adopted in 2011-2015 in their latest consolidated versions (in force as of 31 December 2015, excluding repealed acts), which were downloaded from Lex, the online database of Polish legislation run by Wolters Kluwer SA. This corpus excludes amending acts (since their content is reflected in the consolidated versions), acts ratifying international instruments and intergovernmental agreements, and repealing acts; it includes a sample of yearly budgetary acts.
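The sorting described above was done manually. Purely as an illustration, the same exclusion criteria for EU legal acts could be expressed as a filter over document metadata. This is a minimal sketch: the `LegalAct` record and its field names are hypothetical and do not correspond to EUR-Lex metadata fields, and it assumes that "implementing and delegated" qualifies directives only.

```python
from dataclasses import dataclass


# Hypothetical metadata record; the field names are illustrative only
# and are not taken from EUR-Lex.
@dataclass
class LegalAct:
    title: str
    doc_type: str  # "regulation" or "directive"
    is_amending: bool = False
    is_repealing: bool = False
    is_corrigendum: bool = False
    is_implementing: bool = False
    is_delegated: bool = False


def keep_for_corpus(act: LegalAct) -> bool:
    """Return True if the act passes the exclusion criteria."""
    # Only the two document types covered by the subcorpora are kept.
    if act.doc_type not in {"regulation", "directive"}:
        return False
    # Amending acts, repealing acts and corrigenda are excluded.
    if act.is_amending or act.is_repealing or act.is_corrigendum:
        return False
    # Assumption: "implementing and delegated" applies to directives only.
    if act.doc_type == "directive" and (act.is_implementing or act.is_delegated):
        return False
    return True


# Made-up sample records for demonstration.
acts = [
    LegalAct("Basic regulation", "regulation"),
    LegalAct("Amending regulation", "regulation", is_amending=True),
    LegalAct("Delegated directive", "directive", is_delegated=True),
    LegalAct("Basic directive", "directive"),
]

selected = [act.title for act in acts if keep_for_corpus(act)]
print(selected)
```

With the sample records, only the two basic acts remain in `selected`.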

The corpus of EU judgments includes judgments issued by two bodies of the Court of Justice of the European Union, i.e. the Court of Justice and the General Court, in the English and Polish language versions. The judgments were downloaded from the InfoCuria database; the selection covers closed cases (status as of 31 December 2015) with a delivery date from 1 January 2011 to 31 December 2015 that were published in the European Court Reports (ECR). The reference corpus of Polish judgments includes judgments (wyroki) issued by the Civil Chamber of the Supreme Court of the Republic of Poland and downloaded from the Supreme Court judgment database. See Koźbiał (2020) for a detailed description of the corpora of judgments.
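The selection window for EU judgments can be illustrated with a short date filter. This is only a sketch under stated assumptions: the record layout (`case`, `status`, `delivered`) and the sample entries are hypothetical, and the ECR publication criterion is left out because it cannot be derived from a date alone.

```python
from datetime import date

# Delivery-date window stated in the corpus description.
WINDOW_START = date(2011, 1, 1)
WINDOW_END = date(2015, 12, 31)


def is_eligible(record: dict) -> bool:
    """Keep closed cases delivered within the window (both ends inclusive)."""
    return (
        record["status"] == "closed"
        and WINDOW_START <= record["delivered"] <= WINDOW_END
    )


# Hypothetical records; the keys and values are illustrative only.
records = [
    {"case": "example-1", "status": "closed", "delivered": date(2010, 12, 31)},
    {"case": "example-2", "status": "closed", "delivered": date(2011, 1, 1)},
    {"case": "example-3", "status": "pending", "delivered": date(2013, 6, 1)},
    {"case": "example-4", "status": "closed", "delivered": date(2015, 12, 31)},
]

eligible = [r["case"] for r in records if is_eligible(r)]
print(eligible)
```

Both boundary dates are treated as inclusive, matching the "from 1.1.2011 to 31.12.2015" wording.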

Within the genre of reports, two corpora are compared: reports prepared by the Directorates-General (DGs) of the European Commission, downloaded from the register of Commission documents in the Polish and English language versions, and reports prepared by Polish ministries, downloaded from the websites of the respective ministries. For better comparability, the texts in the PL Reports corpus are limited to those published by ministries whose areas of competence correspond to those of the DGs; reports by ministries such as the Ministry of National Defence or the State Treasury were therefore not included.

The last component of the Eurolect corpus is the corpus of official institutional websites for citizens. The EU corpus comprises the official websites of EU institutions (the European Commission, the European Parliament, the European Council/the Council of the European Union) and the official inter-institutional website of the European Union, EUROPA.eu, which is run by the European Commission and addressed to the general public. Only pages with a corresponding Polish translation were included. The reference corpus comprises websites of Polish institutions that functionally correspond to the chosen EU bodies: the Prime Minister’s Office (KPRM), the Sejm (lower house of the Parliament), seven Ministries (Internal Affairs, Digitalisation, Finance, Infrastructure and Construction, Environment, Health, Maritime Economy and Inland Navigation), the Obywatel (‘Citizen’) website, and the Rodzina (‘Family’) and Rodzicielski (‘Parental’) portals run by the Ministry of Labour and Social Policy. The corpus comprises only sections presenting general information on the institutions and guidance addressed to citizens. The news sections and specialised texts posted on the websites were excluded during the compilation of the corpus.

Additional corpora were used to answer specific research questions:

The pre-accession versions (1999-2000) of Polish texts for each genre to study the Europeanisation of the Polish legal language;
The micro-diachronic corpus of the pre-accession Polish Eurolect (1999-2003) to study the institutionalisation and evolution of the Polish Eurolect;
The parallel English-Polish corpus to study terminological variation (Biel and Koźbiał 2020);
A corpus of judgments according to the language of the case (Müller forthcoming).

In order to avoid the “difference mindset” (Baker 2010, 153), the study also uses a large general reference corpus of contemporary Polish: a balanced version of the National Corpus of Polish (NKJP). It functions as a representative sample of contemporary Polish and a benchmark for the interpretation of translation data.

The corpus study was conducted in WordSmith Tools 7.0 (Scott 2016) and Sketch Engine (Kilgarriff et al. 2014).

References

Baker, Paul. 2010. Sociolinguistics and Corpus Linguistics. Edinburgh: Edinburgh University Press.

Biel, Łucja. 2016. “Mixed corpus design for researching the Eurolect: a genre-based comparable-parallel corpus in the PL EUROLECT project.” In Polskojęzyczne korpusy równoległe. Polish-language Parallel Corpora, edited by Ewa Gruszczyńska and Agnieszka Leńko-Szymańska, 198-208. Warszawa: Instytut Lingwistyki Stosowanej.

Biel, Łucja, and Dariusz Koźbiał. 2020. “How do translators handle (near-) synonymous legal terms? A mixed-genre parallel corpus study into the variation of EU English-Polish competition law terminology.” Estudios de Traducción 10:69-90. doi: https://dx.doi.org/10.5209/estr.68054.

Biel, Łucja, Dariusz Koźbiał, and Katarzyna Wasilewska. 2019. “The formulaicity of translations across EU institutional genres: A corpus-driven analysis of lexical bundles in translated and non-translated language.” Translation Spaces 8 (1):67-92. doi: https://doi.org/10.1075/ts.00013.bie.

Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. 2014. “The Sketch Engine: ten years on.” Lexicography 1:7-36.

Koźbiał, Dariusz. 2020. The Language of EU and Polish Judges: Investigating Textual Fit through Corpus Methods. Berlin: Peter Lang.

Müller, Dariusz. forthcoming. “The EU melting pot of languages: how the language of the case (English, French, Polish) influences the language of the CJ’s Polish Judgments.” East Journal of Translation.

Scott, Mike. 2016. WordSmith Tools version 7. Stroud: Lexical Analysis Software.

Download the corpus

The Polish Eurolect corpus is freely available at:

https://www.ils.uw.edu.pl/wp-content/uploads//2020/06/PL_Eurolect_corpora.zip.