bottyan 2005.02.23

_Past lesson assignment_
Antal Laszlo: Egy uj magyar nyelvtan fele [Towards a New Hungarian Grammar] (one and a half chapters)
  "Tobbfele elemzes - tobbfele nyelvtan" [Several kinds of analysis - several kinds of grammar] - second half, from part two
  "A magyar jelzo harom nyelvtani koncepcio fenyeben" [The Hungarian attribute in the light of three grammatical conceptions]
---
Corpus linguistics is descriptive linguistics aided by computers and a corpus database.

Three main approaches to linguistic description.

Why are there different descriptions?
- structural ambiguities (e.g. the Hungarian participle)
- different persons - subjective approaches
- the aim of the description of the grammar (pedagogical/theoretical, etc.)

Three types of synchronic descriptions?
- traditional (from the Greeks)
- structural
- generative/transformational

Traditional: content and formal considerations are mixed. Discrepancies: "a verb has to do with action" - a notional definition, inherently flawed.

Structural: a formal, structural approach: which syntactic positions verbs can appear in. Syntagmatic and paradigmatic (morphological) criteria. There is circularity in this as well. At least semantic criteria are out.
Antal: "Structural grammars do not mirror the generative dynamics of the speaker."

In what sense are Generative Grammars "generative"? They try to generate sentences from a basic structure by applying certain transformation rules. They claim that any possible sentence can be produced by these rules.
The competence/performance distinction is at the heart of the methodology of Generative Grammar. Generative Grammar aims to describe competence, hence language errors, for example, are not part of its descriptive area. [Mostly. There is a generative grammar of performance as well.]

T-model: the architecture of GB grammars.

  DS (deep structure) -- but later they became sceptical, so: D-Structure (the starting level of the derivation)
        |
        v   rule: move alpha
        |
        v
  SS (surface structure) -- but then they renamed it S-Structure
       /  \
     PF    LF
  (phonetic form rules)  (logical form rules)

Logical ambiguity is cared for by LF rules.
LF is not connected to any other level. It is, however, connected with intonation, but that is suprasegmental and so not in the T-model.
  Everyone likes someone -- ∃x ∀y: y likes x
  Someone is liked by everyone -- ∀x ∃y: x likes y
The sound output is cared for by PF rules.
-- There is also Minimalist Generative Grammar, but we do not really know what it is. --

199X: Tompa Janos: "Mai magyar nyelvrendszer" [The Present-Day Hungarian Language System]. 2000: new book.
What is the definition in Hungarian? [???] A long description.

"The meaning of 'dog' doesn't bite." [That's where Hegel's objective idealism would differ.]
  a. Egy kutya megy az uton. ("A dog is walking on the road.")
  b. Egy feher kutya megy az uton. ("A white dog is walking on the road.")
'Feher' (white) modifies the meaning of the word 'kutya' (dog).

Frege. Meaning and Denotation. Sense and Reference. (Hungarian terms: jelentes es jelolet.) All signs have sense, but not all signs have reference.
Russell: e.g. "The present King of France is bald." 'The present King of France' has a sense but no reference.
Antal doesn't care about this. "A jelzo nem a jelzett szo jelenteset modositja." (The attribute does not modify the meaning of the word it qualifies.) But 'jelentes' (meaning) in Antal is neither sense nor reference. It is the rule which tells you under which circumstances you can use a term. This will be important when we look at corpus databases - this will be a distributional definition, which Bottyan Gergo and Antal will explicate.

_Computers work best if you are investigating the interface between lexis and syntax._

Gonosz vs. gonoszsag (evil vs. evilness), feher vs. feherseg (white vs. whiteness).

What do traditional grammars confuse in connection with the meaning of the sign? They often take the meaning for the referent.

Structural definition of attributes according to Antal:
"A nominalis mondatreszek alanyesetu nominalis bovitmenyei." (The nominative-case nominal modifiers of nominal constituents.)
  a. A kutya haza. ("the dog's house", possessor in the nominative)
  b. A kutyanak a haza. ("the dog's house", possessor marked with -nak)
'Kutyanak' is not an attribute according to the above definition, although it should be.

Generative definition of attributes. p.150.
"All structures can be modified into predicate modification structures." ----------- This is not a syntax course. The aim was to gather some characteristics of the other approaches, in order to show what problems corpus linguistics faces. Corpus linguistics is not (yet) claiming generality, as it is derivative, e.g. it has to look at the data first. In principle one could use corpus data to test the adequacy of other theories. Chomsky: I-language, E-language. E-language would be found in the corpus. This does not really interest Chomsky, as it does not directly reflect mental systems. Another thing Generative Grammar is not interested in: what structures are probable? Language variety is not an issue of Generative Grammarians, but it is of interest to liguists who take an empirical approach. Corpus linguists claim to have a Natural Science based research method. Examine data, form generalisations, form theories, test theories in data. Generative grammarians test according to intuition, CLs test according to a corpus. They are complimentary data collection methods. Based on corpus data you cannot tell what structures are NOT possible. NEXT TIME WE ARE IN COMPUTER ROOM. Read: corpus creation. 9 Qs. Ajtosi library find the text, or here at the porter's office. Magyar Nemzeti Szovegtar British National Corpus THE|END 2005.03.02 tagging - cleantextpolicy is like XML, dirtytext is like HTML Keyword-in-context Brown corpus - the pioneer database, USA Local grammar: e.g. "if the word form is adjacent to an article is cannot be a verb." -- the most difficult part of tagging "Cup of tea" - does not appear in the affirmative and in the negative BNC The Compleat Lexical Tutor - all kinds of resources 2005.03.09 Prievara Tibor - Computers and Language | Teaching Sulinet foreign language teacher material director 1. Internet is a virtual space, thus i use the Internet for as many purposes as i use physical space for. 2. 
- The essential basis of the acquisition of any language is to use it regularly for personal purposes (i.e. purposes whose primary aim is not language acquisition itself). So I would encourage language learners to use it any time they want, for their numerous special purposes.
3. - It is hugely distracting.

nicenet.org
  [username]                   [password]
  webtanar2, ..., webtanar14   webtanar2, ..., webtanar14

CONFERENCING: The big trick: a forum! [Wow!] "Incredible!" _Safe environment_ - not everyone has access.
LINKSHARING\WEBTANARCOURSE - 70 ideas on how to use the internet.
BBC Hot Topics
WEBFORM
WEBPAGE CREATION - you can make a webpage in 10 minutes!
D.FILM MOVIEMAKER (dfilm.com) - a way to have students present their home assignment (e.g. write-a-dialog): the student sends a link, you watch it online.
"Mini saga" project - a 50-word story. There are competitions in the US and England.

Vocabulary teaching was widely neglected up until the 80s; even Chomsky neglected it. Corpus linguistics changed this.
How much vocabulary is enough?
80 percent is only fossilised units that we retrieve.
Why do we use longer expressions than necessary? Because most of the time it is simply necessary to use long expressions, for various purposes.
---
Our aim is to speak fluently. 6-7 words per second is fluent talk. We retrieve a lot of unanalysed chunks.
We need to teach: high-frequency collocations. Vocabulary items that can be context-independent. Formulaic language.
Idioms are a marginal part of language.
Lexically dense text: written text.
Formulaic language sticks. Definition: anything which bypasses syntactic parsing. Formulaic language may be an indicator of fluency.
Textalizer - a free program.
THE|END

2005.04.06

Text Encoding Initiative
intro_xml_sgml.pdf
Trying out an XML editor
"Xaira" exe - beta tester registration - email - code - download
British National Corpus sample CD
Biber - editor of "Grammar of Spoken and Written English" - incredibly expensive, very long, but there is a student edition. 1/10 price.
Kende utca English language bookshop.
John Firth - invented the term "collocation". Active in the 50s. "The meaning of a word can be told from the company it keeps." [Cf. the Hungarian proverb "Embert baratjarol": you know a man by his friends.]
Biber would include grammaticality among the factors of registers.

2005.04.20

1999 Symposium, 1996 article.
CA - Contrastive Analysis
CIA - Contrastive Interlanguage Analysis
interlanguage - the language of non-native speakers
Computerised bilingual corpora - CBC
  1. translation corpus
  2. parallel corpus
--
Interlanguage vs. native language (English):
  of +    by -    but +    and -    so + (sentence-initial)    hu: , +
THE|END

_WORDNET_

WordNet - ontologies
Ontologies are conceptualisations where other conceptualisations are possible.
Ontologies are researched by cognitive linguistics.
Conceptual ontologies - have words that are not categorised and lexicalised (?)
ask.com
framenet - restricted but deep
wordnet - broad and not deep (est. 1995), Prof. Miller, wordnet.princeton.edu
ask.com
Mikrokosmos - built not for a semantic purpose but for translation: that means 8 semantically different meanings of "window" are not listed separately if they are all translated as "ablak"
synsets - the default units of WordNet

WordNet application: "The dog ran into the room. The animal jumped on the sofa." WordNet points out that "dog" and "animal" have the same reference.

EuroWordnet
www.rrz.uni-hamburg.de -- an online database of French and German metaphors. Aim: "Study the feasibility of a systematic metaphor representation, including relations.." Using Wordnets.

  Source domain        Target domain
  FAMILY        <>     PARTY
THE|END

2005.05.04

"Contextual Effects in the Understanding/Disambiguation of Hungarian temporal -ig Constructions"
-tol (from), -ig (until)
pragmatix, semantix, structuralix

Bp-en a nyar szeptemberIG tart. ("In Budapest summer lasts until September.") - non-inclusive (September is the 1st month of autumn, not summer)
Bp-en a nyar augusztusIG tart. ("In Budapest summer lasts until August.") - inclusive
Boliviaban a nyar aprilisIG tart. ("In Bolivia summer lasts until April.")
- ambiguous
Until / up to - different meanings in British and American English

Uses the WebCONC corpus tool:
http://www.niederlendistik.fu-berlin.de/cgi-bin/web-conc.cgi
It searches Google results, and you can specify the language as well.

Az iskolai tanitas magyarorszagon pentekig tart. ("School teaching in Hungary lasts until Friday.")
*Az iskolai tanitas magyarorszagon szombatig tart. ("... lasts until Saturday.")
Az iskolai tanitas magyarorszagon szombatig tartott. ("... lasted until Saturday.")

I've to go now, sorry.
THE|END

2005.05.11

_BOTTYAN GERGO - SEGMENTATION_

Characters: control and formatting characters, punctuation, numeric, alphabetic.
Segmented languages vs. oriental languages: no sure word boundaries.
Punctuation marks are separate tokens.
Problems: when we use delimiters in an ambiguous way.
Full stops: end of an abbreviation AND/OR end of a sentence. End of sentence would then be a full stop followed by a capital letter. Still not adequate.

Tokenisation - A regular expression is a pattern that describes a set of strings. Operators, etc. PERL syntax [which is POSIX syntax]: operators + characters.

If there are 100 sentences and the program finds only 1, but that one is correct: precision 100% (Hungarian: pontossag), recall 1% (Hungarian: fedes).

Brown Corpus or Wall Street Corpus - ideal test area, as they are already tokenised. Definitely 100% correct. Partly human-edited.

_Hyphens_ are difficult. Token counts:
  New York-based    1 or 2 (2 if you want to search for it)
  self-assessment   1
  F16               1
End-of-line hyphens - these you have to remove first in order to tokenise, BUT there are end-of-line hyphens that are ALSO ordinary hyphens.
Also: co-, meta-, etc. - some people use hyphens, some do not.
Hyphenated Americans - [Indian-American, African-American, Hungarian-American, ...]
German linguists say that German compounds are one word, whereas English linguists require that English compounds become widely used before they are treated as 1 word.

Apostrophe - many things. The loss of the second token: haven't. O'Brian - part of the token. French too: articles are abbreviated: l'avion.

Oriental languages - there are no token boundaries. There are multi-character words.
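
The segmentation points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token pattern, the function names (`tokenise`, `split_sentences`, `precision_recall`) and the sample strings are not from the lecture, and Python's `re` syntax stands in for the Perl syntax mentioned in class. The sketch keeps internal hyphens and apostrophes inside a token, applies the naive "full stop followed by a capital letter" rule, and redoes the precision/recall arithmetic.

```python
import re

# Assumed token pattern (not from the lecture): a word may contain
# internal hyphens or apostrophes (York-based, haven't, O'Brian, l'avion);
# every other non-space character is a separate token.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Return the list of tokens found in text."""
    return TOKEN_RE.findall(text)


# Naive rule from the notes: a sentence ends at a full stop that is
# followed by whitespace and a capital letter.
SENTENCE_END_RE = re.compile(r"(?<=\.)\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at every 'full stop + capital letter' boundary."""
    return SENTENCE_END_RE.split(text)


def precision_recall(found, gold):
    """Precision = correct / found, recall = correct / gold."""
    found, gold = set(found), set(gold)
    correct = len(found & gold)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


print(tokenise("He hasn't seen O'Brian."))
# ['He', "hasn't", 'seen', "O'Brian", '.']

# The abbreviation "Dr." is wrongly treated as a sentence end,
# which is why the rule is "still not adequate".
print(split_sentences("It was Dr. Smith. He left."))
# ['It was Dr.', 'Smith.', 'He left.']

# 100 sentences in the gold standard, 1 found, and that one is correct.
print(precision_recall(found={0}, gold=range(100)))
# (1.0, 0.01)
```

With this pattern "New York-based" comes out as two tokens ("New" and "York-based"), so a search for the whole expression still has to look across a token boundary.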
In these scripts proper names are written the same way as other words.
...
THE|END (i've to go now)

2005.05.18

Special use corpora.

_Translation Memory_ - store your parallel sentences if you are a translator, and search them for terminology.
Source: parallel texts with segmentation.
M$ Access & another program.

M$ Word:
- Delete pictures.
- Convert charts into text.
- Change all the sentence-ending marks to paragraph marks. All the semicolons, question marks and exclamation marks as well.
- Convert the text to a table: the cell separator is the paragraph mark (line break).
How to arrange the two tables alongside each other?

_M$ Access_ - new database, create tables. Type of cells: "Notes". Two columns: English; Hungarian. Cut'n'paste from M$ Word.
How to align the two tables in parallel? -- There's no method, you have to do it by hand!
This gives us a database that is searchable, etc.
----------------------
_Wordfisher_
Wordfisher is an add-on for M$ Word. There is a "quick alignment" icon.
-----------------------
_TRADOS_
Commercial program - demo version available. Now at version 7.0.
Two versions: freelance and for translation services - different licenses.
WinAlign: the alignment subprogram.
When you download it you can choose your languages.
Supports xml, html, doc, xls, etc.
Project: hu/eng. Add docs to each side.
...
THE|END
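
The Word + Access procedure for a translation memory can be approximated in a short Python sketch. It is an illustration, not the method shown in class: `sqlite3` stands in for M$ Access, the table name and the function names (`segment`, `build_memory`, `search`) are invented, and pairing segments one-to-one by position is an assumption that real texts often break, which is why alignment had to be done by hand. The two sample sentences are the example sentences from the first session, with my own English rendering.

```python
import re
import sqlite3

# Segment boundary: whitespace after . ; ? or ! (the marks that are
# turned into paragraph marks in the Word step).
SEGMENT_END = re.compile(r"(?<=[.;?!])\s+")


def segment(text):
    """Split a text into sentence-like segments."""
    return [s for s in SEGMENT_END.split(text.strip()) if s]


def build_memory(english_text, hungarian_text):
    """Store position-aligned segment pairs in a two-column table."""
    en, hu = segment(english_text), segment(hungarian_text)
    if len(en) != len(hu):
        # Assumption: only one-to-one alignment is handled here.
        raise ValueError("segment counts differ; align by hand first")
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE memory (english TEXT, hungarian TEXT)")
    conn.executemany("INSERT INTO memory VALUES (?, ?)", list(zip(en, hu)))
    return conn


def search(conn, term):
    """Return every pair whose English or Hungarian side contains term."""
    pattern = "%" + term + "%"
    rows = conn.execute(
        "SELECT english, hungarian FROM memory "
        "WHERE english LIKE ? OR hungarian LIKE ? ORDER BY rowid",
        (pattern, pattern),
    )
    return rows.fetchall()


# Illustrative data only.
EN = "A dog is walking on the road. A white dog is walking on the road."
HU = "Egy kutya megy az uton. Egy feher kutya megy az uton."

memory = build_memory(EN, HU)
print(search(memory, "feher"))
# [('A white dog is walking on the road.', 'Egy feher kutya megy az uton.')]
```

Refusing to build the table when the segment counts differ is a deliberate choice in this sketch: a silent mismatch would pair every later sentence with the wrong translation.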