Research – Junghyun Min

Geography

Ever since my uncle bought me a pocket atlas in 2nd grade, I’ve loved to read, draw, and study maps. My early projects in geography (mostly cartography, actually) range from a geography forum that once had 2500+ members (2009) to substance-less maps depicting a fictional world (2010). My interest naturally drew me to a GIS course, where I compiled a study of Baltimore’s land use from the 1800s to the present (2014). My interest in maps culminated in a computational cartography project with Wolfram, where I trained computers to read maps (2016).

Waltham, 2016

Physics

Maps are models. They present a simplified version of the spherical Earth. In a similar sense, physicists also strive to describe astronomically complex natural phenomena in simple, elegant models. The joy I find in physics stems from the same source as the joy I find in geography.

I studied the muon and its atmospheric flux at Max-Planck-Institut für Kernphysik (2013), then went on to pursue a degree in physics, along with math, at the Johns Hopkins University. Some interesting projects at Hopkins include measuring galactic motion (2017) and quantum properties of graphene (2017).

Heidelberg, 2013

Computational linguistics

Join my longtime passion for language with a field in which I am given more credit than is due (B.A. in math), and we get computational linguistics. Broadly, I am interested in language representation in humans and machines. More specifically, my work at Johns Hopkins University’s Computation and Psycholinguistics Lab, with Tom McCoy and Tal Linzen, involves improving syntactic generalization abilities in language models.

Johns Hopkins University Cognitive Science

BERT fine-tuned on MNLI is unstable and vulnerable to syntactic heuristics (McCoy, Min, Linzen 2021).
Adversarial data augmentation via syntactic manipulation of training set data significantly increases both robustness to augmentation-like examples and general syntactic sensitivity (Min, McCoy, Das, Pitler, Linzen 2020).
Heuristics likely arise from both the pre-training and the fine-tuning datasets. The currently popular fine-tuning and evaluation paradigm has drawbacks that can be patched with longer fine-tuning on unbiased datasets, multi-seed out-of-distribution evaluation, and syntactic adversarial augmentation (Master’s thesis).
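
As a rough illustration of the syntactic augmentation idea in the list above, the sketch below builds one augmentation example by swapping the subject and object of a simple transitive premise. The function names, the (subject, verb, object) input format, and the label string are assumptions made for illustration; this is not the pipeline used in the cited work.

```python
def to_sentence(*parts):
    """Join phrase fragments into a capitalized sentence ending in a period."""
    text = " ".join(parts)
    return text[0].upper() + text[1:] + "."


def invert_subject_object(subject, verb, obj):
    """Build an NLI pair whose hypothesis swaps the premise's subject and object.

    Assumes a non-symmetric verb (e.g. "saw"), so the swapped sentence
    does not follow from the premise. Symmetric verbs such as "met"
    would need a different label.
    """
    return {
        "premise": to_sentence(subject, verb, obj),
        "hypothesis": to_sentence(obj, verb, subject),
        "label": "non-entailment",  # hypothetical label string
    }


example = invert_subject_object("the lawyer", "saw", "the doctor")
print(example)
```

A model that leans on a word-overlap heuristic would tend to call such a pair entailment, which is why examples of this shape are useful as training signal.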

Baltimore, 2019. Photo by Brian Leonard, CAP Lab

forus.ai

If you have some hot takes on whether AI and art mix well, I’ve developed an AI songwriter with forus.ai. We got an artist to rap an excerpt; you can listen to it here.

wecommit

If ChatGPT is so good, why don’t we delegate our paperwork? Accelerated by Antler, wecommit is our attempt at narrow-domain paperwork automation.

NCSOFT

At NCSOFT, I primarily worked on structure-related tasks like chunking, information extraction, salience measurement, and literary understanding. My projects included:

Natural language understanding, Language AI Lab

Concatenation of light embeddings yields a fast, light, and effective chunking system capable of processing up to 10k requests per second on less than 4GB of GPU memory (2021).
A set of 10 naïve rules applied to the training set significantly improves the performance of the resulting information extraction model (2021 – 22).

Financial language understanding, NLP Center

An ensemble of rule-based and transformer-based noise detection systems to improve open information extraction (Goldie and Min, 2022)
A simple TF-IDF based importance metric effectively ranks events within a temporal window (2022 – 23).
Large language models can be used to augment relation classification datasets and improve out-of-distribution, out-of-domain performance (Kim and Min, 2023)
Punctuation restoration as an unsupervised representation learning objective improves structure understanding (2023); tech blog (Korean), abstract
Modifications in loss and decoding can yield a structure prediction model from a natural language generation model (Lee, Min, Lee, and Lee 2023) arXiv
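
To make the TF-IDF based importance metric in the list above more concrete, here is a minimal sketch that scores each event description by the summed TF-IDF weight of its tokens and ranks events within one window. The whitespace tokenization, the smoothed IDF formula, and the choice to treat each event description as its own document are all assumptions for illustration, not the metric actually used in the project.

```python
import math
from collections import Counter


def rank_events(events):
    """Rank event descriptions in one temporal window by a TF-IDF score.

    Each event description is treated as a document. An event scores
    higher when its tokens are rare across the window (assumption:
    rarity within the window is a proxy for importance).
    """
    docs = [event.lower().split() for event in events]
    n_docs = len(docs)
    # Document frequency: in how many events each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))

    scored = []
    for event, doc in zip(events, docs):
        if not doc:
            scored.append((0.0, event))
            continue
        term_counts = Counter(doc)
        score = sum(
            (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[token])) + 1)
            for token, count in term_counts.items()
        )
        scored.append((score, event))

    # Highest score first; ties keep their original order.
    return [event for _, event in sorted(scored, key=lambda p: p[0], reverse=True)]


window = [
    "bank raises rates",
    "bank raises rates",
    "startup files bankruptcy",
]
ranked = rank_events(window)
print(ranked)
```

In this toy window, the event with tokens that appear only once ranks above the repeated one.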

Creative AI, Research & Innovation

Computational processing of literary devices and related baseline performances (2024)

Georgetown University Linguistics

In the fall of 2024, I started as a PhD student at Georgetown Linguistics.

Universal Dependencies compatibility updates to the Korean SNACS (Semantic Network of Adposition and Case Supersenses) dataset (Hwang et al. 2020)
Additions to the Georgetown University Multilayer corpus (Zeldes et al. 2017)

I like to think and talk about the roles of encoder representations, natural language understanding, and syntax, among other linguistic features, in the era of large autoregressive generative models. My past and current projects reflect my interest in human-like computational representations of structure in language.