Long ago, in Seventh Edition UNIX (a/k/a V7), there was a program called prep. Its primary use was to take files of text and break them up into one word per line, for further processing by other tools in a pipeline. It could do a little bit of other manipulation too, like telling you the location of each individual word within a file, ignoring specific words in an ignore list, or only paying attention to words specifically mentioned in an include list. It’s sort of difficult to explain exactly what it does, but here is a man page from 2.9BSD for it. It had an assortment of interesting uses – for example, building dictionaries, spell-checkers, and the like.
This program was rather short-lived. It only existed in V7 and a couple of offshoots (2.9BSD was basically an offshoot of V7). It didn’t exist earlier in V6, it was removed in V8, and it never even made it into 4.2BSD. It doesn’t exist (at least not in this form) in any Linux distribution that I’m aware of, nor in FreeBSD and friends. There was another program, deroff, that (as far as I am aware) also first appeared in V7. It was primarily for a completely different purpose, but it had a "-w" option that told it to do the "split up files into one word per line" thing, similar to prep, though it didn’t do any of prep’s other functions (like word numbering, include lists, and ignore lists). I assume that for purposes like dictionary building, deroff -w subsumed the function of prep. deroff was comparatively much longer-lived, but these days there doesn’t even seem to be a version of deroff packaged for any major Linux distribution: it’s not in any recent version of RHEL, not in Fedora 32, and not in Debian 10 (though I’m pretty sure it actually was in Debian until not that long ago).
So why did prep go away? Was it really because deroff -w duplicated most of its function? I presume that deroff has disappeared from current Linux distributions because people generally don’t deal with [nt]roff-formatted documents anymore, except maybe for man pages.

But with both of these tools gone, what can one use to do the "split up a text file into one word per line" function? Is there anything packaged for any modern Linux distro that would perform this function? (If you’re going to respond with "you can probably do this yourself with a simple script", I concede that is probably correct – but that is not the answer I’m looking for right now; I’m looking for a way to do this with some existing tool that already ships in modern Linux distributions…)

Ideally, I’d like to find something that implements all the features listed in the man page I linked (plus the "implied" behaviors that aren’t explicitly specified in the man page, like not considering punctuation to be part of a word, and how numbers that appear as part of a "word" are handled). 🙂 Practically, I don’t think the include and exclude lists are particularly crucial, and while I’d like to have the word numbering (it can sometimes be handy to know the location of a word in a file), it’s not that important. Handling of words hyphenated at the end of a line would be desirable.
Using Raku (formerly known as Perl6)
~$ raku -ne '.words.join("\n").put;' < file
Answered by jubilatious1 on November 28, 2020
It seems like tr -s ' ' '\n' < file ought to work for splitting a file to one word per line.
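A variant of the same idea that also treats tabs and existing newlines as separators, using tr’s POSIX [:space:] character class (punctuation still stays attached to words):

```shell
# Squeeze every run of whitespace (spaces, tabs, newlines)
# into a single newline, yielding one word per line.
tr -s '[:space:]' '\n' < file
```

So `one  two<TAB>three` becomes three lines: `one`, `two`, `three`.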
Answered by tim1724 on November 28, 2020