RH033 Unit8 Text Processing Tools

2009-03-29 14:36
Objectives

Upon completion of this unit, you should be able to:
- Use tools for extracting, analyzing and manipulating text data

Tools for Extracting Text

1) File contents: less and cat
2) File excerpts: head and tail
3) Extract by column: cut
4) Extract by keyword: grep

Viewing File Contents - less and cat

1) cat: dump one or more files to STDOUT
   - Multiple files are concatenated together
2) less: view a file or STDIN one page at a time
   - Useful commands while viewing:
       /text   searches for text
       n/N     jumps to the next/previous match
       v       opens the file in a text editor
   - less is the pager used by man

Viewing File Excerpts - head and tail

1) head: display the first 10 lines of a file
   - Use -n to change the number of lines displayed
2) tail: display the last 10 lines of a file
   - Use -n to change the number of lines displayed
   - Use -f to follow subsequent additions to the file
     Very useful for monitoring log files!
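
The head and tail options above can be tried on a throwaway file. The following is a minimal sketch; the 12-line file and its "entry N" contents are made up for the illustration.

```shell
#!/bin/sh
# Hypothetical 12-line log file (an assumption for this sketch).
log=$(mktemp)
i=1
while [ "$i" -le 12 ]; do
    echo "entry $i" >> "$log"
    i=$((i + 1))
done

# With no options, head and tail each print 10 lines.
default_head=$(head "$log" | wc -l | tr -d ' ')
default_tail_first=$(tail "$log" | head -n 1)

# -n changes the number of lines displayed.
first_three=$(head -n 3 "$log")
last_two=$(tail -n 2 "$log")

# cat concatenates every file it is given, in order.
both=$(cat "$log" "$log" | wc -l | tr -d ' ')

echo "head default: $default_head lines"
echo "tail default starts at: $default_tail_first"
echo "$first_three"
echo "$last_two"
echo "two copies: $both lines"
rm -f "$log"
```

tail -f is left out of the sketch because it keeps running until interrupted; on a real system it would typically be pointed at a growing log file.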

Extracting Text by Keyword - grep

1) Prints lines of files or STDIN where a pattern is matched
       grep 'john' /etc/passwd
       date --help | grep year
2) Use -i to search case-insensitively
3) Use -n to print line numbers of matches
4) Use -v to print lines not containing the pattern
5) Use -A x to include the x lines after each match
6) Use -B x to include the x lines before each match
7) Use -l to print only the names of the files that contain the pattern

Extracting Text by Column - cut

1) Displays specified columns of a file or STDIN data
       cut -d: -f1 /etc/passwd
       grep root /etc/passwd | cut -d: -f7
2) Use -d to specify the column delimiter (default is TAB)
3) Use -f to specify the column to print
4) Use -c to cut by characters
       cut -c2-5 /usr/share/dict/words
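
To see grep and cut working together without touching the real /etc/passwd, the sketch below builds a small file in the same colon-delimited format. The user names and field values are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample data in /etc/passwd format (an assumption for this sketch).
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
john:x:500:500:John Smith:/home/john:/bin/bash
mary:x:501:501:Mary Jones:/home/mary:/bin/tcsh
EOF

# Field 1 (user name) of every line, using ':' as the delimiter.
users=$(cut -d: -f1 "$sample")

# Field 7 (login shell) of the lines matching 'john'.
john_shell=$(grep 'john' "$sample" | cut -d: -f7)

# -i ignores case and -n prefixes each match with its line number;
# cut then keeps the line number and the user name.
mary_line=$(grep -in 'MARY' "$sample" | cut -d: -f1,2)

# -v inverts the match: count the lines whose shell is not bash.
non_bash=$(grep -v 'bash$' "$sample" | wc -l | tr -d ' ')

# -c cuts by character position: characters 2 to 4 of each line.
chars=$(cut -c2-4 "$sample" | head -n 1)

echo "$users"
echo "$john_shell"
echo "$mary_line"
echo "$non_bash"
echo "$chars"
rm -f "$sample"
```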

Tools for Analyzing Text

1) Text stats: wc
2) Sorting text: sort
3) Comparing files: diff and patch
4) Spell check: aspell

Gathering Text Statistics - wc (word count)

1) Counts words, lines, bytes and characters
2) Can act upon a file or STDIN
3) Use -l for only line count
4) Use -w for only word count
5) Use -c for only byte count
6) Use -m for character count (not displayed by default)

Sorting Text - sort

1) Sorts text to STDOUT - original file unchanged
       sort [options] file(s)
2) Common options
       -r     performs a reverse (descending) sort
       -n     performs a numeric sort
       -f     ignores (folds) case of characters in strings
       -u     (unique) removes duplicate lines in output
       -t c   uses c as a field separator
       -k x   sorts by c-delimited field x, can be used multiple times
       sort -t : -k 3 -n /etc/passwd

Eliminating Duplicate Lines - sort and uniq

1) sort -u: removes duplicate lines from input
2) uniq: removes duplicate adjacent lines from input
   - Use -c to count the number of occurrences
   - Use with sort for best effect:
       sort userlist.txt | uniq -c

Comparing Files - diff

1) Compares two files for differences
2) Use gvimdiff for a graphical diff
   - Provided by the vim-X11 package

Duplicating File Changes - patch

1) diff output stored in a file is called a "patchfile"
   - Use -u for a "unified" diff, best in patchfiles
2) patch duplicates changes in other files (use with care!)
   - Use -b to automatically back up the changed file
       diff -u foo.conf-broken foo.conf-works > foo.patch
       patch -b foo.conf-broken foo.patch

Spell Checking with aspell

1) Interactively spell-check files:
       aspell check letter.txt
2) Non-interactively list misspelled words in STDIN:
       aspell list < letter.txt
       aspell list < letter.txt | wc -l
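
The analysis tools in this part of the unit are most useful when chained in a pipeline. The sketch below combines wc, sort and uniq on a made-up list of login names; the file and its contents are assumptions for the illustration.

```shell
#!/bin/sh
# Hypothetical login list (an assumption for this sketch); names repeat out of order.
userlist=$(mktemp)
printf '%s\n' alice bob alice carol bob alice > "$userlist"

# wc -l counts lines; reading from STDIN keeps the file name out of the output.
total=$(wc -l < "$userlist" | tr -d ' ')

# sort -u sorts and drops duplicate lines in one step.
distinct=$(sort -u "$userlist" | wc -l | tr -d ' ')

# uniq alone only collapses adjacent duplicates, so on this unsorted
# input it removes nothing.
unsorted_uniq=$(uniq "$userlist" | wc -l | tr -d ' ')

# Sort first so duplicates become adjacent; uniq -c prefixes each line with
# its count, and the second sort (-r -n) ranks by that count, highest first.
top=$(sort "$userlist" | uniq -c | sort -rn | head -n 1 | tr -s ' ' | sed 's/^ //')

echo "total lines:    $total"
echo "distinct names: $distinct"
echo "uniq, unsorted: $unsorted_uniq"
echo "most frequent:  $top"
rm -f "$userlist"
```

Sorting before uniq is the point of the "use with sort for best effect" advice: without it, repeated names that are not next to each other are all kept.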

Tools for Manipulating Text - tr and sed

1) Alter (translate) characters: tr
   - Converts characters in one set to corresponding characters in another set
   - Only reads data from STDIN
       $ tr 'a-z' 'A-Z' < lowercase.txt
2) Alter strings: sed (stream editor)
   - Performs search/replace operations on a stream of text
   - Normally does not alter the source file
   - Use -i.bak to alter the source file in place and keep a backup with a .bak suffix
   - Flags appended to the s command:
       g   global: replaces every match on a line, not just the first
       i   case-insensitive match
       sed 's/cat/dog/' pets
       sed 's/cat/dog/gi' pets

Sed Examples

1) Quote search and replace instructions!
2) sed addresses
       sed 's/dog/cat/g' pets
       sed '1,50s/dog/cat/g' pets
         ### the replacement is performed only on lines 1 to 50
       sed '/digby/,/duncan/s/dog/cat/g' pets
         ### the replacement starts on the line that contains the string 'digby' and continues through the line that contains 'duncan'
3) Multiple sed instructions
       sed -e 's/dog/cat/' -e 's/hi/lo/' pets
       sed -f myedits pets
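
The difference between a plain substitution, the g flag, an address range and in-place editing is easier to see on a small file. The following is a minimal sketch; the file contents are invented for the example and only loosely follow the pets file used above.

```shell
#!/bin/sh
# Hypothetical input file (an assumption for this sketch).
pets=$(mktemp)
cat > "$pets" <<'EOF'
My dog chased another dog.
digby is a Dog.
A dog barked at duncan.
That dog slept.
EOF

# Without the g flag only the first match on each line is replaced.
first_only=$(sed 's/dog/cat/' "$pets" | head -n 1)

# With g every match on the line is replaced.
all=$(sed 's/dog/cat/g' "$pets" | head -n 1)

# Address range: substitute only from the line matching /digby/
# through the line matching /duncan/ (lines 2 and 3 here).
ranged=$(sed '/digby/,/duncan/s/dog/cat/g' "$pets")

# tr reads only STDIN, so the file is supplied with a redirect.
upper=$(tr 'a-z' 'A-Z' < "$pets" | head -n 1)

# -i.bak edits the file in place and keeps the original as <file>.bak.
sed -i.bak 's/dog/cat/g' "$pets"
backup_first=$(head -n 1 "$pets.bak")
edited_first=$(head -n 1 "$pets")

echo "$first_only"
echo "$all"
echo "$ranged"
echo "$upper"
echo "$backup_first"
echo "$edited_first"
rm -f "$pets" "$pets.bak"
```

Note that the capitalised "Dog" on the second line is never replaced, because the pattern is matched case-sensitively unless the i flag is added.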

Special Characters for Complex Searches

Regular Expressions

1) ^ represents beginning of line
2) $ represents end of line
3) Character classes as in bash:
       [abc], [^abc]
       [[:upper:]], [^[:upper:]]
4) Used by: grep, sed, less, others
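
The anchors and character classes listed above can be tested quickly with grep. The sketch below uses a made-up word list; the file and its entries are assumptions for the illustration.

```shell
#!/bin/sh
# Hypothetical word list (an assumption for this sketch).
words=$(mktemp)
printf '%s\n' cat Cat concat catalog bat 'cat 9' > "$words"

# ^ anchors the pattern to the beginning of the line.
starts=$(grep '^cat' "$words")

# $ anchors the pattern to the end of the line.
ends=$(grep 'cat$' "$words")

# Both anchors together match only lines that are exactly "cat";
# -c prints the number of matching lines instead of the lines.
exact=$(grep -c '^cat$' "$words")

# [bc] matches one character from the set.
set_match=$(grep '^[bc]at$' "$words")

# [^c] matches one character that is not in the set.
not_c=$(grep '^[^c]at$' "$words")

# A POSIX class is written inside a bracket expression.
upper_start=$(grep '^[[:upper:]]' "$words")
has_digit=$(grep '[[:digit:]]' "$words")

echo "starts with cat: $starts"
echo "ends with cat:   $ends"
echo "exactly cat:     $exact"
echo "[bc]at:          $set_match"
echo "[^c]at:          $not_c"
echo "upper-case start: $upper_start"
echo "contains digit:   $has_digit"
rm -f "$words"
```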

End of Unit 8

1) Questions and Answers
2) Summary
   - Extracting Text: cat, less, head, tail, grep, cut
   - Analyzing Text: wc, sort, uniq, diff, patch
   - Manipulating Text: tr, sed
   - Special Search Characters: ^, $, [abc], [[:alpha:]], [^[:alpha:]], etc.

POSIX character classes (written inside a bracket expression, e.g. [[:digit:]]):

[:digit:]   Only the digits 0 to 9.
[:alnum:]   Any alphanumeric character: 0 to 9, A to Z or a to z.
[:alpha:]   Any alphabetic character: A to Z or a to z.
[:blank:]   Space and TAB characters only.
[:xdigit:]  Hexadecimal digits: 0-9, A-F, a-f.
[:punct:]   Punctuation symbols: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:print:]   Any printable character, including space.
[:space:]   Any whitespace character (space, TAB, NL, FF, VT, CR). Many regular expression implementations abbreviate this as \s.
[:graph:]   Any printable character except space.
[:upper:]   Any uppercase letter A to Z.
[:lower:]   Any lowercase letter a to z.
[:cntrl:]   Control characters: NUL SOH STX ETX EOT ENQ ACK BEL BS TAB LF (NL) VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC IS4 IS3 IS2 IS1 DEL.
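
These classes also work as character sets for tr. The following is a minimal sketch using a made-up sentence; the text and the variable names are assumptions for the illustration.

```shell
#!/bin/sh
# Hypothetical input text (an assumption for this sketch).
text='Hello, World! Call 555-0100.'

# -d deletes every character in the set: here, all punctuation.
no_punct=$(printf '%s\n' "$text" | tr -d '[:punct:]')

# Translating [:lower:] to [:upper:] upper-cases the text.
upper=$(printf '%s\n' "$text" | tr '[:lower:]' '[:upper:]')

# -c complements the set, so -cd deletes everything except digits
# (the trailing newline is deleted as well).
digits=$(printf '%s\n' "$text" | tr -cd '[:digit:]')

echo "$no_punct"
echo "$upper"
echo "$digits"
```

With tr the class is given directly as the set, while grep and sed expect it inside an outer pair of brackets, as in [[:digit:]].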