Paste number 28840: word-frequency count

Pasted by: xahlee
When:12 years, 9 months ago
URL: http://paste.lisp.org/+M94
Channel:#emacs
Paste contents:
# -*- coding: utf-8 -*-
# Python

# purpose: compile the frequency of words in a file
# 2006-10-27
#    Xah
#   xah@xahlee.org
# ∑ http://xahlee.org/ 

# read in a file
# delete comment lines
# break it into a list wordlist, separated by space or any delimiters
# for each in wordlist, add it into a hash key, and if it already exists, increase by 1

import os,sys,shutil,re, operator

mydir = '/Applications/Emacs.app/Contents/Resources/share/emacs/22.0.50/lisp'

minLevel=1; # files and dirs of mydir are level 1.
maxLevel=1; # inclusive

# keys are words, vals are number of occurrences
wordFreq={}

def countMe(filePath):
   "add words frequency into wordFreq"

   print 'reading:', filePath
   inF = open(filePath,'rb')
   s=unicode(inF.read(),'iso-8859-1')
   inF.close()

   s=s.splitlines()
   # re.match anchors at the start of the line, so only whole-line ; comments are dropped
   s=filter(lambda x: not re.match(r"\s*;", x), s)
   s=','.join(s)

   wordlist = re.split(r'[() ,]+',s)
   for wd in wordlist:
        if not wd: # re.split can yield '' at either end; don't count it as a word
            continue
        wordFreq[wd]=wordFreq.get(wd,0)+1

def getInto(dummy, curdir, filess):
   curdirLevel=len(re.split('/',curdir))-len(re.split('/',mydir))
   filessLevel=curdirLevel+1
   if minLevel <= filessLevel <= maxLevel:
      for child in filess:
         if 'xml.el' == child:
#         if re.search(r'\.el$',child,re.U) and os.path.isfile(curdir+'/'+child):
             countMe(curdir+'/'+child)

os.path.walk(mydir, getInto, 'dummy')

for k,v in sorted(wordFreq.iteritems(), key=operator.itemgetter(1) ,reverse=True):
#    print k.encode('utf-8'),v
    print k,v

#perl -wlne'$h{$&}++while/\w+/g}{print"$f\t$w"while($w,$f)=each%h'
#perl -pe '$a{$_}++for+split}{$_=join"\n",sort{$a{$b}<=>$a{$a}}keys%a' ?

Annotations for this paste:

Annotation number 1: problems with doc string
Pasted by: xahlee
When:12 years, 9 months ago
URL: http://paste.lisp.org/+M94/1
Paste contents:
The problem with this script is that it doesn't filter out doc strings.

That's going to be hard to work around.

Unless one rewrites this in Lisp, since Lisp has a reader that understands Lisp code and can easily filter out the doc strings.

But I don't know Lisp well enough to tackle it. (I can, but it will take hours and hours.)

Can anyone help out?
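
For illustration, here is a rough Python 3 sketch of one way to approximate the filtering without a real Lisp reader: blank out string literals (which covers doc strings) and `;` comments with a regex, then count only symbols that directly follow an opening paren. The names `STRING_OR_COMMENT`, `HEAD_SYMBOL` and `count_heads` are made up for this example and are not part of the original script. It is only an approximation: elisp character literals such as `?"` or `?;` would confuse it, and it still counts arglist entries like `(x)` as if they were calls.

```python
import re
from collections import Counter

# Either a double-quoted string (backslash escapes allowed) or a ; comment to end of line.
# The leftmost match wins, so a ; inside a string or a " inside a comment is handled.
STRING_OR_COMMENT = re.compile(r'"(?:\\.|[^"\\])*"|;[^\n]*', re.DOTALL)

# A symbol right after an opening paren, i.e. something in function position.
HEAD_SYMBOL = re.compile(r'\(\s*([^\s()\'"`,;]+)')


def count_heads(source):
    """Count symbols in function position, ignoring strings and comments."""
    stripped = STRING_OR_COMMENT.sub(' ', source)
    return Counter(HEAD_SYMBOL.findall(stripped))


if __name__ == '__main__':
    sample = (
        '(defun foo (x)\n'
        '  "Doc (string with (parens) ; and a semicolon."\n'
        '  ; (comment call)\n'
        '  (bar (baz x)))\n'
    )
    for name, n in count_heads(sample).most_common():
        print(name, n)
```

A real reader, as in the elisp annotation, is still the more reliable route; this only shows that the doc string text itself can be kept out of the counts.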

Annotation number 2: goal of the program
Pasted by: xahlee
When:12 years, 9 months ago
URL: http://paste.lisp.org/+M94/2
Paste contents:
My goal is to compile a report of the most frequently used functions in various languages.

In particular, I'm interested in doing this for elisp and Perl, but possibly also Python and Java.
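
As a rough idea of what such a report could look like, here is a minimal Python 3 sketch that counts tokens per file and merges the counts into a single top-N list. It follows the same steps as the script above (drop whole-line `;` comments, split on parens, whitespace and commas), but `count_words` and `report` are hypothetical names introduced for this example, and it assumes the text has already been read and decoded.

```python
import re
from collections import Counter


def count_words(text):
    """Count tokens in elisp-like text, skipping whole-line ; comments."""
    lines = [ln for ln in text.splitlines() if not re.match(r'\s*;', ln)]
    tokens = re.split(r'[()\s,]+', ' '.join(lines))
    # re.split leaves empty strings at the ends; drop them
    return Counter(t for t in tokens if t)


def report(counters, top=10):
    """Merge per-file Counters and return the `top` most common (word, count) pairs."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total.most_common(top)


if __name__ == '__main__':
    file_a = "(setq x 1)\n; comment mentioning setq\n(setq y x)"
    file_b = "(car x)"
    for word, n in report([count_words(file_a), count_words(file_b)], top=2):
        print(word, n)
```

Keeping one `Counter` per file and merging at the end makes it easy to produce both per-file and overall rankings from the same data.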

Annotation number 3: elisp version
Pasted by: bpalmer
When:12 years, 9 months ago
URL: http://paste.lisp.org/+M94/3
Paste contents:
;;;  Count the frequency of functions and things that look like them in elisp Programs
;;; Licensed under, say, BSD (with attribution requirement)

(defun freqcount-buffer (&optional buffer)
  (interactive)
  (unless buffer
    (setq buffer (current-buffer)))
  (with-current-buffer buffer
    (save-excursion
      (goto-char (point-min))
      (condition-case nil
          ;; read forms until the reader signals end-of-file; no cl `loop' or global `exp' needed
          (while t
            (freqcount-codewalk (read buffer)))
        (end-of-file nil)))))

(defvar freqcount-hashtable (make-hash-table))

(defun freqcount-incr (symbol)
  (unless (symbolp symbol)
    (error "%s is not a symbol" symbol))
  (puthash symbol (1+ (gethash symbol freqcount-hashtable 0)) freqcount-hashtable))

(defun freqcount-codewalk (expression)
  (cond
   ((atom expression)   ; nil, a variable, or a literal value ... ignore
    )
   ((listp (car expression))  ; head is itself a form: a lambda, or the first form of a body ... walk it, then the rest
    (freqcount-codewalk (car expression))
    (freqcount-codewalk (cdr expression)))
   ((not (symbolp (car expression)))  ; head is a number, string or other literal ... skip it, keep walking
    (freqcount-codewalk (cdr expression)))
   ((equal (car expression) 'quote)  ; count the quote, but do not descend into quoted data
    (freqcount-incr 'quote))
   ((equal (car expression) 'defun)
    (freqcount-incr 'defun)
    ; skip the name (second element) and arglist (third element), plus the docstring if there is one
    (if (stringp (nth 3 expression))
        (freqcount-codewalk (nthcdr 4 expression))
      (freqcount-codewalk (nthcdr 3 expression))))
   (t
    (freqcount-incr (car expression))
    (freqcount-codewalk (cdr expression)))))
