Extract Excel attributes to CSV

Posted by in Data, Design, GIS

This is a python solution to pull specific cells from a bunch of excel files into one CSV file. The source excel files have a consistent format – that is, a given cell in every excel file contains the same information.

I wrote this script becauseĀ I needed one compiled table with spatial information for geocoding but the information existed across dozens of excel files. This script was used to pull project information including name and location coordinates out of each individual project information excel file.

So, it was either open each excel file and copy the information needed to a new table that I could geocode, or write a script to pull the information into one table so that I could geocode them all at simultaneously. With dozens of excel files, I decided to script it.

The script depends on the xlrd and glob modules. The xlrd library can be downloaded here.

The script below is commented thoroughly so I wont go into much detail, comment if you have specific questions.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
#import modules needed
import glob
import xlrd
 
#target csv to wrtie lines to
text=r"C:\Users\brian\Desktop\xls\00_combined.csv"
 
#folder with source excel files
dir = r"C:\Users\brian\Desktop\xls"
 
# create list of excel files in source folder to pass to excel reader
xls = glob.glob(dir+"/*.xls*")
 
#create empty
linefull = []
 
#populates the linefull list with the attributes needed from each excel file - iterates through each excel file in the xls list
for path in xls:
  #new empty list for each excel file's attributes
  line = []
  #open the current excel file
  book = xlrd.open_workbook(path)
  #target the first sheet
  first_sheet = book.sheet_by_index(0)
  #list the rows corresponding to the data needed in each excel file, 0-based index, ie row 5 is 4
  rows = [9,20,21,22,44,45]
  #all values in same column (2 is C, A is 0, B is 1, C is 2)
  col = 2
  #one attribute is not in column 2, that attribute is pulled in the following line
  line.append(first_sheet.cell_value(38,8))
  #this row pulls the attribute for each cell in column 2 at each row in the rows list
  for row in rows:
    #pulls the attribute for the current row in the rows list
    line.append(first_sheet.cell_value(int(row),int(col)))
  #appends each list of attributes for the current excel file to the master linefull list
  linefull.append(line)
 
#this loop formats and writes each line of attributes from the linefull list
with open(text,"w") as text_file:
  #writes the column headers
  text_file.write("number,project,area,state,country,latitude,longitude")
  #loops through each list of attributes within the list of all excel file attributes
  for x in linefull:
    #new empty list that will contain each list of attributes in csv format
    newlist = []
    #loops through each individual attribute for each excel file's list of attributes
    for y in x:
      #formats the attribute as string
      z = str(y)
      #passes the string conversion of each attribute to the newlist
      newlist.append(z)
    #creates a comma separated string of attributes for the current excel file
    t = ",".join(newlist)
    #print to confirm formatting
    print t
    #variable to write a new line and the comma separated attributes to the csv file
    ln = "\n"+t
    #writes the line to the csv/text file defined at the beginning of the script
    text_file.write(ln)

Here’s the code without the comments, a little clearer to read the code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import glob
import xlrd
 
text=r"C:\Users\greer\Desktop\xls\00_combined.csv"
dir = r"C:\Users\greer\Desktop\xls"
xls = glob.glob(dir+"/*.xls*")
 
linefull = []
 
for path in xls:
  line = []
  book = xlrd.open_workbook(path)
  first_sheet = book.sheet_by_index(0)
  rows = [9,20,21,22,44,45]
  col = 2
  line.append(first_sheet.cell_value(38,8))
  for row in rows:
    line.append(first_sheet.cell_value(int(row),int(col)))
  linefull.append(line)
 
with open(text,"w") as text_file:
  text_file.write("number,project,area,state,country,latitude,longitude")
  for x in linefull:
    newlist = []
    for y in x:
      z = str(y)
      newlist.append(z)
    t = ",".join(newlist)
    print t
    ln = "\n"+t
    text_file.write(ln)

Thanks!

Leave a Reply