Tuesday, 8 April 2014

python - append columns of a csv file to lists - Stack Overflow


I have a dataset stored as a tab-separated file. What I want to do is append some of that file's columns to separate lists.


The file I am reading looks roughly like this:


   temperature  station.id  latitude    longtitude  sea.distance    altitude

1 S7 0 4 0 75
2 S8 1 5 3 400
3 S8 1.5 2 4 80

Notice that the first column is the index value, with no header, while the second column, temperature, has no values.


Right now I am using csv.reader(infile, delimiter="\t") to read the file and append each item to a columns list, which has turned out to be wrong: it collects whole rows, not columns.


columns = []

for column in csv.reader(infile, delimiter="\t"):
    columns.append(column)

I have searched a bit and found several functions and approaches that might (or might not) do the trick, but I am not sure which one I should use. Any suggestions? Thanks in advance.


Edit: I think the result should look like this:


lat = [0,1,1.5]


That is, one list holding the latitude values.


Code so far:


#!/usr/bin/env python

import csv

columns = []

with open("/path/to/file/file.txt") as infile:
    for row in csv.reader(infile, delimiter="\t"):
        columns.append(row[1])

print(columns)

Edit2: print row gives this:


['', 'temperature', 'station.id', 'latitude', 'longtitude', 'sea.distance', 'altitude']
[]
['1', '', '', '', 'S7', '0', '', '4', '', '0', '', '75']
['2', '', '', '', 'S8', '1', '', '5', '', '3', '', '400']
['3', '', '', '', 'S8', '1.5', '', '2', '', '4', '', '80']



Try the following:


>>> with open("test.csv", "rb") as f:
...     latitudes = [x[5] for x in csv.reader(f, delimiter="\t") if x]
...
>>> latitudes
['0', '1', '1.5']

csv.reader iterates over the rows of your CSV file. The code takes the sixth item (index 5, since indexing begins at 0) from each row, provided the row is not empty (an empty list evaluates to False). It does so using a list comprehension. You could write that list comprehension as a regular for loop:


>>> latitudes = []
>>> for row in csv.reader(f, delimiter="\t"):
...     if row:
...         latitudes.append(row[5])
...
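The values collected this way are strings, and the header row still needs handling of its own. Here is a minimal sketch of one way to deal with both, assuming the rows look like the output shown under Edit2 (latitude at index 5 of every data row). It reads an in-memory sample rather than the real file, and the helper name read_latitudes is invented for illustration:

```python
import csv
import io

# Sample mimicking the rows printed under Edit2, extra tabs included.
SAMPLE = (
    "\ttemperature\tstation.id\tlatitude\tlongtitude\tsea.distance\taltitude\n"
    "\n"
    "1\t\t\t\tS7\t0\t\t4\t\t0\t\t75\n"
    "2\t\t\t\tS8\t1\t\t5\t\t3\t\t400\n"
    "3\t\t\t\tS8\t1.5\t\t2\t\t4\t\t80\n"
)


def read_latitudes(infile, index=5):
    """Return the field at `index` of every data row, converted to float."""
    reader = csv.reader(infile, delimiter="\t")
    next(reader, None)  # skip the header row
    # `if row` drops the empty list produced by the blank line.
    return [float(row[index]) for row in reader if row]


latitudes = read_latitudes(io.StringIO(SAMPLE))
print(latitudes)
```

With a real file, the same function could be handed the object returned by open(); the index of 5 only holds as long as the extra tabs stay where they are.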



EDIT: Your example data seems to contain a bunch of extra tabs. I've updated the answer to take this into account. You should fix your input file, though, or you will keep running into problems.
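If editing the file by hand is not practical, the extra tabs could also be dropped while reading. The sketch below is only one possible approach. It assumes that an empty field never carries meaning, and that temperature is always empty, so each cleaned data row holds the index followed by the remaining columns. The names clean_rows, names and records are hypothetical:

```python
import csv
import io

# Sample mimicking the rows printed under Edit2, extra tabs included.
SAMPLE = (
    "\ttemperature\tstation.id\tlatitude\tlongtitude\tsea.distance\taltitude\n"
    "\n"
    "1\t\t\t\tS7\t0\t\t4\t\t0\t\t75\n"
    "2\t\t\t\tS8\t1\t\t5\t\t3\t\t400\n"
    "3\t\t\t\tS8\t1.5\t\t2\t\t4\t\t80\n"
)


def clean_rows(infile):
    """Yield every non-blank row with its empty fields removed."""
    for row in csv.reader(infile, delimiter="\t"):
        fields = [field for field in row if field]
        if fields:
            yield fields


rows = list(clean_rows(io.StringIO(SAMPLE)))
header, data = rows[0], rows[1:]

# Assumption: temperature never has a value, so it is left out of the
# names, and the unnamed first column is labelled "index".
names = ["index"] + [name for name in header if name != "temperature"]
records = [dict(zip(names, row)) for row in data]

print([record["latitude"] for record in records])
```

This breaks as soon as a temperature value does appear, which is one more reason to repair the file itself.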


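If you sanitize your input file, you could turn the data into a pandas.DataFrame, which allows easy manipulation of and access to the CSV data. Here's an example (DataFrame.from_csv has since been removed from pandas; pandas.read_csv with sep="\t" and index_col=0 is the closest current equivalent):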


>>> data = pandas.DataFrame.from_csv("/tmp/test.csv", sep="\t")
>>> print data
index  temperature  station.id  latitude  longtitude  sea.distance  altitude

NaN            NaN         NaN       NaN         NaN           NaN       NaN
1              NaN          S7       0.0           4             0        75
2              NaN          S8       1.0           5             3       400
3              NaN          S8       1.5           2             4        80

[4 rows x 6 columns]

>>> data['latitude']
index
NaN    NaN
1      0.0
2      1.0
3      1.5
Name: latitude, dtype: float64



columns = []
for row in csv.reader(infile, delimiter="\t"):
    columns.append(row[1])  # here row[1] is the second column
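To fill several lists in one pass instead of one column per loop, the rows can be transposed with zip. This is a sketch on hypothetical, already cleaned input (a single tab between fields, no blank line, an "index" header added for the first column, and the empty temperature column left out), not on the file exactly as posted:

```python
import csv
import io

# Hypothetical cleaned input: one tab between fields, no blank line.
CLEAN = (
    "index\tstation.id\tlatitude\tlongtitude\tsea.distance\taltitude\n"
    "1\tS7\t0\t4\t0\t75\n"
    "2\tS8\t1\t5\t3\t400\n"
    "3\tS8\t1.5\t2\t4\t80\n"
)

reader = csv.reader(io.StringIO(CLEAN), delimiter="\t")
header = next(reader)

# zip(*rows) regroups the remaining rows into one tuple per column.
columns = {name: list(values) for name, values in zip(header, zip(*reader))}

lat = [float(value) for value in columns["latitude"]]
alt = [int(value) for value in columns["altitude"]]
print(lat, alt)
```

Keeping the columns in a dict keyed by header name avoids hard-coding positions such as row[1].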


