Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Read a CSV file with quotation marks and a regex column in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

These are the first two lines of the file. The first line holds the column names. Fields are separated by commas, and every value except the first is wrapped in quotation marks, which I think is what causes the trouble.

I am interested in the columns class and msg, so this output will suffice:

class              msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down

but I can also import all the columns and drop the ones I don't want later, that's no problem.

The data comes in a .csv file that was given to me. If I open this file in Excel, all the columns end up in a single one. I work in France, but I don't know which locale or encoding the file was created with (by the way, I'm not French, so I am not really familiar with those).

I tried with

df <- read.csv("file.csv", stringsAsFactors = FALSE)

and the data frame has the column names nicely separated, but the values are all in the first column

then with

library(readr)
df <- read_delim('file.csv',
                 delim = ",",
                 quote = "",
                 escape_double = FALSE,
                 escape_backslash = TRUE)

but this way the regex column gets split into two columns, so I lose the msg variable altogether.

With

library(data.table)
df <- fread("file.csv")

the msg variable is present but empty, and the ne variable contains both ne and class, separated by a comma. This is the best output so far, as I can manipulate it to get the desired one.

Another option is to load the file as a character vector with readLines and fix it there, but I am no expert with regexes, so I would be clueless. The file is also 300k lines long, so it would be hard to inspect by hand.

Both read.delim and fread give warning messages; I can include them if they would be useful.
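The readLines route mentioned above could be sketched roughly as follows. This is only an illustration under a loud assumption: that each data line of the real file is wrapped in one extra pair of quotes with the inner quotes doubled, which would match the symptoms described (header split correctly, values all in one column). The sample lines in `raw` are hypothetical and shortened, not taken from the actual file; on a real file the first step would be something like readLines("file.csv", n = 5) to check whether the assumption holds.

```r
# Hypothetical malformed input: the header is plain, but the data line is
# wrapped in one outer pair of quotes and its inner quotes are doubled.
# This layout is an assumption, not the asker's confirmed file content.
raw <- c(
  "ne,class,msg",
  "\"BOU2-P-2,\"\"tengigabitethernet\"\",\"\"line protocol, changed state to down\"\"\""
)

# Step 1: look at the raw lines before any parsing.
print(raw)

# Step 2: flag the lines that start and end with a double quote.
wrapped <- grepl('^".*"$', raw)

# Step 3: on those lines only, strip the outer quotes, then turn each
# doubled quote back into a single one.
fixed <- raw
unwrapped <- sub('^"(.*)"$', "\\1", raw[wrapped])
fixed[wrapped] <- gsub('""', '"', unwrapped, fixed = TRUE)

# Step 4: parse the repaired lines; read.csv accepts a character vector via `text`.
df <- read.csv(text = fixed, stringsAsFactors = FALSE)
print(df)
```

If the raw lines turn out not to be wrapped like this, `wrapped` is FALSE everywhere and the lines pass through unchanged, so the check in step 2 doubles as a cheap diagnostic.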

update:

using

library(data.table)
df <- fread("file.csv", quote = "")

gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.

question from:https://stackoverflow.com/questions/65920644/read-a-csv-file-with-quotation-marks-and-regex-r


1 Answer


I tried read.csv on the input you provided and had no problems; each column is accessible when subsetting. As for your other attempts, you're setting the quote option wrong: it needs to be "\"", because the double quote character has to be escaped, i.e. df <- fread("file.csv", quote = "\""). With read.csv on your example I definitely get a data frame with 1 row and 6 columns:

df <- read.csv("file.csv")
nrow(df)
# Number of rows
# > 1
ncol(df)
# Number of columns
# > 6

# Each column is accessible by name
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
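To make this reproducible without the original file, here is a minimal sketch that writes the two sample lines from the question to a temporary file, reads them with read.csv, and keeps only the class and msg columns that were asked for. The names `sample_lines`, `path` and `wanted` are illustrative, and the regex field is assumed to contain single backslashes; the real 300k-line file may be laid out differently from this sample.

```r
# The two sample lines from the question. The regex field is assumed to
# contain single backslashes (written as \\ inside R string literals).
sample_lines <- c(
  "ne,class,regex,match,event,msg",
  paste0(
    "BOU2-P-2,\"tengigabitethernet\",",
    "\"tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})\",",
    "\"4/2\",\"lineproto-5-updown\",",
    "\"%lineproto-5-updown: line protocol on interface ",
    "tengigabitethernet4/2, changed state to down\""
  )
)

# Write them to a temporary file so the example does not touch real data.
path <- tempfile(fileext = ".csv")
writeLines(sample_lines, path)

# read.csv uses the double quote as its quoting character by default,
# so the comma inside msg stays within that field.
df <- read.csv(path, stringsAsFactors = FALSE)

# Keep only the two columns of interest.
wanted <- df[, c("class", "msg")]
print(dim(wanted))
print(wanted$msg)
```

If this works on the sample but not on the real file, the difference is most likely in how the real lines are quoted, which is worth checking on the raw text before trying other reader options.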
