Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
527 views
in Technique[技术] by (71.8m points)

dataframe - What can R do about a messy data format?

Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.

I will post the dataset example here just in case the question is deleted.

+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+

As you can see this is not the right way to post data. As a user wrote in a comment,

It must've taken a bit of time to format the data the way you're showing it here. Unfortunately this is not a good format for us to copy & paste.

I believe this says it all. The asker is well intended and it took some work and time to try to be nice, but the result is not good.

What can R code do to make that table usable, if anything? Will it take a great deal of trouble?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Using data.table::fread:

x = '
+------------+------+------+----------+--------------------------+
|    Date    | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A    | A1   |        0 |                        0 |
| 2018-06-03 | A    | A2   |        0 |                        1 |
| 2018-06-03 | A    | A3   |        0 |                        2 |
| 2018-06-03 | A    | A4   |        1 |                        1 |
| 2018-06-03 | A    | A5   |        2 |                        1 |
| 2018-06-04 | A    | A6   |        0 |                        3 |
| 2018-06-01 | B    | B1   |        0 |                        1 |
| 2018-06-02 | B    | B2   |        0 |                        2 |
| 2018-06-03 | B    | B3   |        0 |                        3 |
+------------+------+------+----------+--------------------------+
'

fread(gsub('\+.+\n' ,'', x, perl = T), drop=c(1,7))

#          Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01    A   A1        0                        0
# 2: 2018-06-03    A   A2        0                        1
# 3: 2018-06-03    A   A3        0                        2
# 4: 2018-06-03    A   A4        1                        1
# 5: 2018-06-03    A   A5        2                        1
# 6: 2018-06-04    A   A6        0                        3
# 7: 2018-06-01    B   B1        0                        1
# 8: 2018-06-02    B   B2        0                        2
# 9: 2018-06-03    B   B3        0                        3

The gsub part removes the horizontal rules. drop removes the extra columns caused by delimiters at the line ends.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...