Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
806 views
in Technique[技术] by (71.8m points)

regex - how to use a regular expression to extract json fields?

Beginner RegExp question. I have lines of JSON in a textfile, each with slightly different Fields, but there are 3 fields I want to extract for each line if it has it, ignoring everything else. How would I use a regex (in editpad or anywhere else) to do this?

Example:

"url":"http://www.netcharles.com/orwell/essays.htm",
"domain":"netcharles.com",
"title":"Orwell Essays & Journalism Section - Charles' George Orwell Links",
"tags":["orwell","writing","literature","journalism","essays","politics","essay","reference","language","toread"],
"index":2931,
"time_created":1345419323,
"num_saves":24

I want to extract URL,TITLE,TAGS,

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
/"(url|title|tags)":"(("|[^"])*)"/i

I think this is what you're asking for. I'll provide an explanation momentarily. This regular expression (delimited by / - you probably won't have to put those in editpad) matches:

"

A literal ".

(url|title|tags)

Any of the three literal strings "url", "title" or "tags" - in Regular Expressions, by default Parentheses are used to create groups, and the pipe character is used to alternate - like a logical 'or'. To match these literal characters, you'd have to escape them.

":"

Another literal string.

(

The beginning of another group. (Group 2)

    (

Another group (3)

        "

The literal string " - you have to escape the backslash because otherwise it will be interpreted as escaping the next character, and you never know what that'll do.

        |

or...

        [^"]

Any single character except a double quote The brackets denote a Character Class/Set, or a list of characters to match. Any given class matches exactly one character in the string. Using a carat (^) at the beginning of a class negates it, causing the matcher to match anything that's not contained in the class.

    )

End of group 3...

    *

The asterisk causes the previous regular expression (in this case, group 3), to be repeated zero or more times, In this case causing the matcher to match anything that could be inside the double quotes of a JSON string.

)"

The end of group 2, and a literal ".

I've done a few non-obvious things here, that may come in handy:

  1. Group 2 - when dereferenced using Backreferences - will be the actual string assigned to the field. This is useful when getting the actual value.
  2. The i at the end of the expression makes it case insensitive.
  3. Group 1 contains the name of the captured field.

EDIT: So I see that the tags are an array. I'll update the regular expression here in a second when I've had a chance to think about it.

Your new Regex is:

/"(url|title|tags)":("("|[^"])*"|[("("|[^"])*"(,"("|[^"])*")*)?])/i

All I've done here is alternate the string regular expression I had been using ("(("|[^"])*)"), with a regular expression for finding arrays ([("("|[^"])*"(,"("|[^"])*")*)?]). No so easy to Read, is it? Well, substituting our String Regex out for the letter S, we can rewrite it as:

[(S(,S)*)?]

Which matches a literal opening bracket (hence the backslashes), optionally followed by a comma separated list of strings, and a closing bracket. The only new concept I've introduced here is the question mark (?), which is itself a type of repetition. Commonly referred to as 'making the previous expression optional', it can also be thought of as exactly 0 or 1 matches.

With our same S Notation, here's the whole dirty Regular Expression:

/"(url|title|tags)":(S|[(S(,S)*)?])/i

If it helps to see it in action, here's a view of it in action.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...