When there is &quot; in XML, how do I get the whole string?

I have an XML file:

<sst>
    <si><t>{&quot;key1&quot;:&quot;value1&quot;}</t></si>
    <si><t>{"key2":"value2"}</t></si>
</sst>

I read it using:

(require xml)

(print
 (xml->xexpr
  (document-element
   (read-xml (open-input-file "problem.xml")))))

I got this output:

'(sst () "\r\n    " (si () (t () "{" "\"" "key1" "\"" ":" "\"" "value1" "\"" "}")) "\r\n    " (si () (t () "{\"key2\":\"value2\"}")) "\r\n")

By default, a string containing &quot; is split into multiple items.
How can I get the first string (the one with key1) as a single string, like the second one (key2), without calling string-join on the pieces?

It has been many years since I parsed XML in Scheme, and that was in Kawa, but I will try to reply.
For value1, you can get at it with a combination of car (first) and cdr (rest).
For value2, I'm a bit puzzled, because I'm not sure it isn't a bug in the XML parser library (I ran into that already in Kawa, but this is Racket): in my opinion you should get deeper parsing, and I had expected value2 to be isolated, not inside a string with other data.
For now you have the solution for value1, but you will have to finish by splitting the string. A regular expression can help with that last step...

For now I just make do with

(string-join (cddr t) "")

It is inconvenient to have to add this line every time I parse XML.

Wait, can we take a step back? Where is this XML coming from? Looks like there's some quoting that needs to be undone before you parse this as XML.

Oh... interesting... No, actually, I do think the unquoting is being handled correctly. Yes, I think that the string-join here is appropriate.

You observe that it's inconvenient to be doing string-join, but I would also point out that it's massively inconvenient to ignore all of the newline-whitespace. The more fundamental problem here is that XML is a markup language, not a data encoding language.

But... it kind of looks like this used to be JSON, before it was XML. JSON is a way way nicer place to start from, is there any chance you can get ahold of the JSON that was maybe used to generate this XML?

I noticed that in Python (xml.dom.minidom) the whole string is returned directly.
I am not saying Python is right. Each language makes its own choices.

I also noticed that these two xexprs produce different XML:

(display-xml/content
 (xexpr->xml
  '(a quot "key1" quot)))

(display-xml/content
 (xexpr->xml
  '(a "\"key2\"" )))

output:

<a>&quot;
  key1&quot;
</a>
<a>
  "key2"
</a>

So it makes sense that read-xml produces different results for them, to tell them apart.

That is absolutely normal.

Just a comment: in many languages the backslash \ is used to disable the interpretation of the next character.

Since " is used to delimit strings in many languages, if you want this character inside a string you have to put a backslash before it.

As a consequence, xexpr->xml will convert "key1" to key1 and convert "\"key2\"" to "key2".

Similar to what @jbclements said:

This XML file is... interesting.

  • JSON style data got wrapped in XML, because reasons.

  • Furthermore, the item with &quot;s seems like an artifact of something encoding JSON for use in HTML, not really XML?

  • And that happened only for one <t> item, not the other.

In short, looks like data from the real world. :slight_smile:

You'll probably need some pass where you do a certain amount of checking, cleansing, and normalizing, either before or after read-xml. Alas this may grow over time as you discover new varieties of "creative expression".
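
One possible shape for such a normalizing pass, run after read-xml and xml->xexpr, is sketched below. The names merge-strings and merge-children are hypothetical, not part of the xml library, and the sketch assumes every element has an attribute list in second position (which is what xml->xexpr emits by default).

```racket
#lang racket/base

;; Hypothetical pass: walk an xexpr and merge runs of adjacent string
;; children into a single string. Assumes the (tag (attrs ...) child ...)
;; shape that xml->xexpr produces by default.
(define (merge-strings x)
  (cond
    [(pair? x)
     (list* (car x) (cadr x) (merge-children (cddr x)))]
    [else x]))   ; strings, numbers, symbols, cdata etc. pass through

(define (merge-children kids)
  (cond
    [(null? kids) '()]
    ;; Two strings in a row: glue them and look again.
    [(and (string? (car kids))
          (pair? (cdr kids))
          (string? (cadr kids)))
     (merge-children (cons (string-append (car kids) (cadr kids))
                           (cddr kids)))]
    [else (cons (merge-strings (car kids))
                (merge-children (cdr kids)))]))

(merge-strings
 '(sst () "\r\n    "
       (si () (t () "{" "\"" "key1" "\"" ":" "\"" "value1" "\"" "}"))
       "\r\n"))
;; => '(sst () "\r\n    " (si () (t () "{\"key1\":\"value1\"}")) "\r\n")
```

Running this once on the whole tree avoids repeating string-join at every place a <t> is read.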


As @jbclements mentioned, it would be even better if you could get the data as pure JSON. The Racket json module works well, in my experience.
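
Even when the JSON is stuck inside XML, the json module can take over once the text of a cell is back in one piece. A minimal sketch, using the fragments from the output earlier in the thread (the names pieces and cell are just for illustration):

```racket
#lang racket/base
(require json racket/string)

;; The children of the first <t>, as read-xml + xml->xexpr returned them.
(define pieces '("{" "\"" "key1" "\"" ":" "\"" "value1" "\"" "}"))

;; Join the fragments, then hand the resulting string to the json library.
(define cell (string->jsexpr (string-join pieces "")))

(hash-ref cell 'key1) ; => "value1"
```

string->jsexpr returns a hash with symbol keys for a JSON object, so the usual hash functions apply from there.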

(It would also be fine if the data were pure XML. Something like <t><key>KEY</key><value>VALUE</value></t> is verbose but consistent.)

thanks!

FYI:
This XML is from an unzipped xlsx file. As you know, an xlsx file is a package of XML files.
Some cells of a sheet are filled in with JSON strings.

Yes, I really have to learn JSON :disappointed_relieved: I did not recognize it in the XML sample; it's a mix of XML and JSON...

One way to understand why read-xml might work that way—though not necessarily whether it should!—is to imagine writing a parser. If you are parsing along in a text context and encounter &, it signals the start of an entity reference, so you might reasonably close out the pending parsed string before proceeding to parse the entity reference. When the ; closes the entity reference, you could recognize that &quot; is one of the predefined entities from the XML standard and helpfully represent it as "\"" rather than 'quot before resuming parsing in the text context. These design decisions definitely aren't ideal for every possible purpose—sometimes you might really prefer 'quot, and often a single string would be more convenient—but they are an understandable balance for a general-purpose library.
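
To make that thought experiment concrete, here is a toy model of such a text parser. It is emphatically not the xml library's actual code, just a sketch of the "close out the pending string at each &" design; toy-parse-text and predefined are made-up names, and unknown entities are simply left as their source text.

```racket
#lang racket/base

;; The five predefined XML entities, keyed by their source spelling.
(define predefined
  (hash "&quot;" "\"" "&amp;" "&" "&lt;" "<" "&gt;" ">" "&apos;" "'"))

;; Toy tokenizer: each entity reference and each run of ordinary text
;; becomes its own item; predefined entities are replaced by their character.
(define (toy-parse-text text)
  (for/list ([piece (in-list (regexp-match* #rx"&[a-z]+;|[^&]+" text))])
    (hash-ref predefined piece piece)))

(toy-parse-text "{&quot;key1&quot;:&quot;value1&quot;}")
;; => '("{" "\"" "key1" "\"" ":" "\"" "value1" "\"" "}")

(toy-parse-text "{\"key2\":\"value2\"}")
;; => '("{\"key2\":\"value2\"}")
```

The two results have the same shape as the children of the two <t> elements in the original output.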

I've done a lot of XML processing in Racket, and it works great, but you absolutely will need to do post-processing to get the data into a more convenient form to work with, especially if the XML is generated by something out of your control.

As your example highlights, the use of &quot; is entirely unnecessary except in attribute values, but it is well-formed, and some real-world encoders generate it, so you have to be prepared to handle it. At the extreme, a document could use numeric character references for every single character, so you need to be prepared to use integer->char, and <![CDATA[...]]> sections are wrapped in a special struct, even though, at the level of the XML infoset, these are all just interchangeable concrete syntaxes for character data/character information items.
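
A sketch of what "being prepared" might look like for the content of one element follows. The helper names content-item->string and content->string are hypothetical. It assumes numeric character references show up in the xexpr as integers, unexpanded named entities as symbols, and that cdata-string still includes the <![CDATA[ and ]]> delimiters.

```racket
#lang racket/base
(require racket/string xml)

;; Hypothetical helper: turn one item of xexpr content into plain text.
(define (content-item->string item)
  (cond
    [(string? item) item]
    ;; Numeric character references such as &#34; arrive as integers.
    [(exact-nonnegative-integer? item) (string (integer->char item))]
    ;; Named entities arriving as symbols; only the five predefined
    ;; XML names are handled in this sketch.
    [(symbol? item)
     (case item
       [(quot) "\""] [(amp) "&"] [(lt) "<"] [(gt) ">"] [(apos) "'"]
       [else (error 'content-item->string "unknown entity: ~a" item)])]
    ;; Assumption: the string field still carries the CDATA wrapper.
    [(cdata? item)
     (let ([s (cdata-string item)])
       (if (and (string-prefix? s "<![CDATA[") (string-suffix? s "]]>"))
           (substring s 9 (- (string-length s) 3))
           s))]
    ;; Comments, processing instructions and child elements are skipped.
    [else ""]))

(define (content->string items)
  (apply string-append (map content-item->string items)))

(content->string
 (list "{" 34 "key1" 'quot ":" (make-cdata #f #f "<![CDATA[\"v\"]]>") "}"))
;; => "{\"key1\":\"v\"}"
```

Whether child elements should be skipped, recursed into, or treated as an error depends on the application, so that last clause is the part most likely to need changing.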

You will need to handle whitespace. At a minimum, since the xml library isn't a "validating XML processor", it can't help with insignificant whitespace in element content, even if you have a DTD defining it as such. In practice, many uses of XML have semantics for whitespace that can't be expressed by a DTD anyway, like the way that, in HTML, whitespace is collapsed in most (but not all) elements.
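
For the simple case where you know which elements may only contain other elements, the xml library's eliminate-whitespace can drop the formatting whitespace before conversion to an xexpr. A minimal sketch, assuming (as in the sample from this thread) that <sst> and <si> never hold meaningful text; the inline src string is a cut-down stand-in for the real file:

```racket
#lang racket/base
(require xml)

(define src
  "<sst>\n    <si><t>a</t></si>\n    <si><t>b</t></si>\n</sst>")

(define doc (read-xml (open-input-string src)))

;; Drop whitespace-only text directly inside <sst> and <si>.
;; Text inside <t> is left alone. Non-whitespace text inside a listed
;; element raises an error.
(define clean
  ((eliminate-whitespace '(sst si)) (document-element doc)))

(xml->xexpr clean)
;; => '(sst () (si () (t () "a")) (si () (t () "b")))
```

Note that eliminate-whitespace works on the element structs returned by read-xml, so it has to run before xml->xexpr.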

Similarly, IMHO the biggest weakness of XML is that there aren't really built-in semantics for any datatype but strings and elements. For a specific application of XML, you may want to handle, say, elements for which the order of the children doesn't matter, or you may want to parse the character data of specific elements or attributes as JSON, numbers, booleans, etc.

But you have to do most, if not all, of these things with any generic XML library to work with some specific application of XML. While there are a few things I might change in my personal ideal XML library, the biggest win is getting from XML to lists, symbols, and strings, which Racket then gives you many fantastic tools to transform as you like.
